Parallelized tree-code for clusters of personal computers
NASA Astrophysics Data System (ADS)
Viturro, H. R.; Carpintero, D. D.
2000-02-01
We present a tree-code for integrating the equations of motion of collisionless systems, which has been fully parallelized and adapted to run on several PC-based processors simultaneously, using the well-known PVM message-passing library. SPH algorithms, not yet included, may be easily incorporated into the code. The code is written in ANSI C; it can be freely downloaded from a public ftp site. Simulations of collisions of galaxies are presented, with which the performance of the code is tested.
Parallel TREE code for two-component ultracold plasma analysis
NASA Astrophysics Data System (ADS)
Jeon, Byoungseon; Kress, Joel D.; Collins, Lee A.; Grønbech-Jensen, Niels
2008-02-01
The TREE method has been widely used for long-range interaction N-body problems. We have developed a parallel TREE code for two-component classical plasmas with open boundary conditions and highly non-uniform charge distributions. The program efficiently handles millions of particles evolved over long relaxation times requiring millions of time steps. Appropriate domain decomposition and dynamic data management were employed, and large-scale parallel processing was achieved using an intermediate level of granularity of domain decomposition and ghost TREE communication. Even though the computational load is not fully distributed in fine grains, high parallel efficiency was achieved for ultracold plasma systems of charged particles. As an application, we performed simulations of an ultracold neutral plasma with a half million particles and a half million time steps. For the long temporal trajectories of relaxation between heavy ions and light electrons, large configurations of ultracold plasmas can now be investigated, which was not possible in past studies.
FLY MPI-2: a parallel tree code for LSS
NASA Astrophysics Data System (ADS)
Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.
2006-04-01
New version program summary. Program title: FLY 3.1; Catalogue identifier: ADSC_v2_0; Licensing provisions: yes; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0; Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland; No. of lines in distributed program, including test data, etc.: 158 172; No. of bytes in distributed program, including test data, etc.: 4 719 953; Distribution format: tar.gz; Programming language: Fortran 90, C; Computer: Beowulf cluster, PC, MPP systems; Operating system: Linux, Aix; RAM: 100M words; Catalogue identifier of previous version: ADSC_v1_0; Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159; Does the new version supersede the previous version?: yes; Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force; Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986); Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block
An implementation of a tree code on a SIMD, parallel computer
NASA Technical Reports Server (NTRS)
Olson, Kevin M.; Dorband, John E.
1994-01-01
We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
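The tree construction described in this abstract (sort, then split recursively along x, y and z, with every node carrying a monopole) can be illustrated with a small serial sketch. This is not the authors' SIMD implementation: the names Node and build_balanced_tree, the cycling order of the axes, and the use of an ordinary serial sort in place of parallel sorts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass
class Node:
    """A tree node reduced to its monopole: total mass and centre of mass."""
    mass: float
    com: Vec
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced_tree(particles: List[Tuple[float, Vec]], depth: int = 0) -> Node:
    """Build a binary tree by sorting along one axis and halving the list.

    particles is a list of (mass, (x, y, z)). The split axis cycles
    x -> y -> z with depth (an assumed ordering). The tree is perfectly
    balanced when the particle count is a power of two.
    """
    if not particles:
        raise ValueError("need at least one particle")
    if len(particles) == 1:
        mass, pos = particles[0]
        return Node(mass, pos)
    axis = depth % 3
    ordered = sorted(particles, key=lambda p: p[1][axis])  # serial stand-in for a parallel sort
    half = len(ordered) // 2
    left = build_balanced_tree(ordered[:half], depth + 1)
    right = build_balanced_tree(ordered[half:], depth + 1)
    mass = left.mass + right.mass
    # Monopole of the parent: mass-weighted mean of the two child centres.
    com = tuple((left.mass * a + right.mass * b) / mass
                for a, b in zip(left.com, right.com))
    return Node(mass, com, left, right)
```

Because each level halves a sorted list, particles that end up as siblings are neighbours along the split axis, which is one way to read the abstract's remark about spatial locality.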
PENTACLE: Parallelized particle-particle particle-tree code for planet formation
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Oshino, Shoichi; Fujii, Michiko S.; Hori, Yasunori
2017-10-01
We have newly developed a parallelized particle-particle particle-tree code for planet formation, PENTACLE, which is a parallelized hybrid N-body integrator executed on a CPU-based (super)computer. PENTACLE uses a fourth-order Hermite algorithm to calculate gravitational interactions between particles within a cut-off radius and a Barnes-Hut tree method for gravity from particles beyond. It also implements an open-source library designed for full automatic parallelization of particle simulations, FDPS (Framework for Developing Particle Simulator), to parallelize a Barnes-Hut tree algorithm for a memory-distributed supercomputer. These allow us to handle 1-10 million particles in a high-resolution N-body simulation on CPU clusters for collisional dynamics, including physical collisions in a planetesimal disc. In this paper, we show the performance and the accuracy of PENTACLE in terms of $\tilde{R}_{\rm cut}$ and a time-step $\Delta t$. It turns out that the accuracy of a hybrid N-body simulation is controlled through $\Delta t / \tilde{R}_{\rm cut}$, and $\Delta t / \tilde{R}_{\rm cut} \sim 0.1$ is necessary to simulate accurately the accretion process of a planet for $\geq 10^6$ yr. For all those interested in large-scale particle simulations, PENTACLE, customized for planet formation, will be freely available from https://github.com/PENTACLE-Team/PENTACLE under the MIT licence.
FLY. A parallel tree N-body code for cosmological simulations
NASA Astrophysics Data System (ADS)
Antonuccio-Delogu, V.; Becciani, U.; Ferro, D.
2003-10-01
FLY is a parallel treecode which makes heavy use of the one-sided communication paradigm to handle the management of the tree structure. In its public version the code implements the equations for cosmological evolution, and can be run for different cosmological models. This reference guide describes the actual implementation of the algorithms of the public version of FLY, and suggests how to modify them to implement other types of equations (for instance, the Newtonian ones). Program summary. Title of program: FLY; Catalogue identifier: ADSC; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC; Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland; Computer for which the program is designed and others on which it has been tested: Cray T3E, Sgi Origin 3000, IBM SP; Operating systems or monitors under which the program has been tested: Unicos 2.0.5.40, Irix 6.5.14, Aix 4.3.3; Programming language used: Fortran 90, C; Memory required to execute with typical data: about 100 Mwords with 2 million-particles; Number of bits in a word: 32; Number of processors used: parallel program. The user can select the number of processors >=1; Has the code been vectorized or parallelized?: parallelized; Number of bytes in distributed program, including test data, etc.: 4615604; Distribution format: tar gzip file; Keywords: Parallel tree N-body code for cosmological simulations; Nature of physical problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force. Method of solution: It is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986). Restrictions on the complexity of the program: The program uses the leapfrog integrator schema, but could be changed by the user. Typical running time: 50 seconds for each time-step, running a 2-million-particles simulation on an Sgi Origin 3800 system with 8 processors having 512 Mbytes RAM for each processor. Unusual features of the program: FLY
Are you ready to FLY in the universe? A multi-platform N-body tree code for parallel supercomputers
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.
2001-05-01
In the last few years, cosmological simulations of structure and galaxy formation have assumed a fundamental role in the study of the origin, formation and evolution of the universe. These studies improved enormously with the use of supercomputers and parallel systems, which allow more accurate simulations than traditional serial systems. The code we describe, called FLY, is a newly written code (using the tree N-body method) for the evolution of three-dimensional self-gravitating collisionless systems. FLY is a fully parallel code based on the Barnes-Hut tree algorithm, and periodic boundary conditions are implemented by means of the Ewald summation technique. We use FLY to run simulations of the large-scale structure of the universe and of clusters of galaxies, but it could usefully be adopted to evolve other systems based on a tree N-body algorithm. FLY is based on the one-sided communication paradigm to share data among the processors, which access remote private data while avoiding any kind of synchronism. The code was originally developed on the CRAY T3E system using the logically SHared MEMory access routines (SHMEM), but it also runs on SGI ORIGIN systems and on the IBM SP by using the Low-Level Application Programming Interface routines (LAPI). This new code is the evolution of preliminary codes (WDSH-PT and WD99) for cosmological simulations that we implemented in recent years, and it reaches very high performance on all systems where it has been tested. This performance allows us today to consider FLY among the most powerful parallel codes for tree N-body simulations. The performance that FLY reaches is reported and discussed, and a preliminary comparison with other similar codes is made. FLY version 1.1 is freely available at http://www.ct.astro.it/fly/ and it will be maintained and upgraded with new releases.
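As a reminder of what "based on the Barnes-Hut tree algorithm" means in practice, the following minimal serial sketch walks a tree and accepts a cell as a single monopole when its size is small compared with its distance. The dictionary layout of a node, the function name accel_from_tree, the opening parameter theta and the softening eps are illustrative assumptions; FLY's own opening criterion, its Ewald correction for periodic boundaries and its one-sided communication are not shown.

```python
import math


def accel_from_tree(node, pos, theta=0.5, G=1.0, eps=0.0):
    """Acceleration at pos from a tree of cells (generic Barnes-Hut walk).

    node is a dict with keys 'mass', 'com' (x, y, z), 'size' (cell edge)
    and 'children' (empty list for a leaf). A cell is used as a monopole
    when size < theta * distance; otherwise its children are visited.
    """
    dx = [c - p for c, p in zip(node["com"], pos)]
    d = math.sqrt(sum(x * x for x in dx))
    if not node["children"] or node["size"] < theta * d:
        if d == 0.0:
            return (0.0, 0.0, 0.0)  # a particle exerts no force on itself
        f = G * node["mass"] / (d * d + eps * eps) ** 1.5
        return tuple(f * x for x in dx)
    total = [0.0, 0.0, 0.0]
    for child in node["children"]:
        a = accel_from_tree(child, pos, theta, G, eps)
        total = [t + ai for t, ai in zip(total, a)]
    return tuple(total)
```

A smaller theta opens more cells and approaches direct summation; a larger theta is faster but less accurate.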
A Modified Parallel Tree Code for N-Body Simulation of the Large-Scale Structure of the Universe
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.; Gambera, M.
2000-09-01
N-body codes for performing simulations of the origin and evolution of the large-scale structure of the universe have improved significantly over the past decade in terms of both the resolution achieved and the reduction of the CPU time. However, state-of-the-art N-body codes hardly allow one to deal with particle numbers larger than a few $10^7$, even on the largest parallel systems. In order to allow simulations with higher resolution, we have first reconsidered the grouping strategy as described in J. Barnes (1990, J. Comput. Phys. 87, 161) (hereafter B90) and applied it with some modifications to our WDSH-PT (Work and Data SHaring-Parallel Tree) code (U. Becciani et al., 1996, Comput. Phys. Comm. 99, 1). In the first part of this paper we will give a short description of the code adopting the algorithm of J. E. Barnes and P. Hut (1986, Nature 324, 446) and in particular the memory and work distribution strategy applied to describe the data distribution on a CC-NUMA machine like the CRAY-T3E system. In very large simulations (typically $N \geq 10^7$), due to network contention and the formation of clusters of galaxies, an uneven load easily arises. To remedy this, we have devised an automatic work redistribution mechanism which provided a good dynamic load balance without adding significant overhead. In the second part of the paper we describe the modification to the Barnes grouping strategy we have devised to improve the performance of the WDSH-PT code. We will use the property that nearby particles have similar interaction lists. This idea has been checked in B90, where an interaction list is built which applies everywhere within a cell $C_{\rm group}$ containing a small number of particles $N_{\rm crit}$. B90 reuses this interaction list for each particle $p \in C_{\rm group}$ in the cell in turn. We will assume each particle p to have the same interaction list. We consider that the force $F_p$ acting on a particle p can be decomposed into two terms, $F_p = F_{\rm far} + F_{\rm near}$. The first term $F_{\rm far}$ is the same for
A New Parallel Code Based on PVM
NASA Astrophysics Data System (ADS)
Xu, Guohong
1994-05-01
We have developed a new parallel code for solving purely gravitational problems by combining PM methods and TREE methods to achieve both high spatial resolution and high mass resolution. Very preliminary results will be shown to demonstrate the potential accuracy which the new code can reach. As a first application of the code, we tried to calculate the density profile and velocity dispersion of clusters of galaxies. Further work will be done to include hydrodynamics in the code. Very high computational efficiency is achieved by application of PVM (Parallel Virtual Machine) techniques in the code to configure many workstations into a virtual machine.
Efficient tree codes on SIMD computer architectures
NASA Astrophysics Data System (ADS)
Olson, Kevin M.
1996-11-01
This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ~1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.
Parallelization of the SIR code
NASA Astrophysics Data System (ADS)
Thonhofer, S.; Bellot Rubio, L. R.; Utz, D.; Jurčak, J.; Hanslmeier, A.; Piantschitsch, I.; Pauritsch, J.; Lemmerer, B.; Guttenbrunner, S.
A high-resolution 3-dimensional model of the photospheric magnetic field is essential for the investigation of small-scale solar magnetic phenomena. The SIR code is an advanced Stokes-inversion code that deduces physical quantities, e.g. magnetic field vector, temperature, and LOS velocity, from spectropolarimetric data. We extended this code by the capability of directly using large data sets and inverting the pixels in parallel. Due to this parallelization it is now feasible to apply the code directly on extensive data sets. Besides, we included the possibility to use different initial model atmospheres for the inversion, which enhances the quality of the results.
PARAVT: Parallel Voronoi tessellation code
NASA Astrophysics Data System (ADS)
González, R. E.
2016-10-01
In this study, we present a new open source code for massively parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is intended for astrophysical purposes, where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes; however, no open source, parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI, and the VT is computed using the Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes the neighbor list, Voronoi density, Voronoi cell volume and density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.
NASA Astrophysics Data System (ADS)
Gritschneder, M.; Naab, T.; Burkert, A.; Walch, S.; Heitsch, F.; Wetzstein, M.
2009-02-01
We present a three-dimensional, fully parallelized, efficient implementation of ionizing ultraviolet (UV) radiation for smoothed particle hydrodynamics (SPH) including self-gravity. Our method is based on the SPH/TREE code VINE. We therefore call it iVINE (for Ionization + VINE). This approach allows detailed high-resolution studies of the effects of ionizing radiation from, for example, young massive stars on their turbulent parental molecular clouds. In this paper, we describe the concept and the numerical implementation of the radiative transfer for a plane-parallel geometry and we discuss several test cases demonstrating the efficiency and accuracy of the new method. As a first application, we study the radiatively driven implosion of marginally stable molecular clouds at various distances from a strong UV source and show that they are driven into gravitational collapse. The resulting cores are very compact and dense, exactly as observed in clustered environments. Our simulations indicate that the time of triggered collapse depends on the distance of the core from the UV source. Clouds closer to the source collapse several $10^5$ yr earlier than more distant clouds. This effect can explain the observed age spread in OB associations where stars closer to the source are found to be younger. We discuss possible uncertainties in the observational derivation of shock front velocities due to early stripping of protostellar envelopes by ionizing radiation.
National Combustion Code: Parallel Performance
NASA Technical Reports Server (NTRS)
Babrauckas, Theresa
2001-01-01
This report discusses the National Combustion Code (NCC). The NCC is an integrated system of codes for the design and analysis of combustion systems. The advanced features of the NCC meet designers' requirements for model accuracy and turn-around time. The fundamental features at the inception of the NCC were parallel processing and unstructured mesh. The design and performance of the NCC are discussed.
Parallel algorithms for contour extraction and coding
NASA Astrophysics Data System (ADS)
Dinstein, Its'hak; Landau, Gad M.
1990-07-01
A parallel approach to contour extraction and coding on an Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM) is presented and analyzed. The algorithm is intended for binary images. The labeled contours can be represented by lists of coordinates, and/or chain codes, and/or any other user designed codes. Using $O(n^2/\log n)$ processors, the algorithm runs in $O(\log n)$ time, where n by n is the size of the processed binary image.
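To make the two contour representations mentioned above concrete, here is a small serial sketch that converts a list of contour coordinates into an 8-direction chain code. It is not the EREW PRAM algorithm of the paper. The function name chain_code and the direction numbering (0 = +x, increasing counter-clockwise, with y increasing upward) are assumptions, since the abstract does not fix a convention.

```python
# Assumed numbering: 0 = +x, then counter-clockwise in 45-degree steps.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(points):
    """Encode consecutive 8-connected contour pixels as direction codes.

    points is a list of integer (x, y) pairs; each pair of successive
    points must differ by one pixel step in x and/or y.
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {(x0, y0)} and {(x1, y1)} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes
```

For example, the closed unit square (0,0), (1,0), (1,1), (0,1), (0,0) encodes as 0, 2, 4, 6: one code per step, compared with two integers per point in the coordinate list.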
Optimal parallel evaluation of AND trees
NASA Technical Reports Server (NTRS)
Wah, Benjamin W.; Li, Guo-Jie
1990-01-01
A quantitative analysis based on both preemptive and nonpreemptive critical-path scheduling algorithms is presently conducted for the optimal degree of parallelism required in evaluating a given AND tree. The optimal degree of parallelism is found to depend on problem complexity, precedence-graph shape, and task-time distribution along each path. In addition to demonstrating the optimality of the preemptive critical-path scheduling algorithm for evaluating an arbitrary AND tree on a fixed number of processors, the possibility of efficiently ascertaining tight bounds on the number of processors for optimal processor-time efficiency is illustrated.
Code Parallelization with CAPO: A User Manual
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2001-01-01
A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.
National Combustion Code: Parallel Implementation and Performance
NASA Technical Reports Server (NTRS)
Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.
2000-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Parallel analog neural networks for tree searching
NASA Astrophysics Data System (ADS)
Saylor, Janet; Stork, David G.
1986-08-01
We have modeled parallel analog neural networks designed such that their evolution toward final states is equivalent to finding optimal (or nearly optimal) paths through decision trees. This work extends that done on the Traveling Salesman Problem (TSP)[1] and sheds light on the conditions under which analog neural networks can and cannot find solutions to discrete optimization problems. Neural networks show considerable specificity in finding optimal solutions for tree searches; in the cases when a final state does represent a syntactically correct path, that path will be the best path 70-90% of the time, even for trees with up to two thousand nodes. However, it appears that except for trivial networks lacking the ability to "think globally," there exists no general network architecture that can strictly ensure convergence to a state that represents a single, continuous, unambiguous path. In fact, we find that for roughly 15% of trees with six generations, 40% of trees with eight generations, and 70% of trees with ten generations, networks evolve to "broken paths," i.e., combinations of the beginning of one and the end of another path through a tree. Tree searches illustrate neural dynamics well because tree structures make the effects of competition and positive feedback apparent. We have found that 1) convergence times for networks with up to 2000 neurons are very rapid and depend on the gain of neurons and magnitude of neural connections but not on the number of generations or branching factor of a tree, 2) all neurons along a "winning" path turn on exponentially with the same exponent, and 3) the general computational mechanism of these networks appears to be the pruning of a tree from the outer branches inward, as chain reactions of neurons being quenched tend to propagate along possible paths.
National Combustion Code Parallel Performance Enhancements
NASA Technical Reports Server (NTRS)
Quealy, Angela; Benyo, Theresa (Technical Monitor)
2002-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.
OVERAERO-MPI: Parallel Overset Aeroelasticity Code
NASA Technical Reports Server (NTRS)
Gee, Ken; Rizk, Yehia M.
1999-01-01
An overset modal structures analysis code was integrated with a parallel overset Navier-Stokes flow solver to obtain a code capable of static aeroelastic computations. The new code was used to compute the static aeroelastic deformation of an arrow-wing-body geometry and a complex, full aircraft configuration. For the simple geometry, the results were similar to the results obtained with the ENSAERO code and the PVM version of OVERAERO. The full potential of this code suite was illustrated in the complex, full aircraft computations.
On the parallelization of molecular dynamics codes
NASA Astrophysics Data System (ADS)
Trabado, G. P.; Plata, O.; Zapata, E. L.
2002-08-01
Molecular dynamics (MD) codes present a high degree of spatial data locality and a significant amount of independent computations. However, most of the parallelization strategies are usually based on the manual transformation of sequential programs either by completely rewriting the code with message passing routines or using specific libraries intended for writing new MD programs. In this paper we propose a new library-based approach (DDLY) which supports parallelization of existing short-range MD sequential codes. The novelty of this approach is that it can directly handle the distribution of common data structures used in MD codes to represent data (arrays, Verlet lists, link cells), using domain decomposition. Thus, the insertion of run-time support for distribution and communication in an MD program does not imply significant changes to its structure. The method is simple, efficient and portable. It may also be used to extend existing parallel programming languages, such as HPF.
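The link-cell structure and the domain decomposition mentioned in this abstract can be pictured with a short serial sketch. It does not reproduce the DDLY interface, which the abstract does not describe. The function names build_link_cells and owner_rank, the cubic box with positions in [0, box), and the slab decomposition along x are assumptions chosen only to keep the example small.

```python
import math
from collections import defaultdict


def build_link_cells(positions, box, cutoff):
    """Bin particle indices into cubic cells whose edge is at least cutoff.

    Assumes a cubic box of edge box with positions in [0, box). With cells
    no smaller than the interaction cutoff, all short-range partners of a
    particle lie in its own cell or in the 26 neighbouring cells.
    """
    ncell = max(1, int(box // cutoff))
    side = box / ncell
    cells = defaultdict(list)
    for index, (x, y, z) in enumerate(positions):
        key = tuple(int(math.floor(c / side)) % ncell for c in (x, y, z))
        cells[key].append(index)
    return ncell, cells


def owner_rank(cell, ncell, nranks):
    """Assign a cell to a process by cutting the box into slabs along x."""
    return cell[0] * nranks // ncell
```

Distributing whole cells rather than individual particles keeps most neighbour searches local to one process; only cells on slab boundaries need data from another rank.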
FLY: a Tree Code for Adaptive Mesh Refinement
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.; Costa, A.; Ferro, D.
FLY is a public domain parallel treecode, which makes heavy use of the one-sided communication paradigm to handle the management of the tree structure. It implements the equations for cosmological evolution and can be run for different cosmological models. This paper shows an example of the integration of a tree N-body code with an adaptive mesh, following the PARAMESH scheme. This new implementation will allow the FLY output, and more generally any binary output, to be used with any hydrodynamics code that adopts the PARAMESH data structure, to study compressible flow problems.
New Bandwidth Efficient Parallel Concatenated Coding Schemes
NASA Technical Reports Server (NTRS)
Benedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.
1996-01-01
We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for a throughput of 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.
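The quoted throughput can be related to the constellation size by simple arithmetic. The sketch below assumes ideal signalling at one symbol per second per hertz, so that throughput equals the overall code rate times the number of bits per symbol; the function names and the overall rates 2/3 and 1/2 are inferred from that assumption and are not stated in the abstract.

```python
from fractions import Fraction


def bits_per_symbol(constellation_size: int) -> int:
    """Number of bits carried by one symbol of a power-of-two constellation."""
    if constellation_size < 2 or constellation_size & (constellation_size - 1):
        raise ValueError("constellation size must be a power of two")
    return constellation_size.bit_length() - 1


def throughput(constellation_size: int, overall_code_rate: Fraction) -> Fraction:
    """Information bits per symbol, i.e. bits/sec/Hz at one symbol/sec/Hz."""
    return overall_code_rate * bits_per_symbol(constellation_size)


# 8PSK carries 3 bits per symbol, so an overall rate of 2/3 gives 2 bits/sec/Hz.
eight_psk = throughput(8, Fraction(2, 3))
# 16QAM carries 4 bits per symbol, so an overall rate of 1/2 gives 2 bits/sec/Hz.
sixteen_qam = throughput(16, Fraction(1, 2))
```

Under this assumption the two examples reach the same throughput with different amounts of redundancy: one coded bit in three for 8PSK, one in two for 16QAM.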
GRADSPMHD: A parallel MHD code based on the SPH formalism
NASA Astrophysics Data System (ADS)
Vanaverbeke, S.; Keppens, R.; Poedts, S.
2014-03-01
We present GRADSPMHD, a completely Lagrangian parallel magnetohydrodynamics code based on the SPH formalism. The implementation of the equations of SPMHD in the “GRAD-h” formalism assembles known results, including the derivation of the discretized MHD equations from a variational principle, the inclusion of time-dependent artificial viscosity, resistivity and conductivity terms, as well as the inclusion of a mixed hyperbolic/parabolic correction scheme for satisfying the $\nabla \cdot \vec{B}$ constraint on the magnetic field. The code uses a tree-based formalism for neighbor finding and can optionally use the tree code for computing the self-gravity of the plasma. The structure of the code closely follows the framework of our parallel GRADSPH FORTRAN 90 code which we added previously to the CPC program library. We demonstrate the capabilities of GRADSPMHD by running 1, 2, and 3 dimensional standard benchmark tests and we find good agreement with previous work done by other researchers. The code is also applied to the problem of simulating the magnetorotational instability in 2.5D shearing box tests as well as in global simulations of magnetized accretion disks. We find good agreement with available results on this subject in the literature. Finally, we discuss the performance of the code on a parallel supercomputer with distributed memory architecture. Catalogue identifier: AERP_v1_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AERP_v1_0.html; Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 620503; No. of bytes in distributed program, including test data, etc.: 19837671; Distribution format: tar.gz; Programming language: FORTRAN 90/MPI; Computer: HPC cluster; Operating system: Unix; Has the code been vectorized or parallelized?: Yes, parallelized using MPI; RAM: ~30 MB for a
Parallel CARLOS-3D code development
Putnam, J.M.; Kotulski, J.D.
1996-02-01
CARLOS-3D is a three-dimensional scattering code which was developed under the sponsorship of the Electromagnetic Code Consortium, and is currently used by over 80 aerospace companies and government agencies. The code has been extensively validated and runs on both serial workstations and parallel supercomputers such as the Intel Paragon. CARLOS-3D is a three-dimensional surface integral equation scattering code based on a Galerkin method of moments formulation employing the Rao-Wilton-Glisson roof-top basis for triangular faceted surfaces. Fully arbitrary 3D geometries composed of multiple conducting and homogeneous bulk dielectric materials can be modeled. This presentation describes some of the extensions to the CARLOS-3D code, and how the operator structure of the code facilitated these improvements. Body of revolution (BOR) and two-dimensional geometries were incorporated by simply including new input routines and the appropriate Galerkin matrix operator routines. Some additional modifications were required in the combined field integral equation matrix generation routine due to the symmetric nature of the BOR and 2D operators. Quadrilateral patched surfaces with linear roof-top basis functions were also implemented in the same manner. Quadrilateral facets and triangular facets can be used in combination to more efficiently model geometries with both large smooth surfaces and surfaces with fine detail such as gaps and cracks. Since the parallel implementation in CARLOS-3D is at a high level, these changes were independent of the computer platform being used. This approach minimizes code maintenance, while providing capabilities with little additional effort. Results are presented showing the performance and accuracy of the code for some large scattering problems. Comparisons between triangular faceted and quadrilateral faceted geometry representations will be shown for some complex scatterers.
A parallel and modular deformable cell Car-Parrinello code
NASA Astrophysics Data System (ADS)
Cavazzoni, Carlo; Chiarotti, Guido L.
1999-12-01
We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree-like dependences among modules. The modules include not only the variables but also the methods acting on them, in an object-oriented fashion. The modular structure allows easier code maintenance, development and debugging procedures, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up over a wide range of numbers of processors, independently of the architecture. Super-linear speed-up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as a temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ~50 ps.
Portable, parallel, reusable Krylov space codes
Smith, B.; Gropp, W.
1994-12-31
Krylov space accelerators are an important component of many algorithms for the iterative solution of linear systems. Each Krylov space method has its own particular advantages and disadvantages; therefore it is desirable to have a variety of them available, all with an identical, easy-to-use interface. A common complaint application programmers have with available software libraries for the iterative solution of linear systems is that they require the programmer to use the data structures provided by the library. The library is not able to work with the data structures of the application code. Hence, application programmers find themselves constantly recoding the Krylov space algorithms. The Krylov space package (KSP) is a data-structure-neutral implementation of a variety of Krylov space methods including preconditioned conjugate gradient, GMRES, BiCG-Stab, transpose-free QMR and CGS. Unlike all other software libraries for linear systems that the authors are aware of, KSP will work with any application code's data structures, in Fortran or C. Due to its data-structure-neutral design, KSP runs unchanged on both sequential and parallel machines. KSP has been tested on workstations, the Intel i860 and Paragon, Thinking Machines CM-5 and the IBM SP1.
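The idea of a data-structure-neutral Krylov solver can be illustrated with a minimal sketch. It is not KSP's interface: the function names `conjugate_gradient` and `laplacian_1d` are hypothetical, and plain Python lists stand in for whatever storage an application uses. The solver only ever sees a callback that applies the operator, so the matrix itself never has to exist in a library-defined format.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the initial guess x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # converged: residual norm is small enough
            break
        Ap = matvec(p)               # the only place the operator is touched
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def laplacian_1d(v):
    """Hypothetical application operator: tridiagonal (-1, 2, -1) stored as a stencil."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


solution = conjugate_gradient(laplacian_1d, [1.0, 0.0, 0.0, 1.0])
```

Swapping in a different operator, or one that performs the product across several processes, would leave `conjugate_gradient` unchanged, which is the property the abstract attributes to the data-structure-neutral design.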
Parafrase restructuring of FORTRAN code for parallel processing
NASA Technical Reports Server (NTRS)
Wadhwa, Atul
1988-01-01
Parafrase transforms a FORTRAN code, subroutine by subroutine, into a parallel code for a vector and/or shared-memory multiprocessor system. Parafrase is not a compiler; it transforms a code and provides information for a vector or concurrent process. Parafrase uses a data dependency to reveal parallelism among instructions. The data dependency test distinguishes between recurrences and statements that can be directly vectorized or parallelized. A number of transformations are required to build a data dependency graph.
Petascale Parallelization of the Gyrokinetic Toroidal Code
Ethier, Stephane; Adams, Mark; Carter, Jonathan; Oliker, Leonid
2010-05-01
The Gyrokinetic Toroidal Code (GTC) is a global, three-dimensional particle-in-cell application developed to study microturbulence in tokamak fusion devices. The global capability of GTC is unique, allowing researchers to systematically analyze important dynamics such as turbulence spreading. In this work we examine a new radial domain decomposition approach to allow scalability onto the latest generation of petascale systems. Extensive performance evaluation is conducted on three high performance computing systems: the IBM BG/P, the Cray XT4, and an Intel Xeon Cluster. Overall results show that the radial decomposition approach dramatically increases scalability, while reducing the memory footprint - allowing for fusion device simulations at an unprecedented scale. After a decade where high-end computing (HEC) was dominated by the rapid pace of improvements to processor frequencies, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the tradeoffs of these system designs is a key step towards making effective petascale computing a reality. In this work, we examine a new parallelization scheme for the Gyrokinetic Toroidal Code (GTC) micro-turbulence fusion application. Extensive scalability results and analysis are presented on three HEC systems: the IBM BlueGene/P (BG/P) at Argonne National Laboratory, the Cray XT4 at Lawrence Berkeley National Laboratory, and an Intel Xeon cluster at Lawrence Livermore National Laboratory. Overall results indicate that the new radial decomposition approach successfully attains unprecedented scalability to 131,072 BG/P cores by overcoming the memory limitations of the previous approach. The new version is well suited to utilize emerging petascale resources to access new regimes of physical phenomena.
Adaptive Dynamic Event Tree in RAVEN code
Alfonsi, Andrea; Rabiti, Cristian; Mandelli, Diego; Cogliati, Joshua Joseph; Kinoshita, Robert Arthur
2014-11-01
RAVEN is a software tool that is focused on performing statistical analysis of stochastic dynamic systems. RAVEN has been designed in a highly modular and pluggable way in order to enable easy integration of different programming languages (e.g., C++, Python) and coupling with other applications (system codes). Among the several capabilities currently present in RAVEN, there are five different sampling strategies: Monte Carlo, Latin Hyper Cube, Grid, Adaptive and Dynamic Event Tree (DET) sampling methodologies. The scope of this paper is to present a new sampling approach, currently under definition and implementation: an evolution of the DET me
Parallel object-oriented decision tree system
Kamath, Chandrika; Cantu-Paz, Erick
2006-02-28
A data mining decision tree system that uncovers patterns, associations, anomalies, and other statistically significant structures in data by reading and displaying data files, extracting relevant features for each of the objects, and using a method of recognizing patterns among the objects based upon object features through a decision tree that reads the data, sorts the data if necessary, determines the best manner to split the data into subsets according to some criterion, and splits the data.
PARAMESH: A Parallel, Adaptive Mesh Refinement Toolkit and Performance of the ASCI/FLASH code
NASA Astrophysics Data System (ADS)
Olson, K. M.; MacNeice, P.; Fryxell, B.; Ricker, P.; Timmes, F. X.; Zingale, M.
1999-12-01
We describe a package of routines known as PARAMESH which enables a user to easily convert an existing serial, uniform grid code to a parallel code with adaptive-mesh refinement. The package does this through the use of a block-structured form of AMR in combination with a tree data structure for distributing blocks to processors. We also describe some of the applications which have been developed using PARAMESH with special emphasis on the ASCI/FLASH code. Performance results are also discussed for a variety of parallel architectures.
Parallel Tree Contraction and Its Application.
1985-12-01
observed by Uspensky [231, see 112]. These bounds are commonly known as Chernoff bounds 16J. We shall use the following simply stated bounds [3. Theorem 6...Functions in Logarithmic Parallel Time. 25th Annual Symp. on Foundations of Computer Science, IEEE, 1984, pp. 12-22. 22. J. Uspensky . Introduction to
Parallel Spectral Transform Shallow Water Model: A runtime-tunable parallel benchmark code
Worley, P.H.; Foster, I.T.
1994-05-01
Fairness is an important issue when benchmarking parallel computers using application codes. The best parallel algorithm on one platform may not be the best on another. While it is not feasible to reevaluate parallel algorithms and reimplement large codes whenever new machines become available, it is possible to embed algorithmic options into codes that allow them to be "tuned" for a particular machine without requiring code modifications. In this paper, we describe a code in which such an approach was taken. PSTSWM was developed for evaluating parallel algorithms for the spectral transform method in atmospheric circulation models. Many levels of runtime-selectable algorithmic options are supported. We discuss these options and our evaluation methodology. We also provide empirical results from a number of parallel machines, indicating the importance of tuning for each platform before making a comparison.
Computational efficiency of parallel combinatorial OR-tree searches
NASA Technical Reports Server (NTRS)
Li, Guo-Jie; Wah, Benjamin W.
1990-01-01
The performance of parallel combinatorial OR-tree searches is analytically evaluated. This performance depends on the complexity of the problem to be solved, the error allowance function, the dominance relation, and the search strategies. The exact performance may be difficult to predict due to the nondeterminism and anomalies of parallelism. The authors derive the performance bounds of parallel OR-tree searches with respect to the best-first, depth-first, and breadth-first strategies, and verify these bounds by simulation. They show that a near-linear speedup can be achieved with respect to a large number of processors for parallel OR-tree searches. Using the bounds developed, the authors derive sufficient conditions for assuring that parallelism will not degrade performance and necessary conditions for allowing parallelism to have a speedup greater than the ratio of the numbers of processors. These bounds and conditions provide the theoretical foundation for determining the number of processors required to assure a near-linear speedup.
Memory Scalability and Efficiency Analysis of Parallel Codes
Janjusic, Tommy; Kartsaklis, Christos
2015-01-01
Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful reconsideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exa-scale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application's memory footprint as a function of scalability, which we call memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application's overall memory footprint (application data structures, libraries, etc.).
Shot level parallelization of a seismic inversion code using PVM
Versteeg, R.J.; Gockenback, M.; Symes, W.W.; Kern, M.
1994-12-31
This paper presents experience with parallelization using PVM of DSO, a seismic inversion code developed in The Rice Inversion Project. It focuses on one aspect: trying to run efficiently on a cluster of 4 workstations. The authors use a coarse-grain parallelism in which they dynamically distribute the shots over the available machines in the cluster. The modeling and migration of their code is parallelized very effectively by this strategy; they have reached an overall performance of 104 Mflops using a configuration of one manager with 3 workers, a speedup of 2.4 versus the serial version, which according to Amdahl's law is optimal given the current design of their code. Further speedup is currently limited by the non-parallelized parts of their code: optimization, linear algebra and I/O.
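The Amdahl's law figures quoted above can be checked with a short worked calculation. This is a sketch under one stated assumption that the abstract does not confirm: the three workers are the only processes doing parallel work, and the manager's cost is negligible. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only `parallel_fraction` of the runtime scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def parallel_fraction_from_speedup(speedup, workers):
    """Invert Amdahl's law to recover the parallel fraction from a measured speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


# Figures quoted in the abstract: speedup 2.4 with 3 workers (assumed processor count).
p = parallel_fraction_from_speedup(2.4, 3)   # fraction of serial runtime that parallelizes
limit = 1.0 / (1.0 - p)                      # asymptotic speedup with unlimited workers
```

Under that assumption the parallel fraction comes out at 0.875, so the serial remainder (optimization, linear algebra and I/O) would cap the speedup at 8 no matter how many workstations are added.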
Parallel-vector computation for CSI-design code
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Computational aspects of the Control-Structure Interaction (CSI) DESIGN code are reviewed. Numerically intensive computation portions of the CSI-DESIGN code were identified. Improvements in computational speed for the CSI-DESIGN code can be achieved by exploiting parallel and vector capabilities offered by modern computers, such as the Alliant, Convex, Cray-2, and Cray-YMP. Four options to generate the coefficient stiffness matrix and to solve the system of linear, simultaneous equations are currently available in the CSI-DESIGN code. A preprocessor that uses the RCM (Reverse Cuthill-McKee) algorithm for bandwidth minimization was also developed for the CSI-DESIGN code. Preliminary results obtained by solving a small-scale, 97-node CSI finite element model (for eigensolution) have indicated that this new CSI-DESIGN code is 5 to 6 times faster (using 1 Alliant processor) than the old version of the CSI-DESIGN code. This speed-up was achieved due to the RCM algorithm and the use of a new skyline solver. Efforts are underway to further improve the vector speed for the CSI-DESIGN code, to evaluate its performance on a larger scale CSI model (such as the phase zero CSI model), to make the code run efficiently in a multiprocessor, parallel computer environment, and to make the code portable among different parallel computers available at NASA LaRC, such as Alliant, Convex, and Cray computers.
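The effect of Reverse Cuthill-McKee reordering on matrix bandwidth can be seen in a small sketch. This is a textbook-style version written for illustration, not the CSI-DESIGN preprocessor; the graph representation (a dict of neighbour sets) and the helper `bandwidth` are assumptions made here.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """Return an RCM ordering for an undirected graph given as {node: set(neighbours)}."""
    visited, order = set(), []
    # Handle each connected component, starting from a minimum-degree node.
    for start in sorted(adjacency, key=lambda n: len(adjacency[n])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Enqueue unvisited neighbours in order of increasing degree.
            for nb in sorted(adjacency[node] - visited, key=lambda n: len(adjacency[n])):
                visited.add(nb)
                queue.append(nb)
    return order[::-1]               # reversing the Cuthill-McKee order gives RCM


def bandwidth(adjacency, order):
    """Maximum |i - j| over edges once nodes are numbered by position in `order`."""
    pos = {node: i for i, node in enumerate(order)}
    return max((abs(pos[a] - pos[b]) for a in adjacency for b in adjacency[a]), default=0)


# A 6-node chain whose nodes were numbered in a scrambled order: 0-3-5-1-4-2.
chain = {0: {3}, 3: {0, 5}, 5: {3, 1}, 1: {5, 4}, 4: {1, 2}, 2: {4}}
natural = sorted(chain)
rcm = reverse_cuthill_mckee(chain)
```

For this chain the original numbering has bandwidth 4 while the RCM ordering has bandwidth 1. A narrower band means fewer stored entries and less fill-in for a banded or skyline solver, which is consistent with the abstract crediting the speed-up to RCM together with the new skyline solver.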
CALTRANS: A parallel, deterministic, 3D neutronics code
Carson, L.; Ferguson, J.; Rogers, J.
1994-04-01
Our efforts to parallelize the deterministic solution of the neutron transport equation have culminated in a new neutronics code, CALTRANS, which has full 3D capability. In this article, we describe the layout and algorithms of CALTRANS and present performance measurements of the code on a variety of platforms. Explicit implementations of the parallel algorithms of CALTRANS using both the function calls of the Parallel Virtual Machine software package (PVM 3.2) and the Meiko CS-2 tagged message passing library (based on the Intel NX/2 interface) are provided in appendices.
Capabilities of Fully Parallelized MHD Stability Code MARS
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2016-10-01
Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. A parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both the fluid and kinetic plasma models implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iteration algorithm implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
Identifying failure in a tree network of a parallel computer
Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.
2010-08-24
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.
Parallel peak pruning for scalable SMP contour tree computation
Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.; ...
2017-03-09
As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. In this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg n lg t) parallel steps and O(n lg n) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.
Scalability and Parallelization of Monte-Carlo Tree Search
NASA Astrophysics Data System (ADS)
Bourki, Amine; Chaslot, Guillaume; Coulm, Matthieu; Danjean, Vincent; Doghmen, Hassen; Hoock, Jean-Baptiste; Hérault, Thomas; Rimmel, Arpad; Teytaud, Fabien; Teytaud, Olivier; Vayssière, Paul; Yu, Ziqin
Monte-Carlo Tree Search is now a well-established algorithm, in games and beyond. We analyze its scalability, and in particular its limitations and the implications in terms of parallelization. We focus on our Go program MoGo and our Havannah program Shakti. We use multicore machines and message-passing machines. For both games and on both types of machines we achieve adequate efficiency for the parallel version. However, in spite of promising results in self-play there are situations for which increasing the time per move does not solve anything. Therefore parallelization is not a solution to all our problems. Nonetheless, for problems where the Monte-Carlo part is less biased than in the game of Go, parallelization should be quite efficient, even without shared memory.
BTREE: A FORTRAN Code for B+ Tree.
2014-09-26
such large databases. NSWC TR 85-54 REFERENCES 1. Comer , D., "The Ubiquitous B Tree," Computing Surveys, Vol. 11, 1979, pp. 121-137. 2. Knuth, D...34The Ubiquitous B Tree" by Douglas Comer , Computing Surveys, C 11(1979)121-137; a more complete discussion can be found in C "The Art of Computer
Parallelization of a Monte Carlo particle transport simulation code
NASA Astrophysics Data System (ADS)
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with the use of more accurate physical models, and improve statistics as more particle tracks can be simulated in low response time.
Parallel Algorithms for Graph Optimization using Tree Decompositions
Sullivan, Blair D; Weerapurage, Dinesh P; Groer, Christopher S
2012-06-01
Although many $\mathcal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes
Farreras, Montse
2014-01-01
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. PMID:24597675
Parallel continuous flow: a parallel suffix tree construction tool for whole genomes.
Comin, Matteo; Farreras, Montse
2014-04-01
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
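The efficiency figures in the PCF abstract translate into speedups through the usual definition, efficiency = speedup / processors. The sketch below applies that definition to the two reported data points; it assumes the efficiencies are measured against a single-processor baseline, which the abstract does not state.

```python
def speedup(efficiency, processors):
    """Parallel speedup implied by an efficiency figure: S = E * p."""
    return efficiency * processors


# (efficiency, processors) pairs as reported in the abstract.
reported = [(0.90, 36), (0.55, 172)]
speedups = [speedup(e, p) for e, p in reported]
```

Under that assumption the two configurations correspond to speedups of about 32.4 and 94.6, so adding processors still shortens the run even though each one is used less efficiently.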
Parallel Scaling Characteristics of Selected NERSC User Project Codes
Skinner, David; Verdier, Francesca; Anand, Harsh; Carter, Jonathan; Durst, Mark; Gerber, Richard
2005-03-05
This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.
Coding hazardous tree failures for a data management system
Lee A. Paine
1978-01-01
Codes for automatic data processing (ADP) are provided for hazardous tree failure data submitted on Report of Tree Failure forms. Definitions of data items and suggestions for interpreting ambiguously worded reports are also included. The manual is intended to insure the production of accurate and consistent punched ADP cards which are used in transfer of the data to...
Efficient coding of wavelet trees and its applications in image coding
NASA Astrophysics Data System (ADS)
Zhu, Bin; Yang, En-hui; Tewfik, Ahmed H.; Kieffer, John C.
1996-02-01
We propose in this paper a novel lossless tree coding algorithm. The technique is a direct extension of the bisection method, the simplest case of the complexity reduction method proposed recently by Kieffer and Yang, that has been used for lossless data string coding. A reduction rule is used to obtain the irreducible representation of a tree, and this irreducible tree is entropy-coded instead of the input tree itself. This reduction is reversible, and the original tree can be fully recovered from its irreducible representation. More specifically, we search for equivalent subtrees from top to bottom. When equivalent subtrees are found, a special symbol is appended to the value of the root node of the first equivalent subtree, the root node of the second subtree is assigned the index which points to the first subtree, and all other nodes in the second subtree are removed. This procedure is repeated until the tree cannot be reduced further. This yields the irreducible tree or irreducible representation of the original tree. The proposed method can effectively remove the redundancy in an image, and results in more efficient compression. It is proved that when the tree size approaches infinity, the proposed method offers the optimal compression performance. It is generally more efficient in practice than direct coding of the input tree. The proposed method can be directly applied to code wavelet trees in non-iterative wavelet-based image coding schemes. A modified method is also proposed for coding wavelet zerotrees in embedded zerotree wavelet (EZW) image coding. Although its coding efficiency is slightly reduced, the modified version maintains exact control of bit rate and the scalability of the bit stream in EZW coding.
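The reversible reduction described above, in which a repeated subtree is replaced by a pointer to its first occurrence, can be sketched as follows. This is not the authors' algorithm: it finds equivalent subtrees by bottom-up hashing rather than the top-to-bottom search the abstract describes, and it omits the special marker symbol and the entropy coder. The nested-tuple tree format and the names `reduce_tree` and `expand_tree` are assumptions made for illustration.

```python
def reduce_tree(tree):
    """Share repeated subtrees and return (nodes, root_index).

    `tree` is a nested tuple (value, child, child, ...). Each distinct subtree
    is stored once in `nodes` as (value, child_indices); repeats become indices.
    """
    nodes, index_of = [], {}

    def visit(subtree):
        value, *children = subtree
        key = (value, tuple(visit(c) for c in children))
        if key not in index_of:          # first occurrence: store the subtree
            index_of[key] = len(nodes)
            nodes.append(key)
        return index_of[key]             # later occurrences: pointer only

    return nodes, visit(tree)


def expand_tree(nodes, index):
    """Inverse of reduce_tree: rebuild the nested-tuple tree from shared nodes."""
    value, children = nodes[index]
    return (value, *(expand_tree(nodes, c) for c in children))


leaf = (0,)
quad = (1, leaf, leaf)
original = (2, quad, quad, (3, leaf))    # 9 nodes, with repeated subtrees
nodes, root = reduce_tree(original)
```

Here the 9-node input collapses to 4 stored nodes, and `expand_tree(nodes, root)` rebuilds the original exactly, which is the losslessness property the abstract relies on before entropy coding.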
A Data Parallel Multizone Navier-Stokes Code
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon; Kwak, Dochan (Technical Monitor)
1995-01-01
We have developed a data parallel multizone compressible Navier-Stokes code on the Connection Machine CM-5. The code is set up for implicit time-stepping on single or multiple structured grids. For multiple grids and geometrically complex problems, we follow the "chimera" approach, where flow data on one zone is interpolated onto another in the region of overlap. We will describe our design philosophy and give some timing results for the current code. The design choices can be summarized as: 1. finite differences on structured grids; 2. implicit time-stepping with either distributed solves or data motion and local solves; 3. sequential stepping through multiple zones with interzone data transfer via a distributed data structure. We have implemented these ideas on the CM-5 using CMF (Connection Machine Fortran), a data parallel language which combines elements of Fortran 90 and certain extensions, and which bears a strong similarity to High Performance Fortran (HPF). One interesting feature is the issue of turbulence modeling, where the architecture of a parallel machine makes the use of an algebraic turbulence model awkward, whereas models based on transport equations are more natural. We will present some performance figures for the code on the CM-5, and consider the issues involved in transitioning the code to HPF for portability to other parallel platforms.
Gamr: A Free, Parallel, Adaptive Tectonics and Mantle Convection Code
NASA Astrophysics Data System (ADS)
Landry, W.
2010-12-01
Computational Infrastructure for Geodynamics (CIG) has begun development of Gamr: a new community code for tectonics and mantle convection. The principal improvement of Gamr over existing community codes such as CitcomS and Gale is the use of parallel adaptive mesh refinement. This will allow Gamr to better resolve fine features such as faults, plate boundaries, and mantle plumes. I will discuss the current status of Gamr and outline future milestones.
An Expert System for the Development of Efficient Parallel Code
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit
2004-01-01
We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.
Parallel family trees for transfer matrices in the Potts model
NASA Astrophysics Data System (ADS)
Navarro, Cristobal A.; Canfora, Fabrizio; Hitschfeld, Nancy; Navarro, Gonzalo
2015-02-01
The computational cost of transfer matrix methods for the Potts model is related to the question: in how many ways can two layers of a lattice be connected? Answering the question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q, v) transfer matrix methods for strips is in the order of the Catalan numbers, which grows asymptotically as O(4^m) where m is the width of the strip. Other transfer matrix methods with a smaller configuration space indeed exist but they make assumptions on the temperature, number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3^m) to build the generic (q, v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is now computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while still highly parallel. The resulting matrix is stored in a compressed form using O(3^m × 4^m) of space, making numerical evaluation and decompression faster than evaluating the matrix in its O(4^m × 4^m) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strong-scaling speedups of up to 5.7× when running on an 8-core shared memory machine and 28× for a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved when using p = 4 processors, while for the cluster
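The O(4^m) growth of the Catalan numbers mentioned above can be checked numerically. The sketch uses the standard closed form C_m = binom(2m, m) / (m + 1); the function names are illustrative, and nothing here reproduces the paper's family-tree construction.

```python
from math import comb


def catalan(m):
    """m-th Catalan number, C_m = binom(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)


def growth_ratio(m):
    """C_(m+1) / C_m, which equals 2(2m + 1) / (m + 2) and tends to 4."""
    return catalan(m + 1) / catalan(m)


# Consecutive ratios climb towards 4 as the strip width m grows.
ratios = [growth_ratio(m) for m in (5, 50, 200)]
```

Because each extra unit of strip width multiplies the Catalan count by a factor approaching 4, a configuration space that grows by a factor of 3 per unit width becomes exponentially smaller in relative terms as m increases, which is the asymptotic claim the abstract makes.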
Parallelizing the MARS15 Code with MPI for shielding applications
Mikhail A. Kostin and Nikolai V. Mokhov
2004-05-12
The capabilities of the MARS15 Monte Carlo code to deal with time-consuming deep penetration shielding problems and other computationally tough tasks in accelerator, detector and shielding applications have been enhanced by a parallel processing option. It has been developed, implemented and tested on the Fermilab Accelerator Division Linux cluster and network of Sun workstations. The code uses MPI. It is scalable and demonstrates good performance. The general architecture of the code, specific uses of message passing, and effects of scheduling on the performance and fault tolerance are described.
New Parallel computing framework for radiation transport codes
Kostin, M.A.; Mokhov, N.V.; Niita, K.; /JAERI, Tokai
2010-09-01
A new parallel computing framework has been developed for use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is largely independent of the radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one, thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where interference from other users is possible.
LUDWIG: A parallel Lattice-Boltzmann code for complex fluids
NASA Astrophysics Data System (ADS)
Desplat, Jean-Christophe; Pagonabarraga, Ignacio; Bladon, Peter
2001-03-01
This paper describes Ludwig, a versatile code for the simulation of Lattice-Boltzmann (LB) models in 3D on cubic lattices. In fact, Ludwig is not a single code, but a set of codes that share certain common routines, such as I/O and communications. If Ludwig is used as intended, a variety of complex fluid models with different equilibrium free energies are simple to code, so that the user may concentrate on the physics of the problem, rather than on parallel computing issues. Thus far, Ludwig's main application has been to symmetric binary fluid mixtures. We first explain the philosophy and structure of Ludwig which is argued to be a very effective way of developing large codes for academic consortia. Next we elaborate on some parallel implementation issues such as parallel I/O, and the use of MPI to achieve full portability and good efficiency on both MPP and SMP systems. Finally, we describe how to implement generic solid boundaries, and look in detail at the particular case of a symmetric binary fluid mixture near a solid wall. We present a novel scheme for the thermodynamically consistent simulation of wetting phenomena, in the presence of static and moving solid boundaries, and check its performance.
The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation
NASA Astrophysics Data System (ADS)
Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru
1999-09-01
We have developed a new parallel tree method which will be called the forest method hereafter. This new method uses the sectional Voronoi tessellation (SVT) for the domain decomposition. The SVT decomposes a whole space into polyhedra and allows their flat borders to move by assigning different weights. The forest method determines these weights based on the load balancing among processors by means of the overload diffusion (OLD). Moreover, since all the borders are flat, before receiving the data from other processors, each processor can collect enough data to calculate the gravity force with precision. Both the SVT and the OLD are coded in a highly vectorizable manner to suit vector parallel processors. The parallel code based on the forest method with the Message Passing Interface is run on various platforms so that a wide portability is guaranteed. Extensive calculations with 15 processors of Fujitsu VPP300/16R indicate that the code can calculate the gravity force exerted on 10^5 particles in each second for some ideal dark halo. This code is found to enable an N-body simulation with 10^7 or more particles for a wide dynamic range and is therefore a very powerful tool for the study of galaxy formation and large-scale structure in the universe.
Boltzmann Transport Code Update: Parallelization and Integrated Design Updates
NASA Technical Reports Server (NTRS)
Heinbockel, J. H.; Nealy, J. E.; DeAngelis, G.; Feldman, G. A.; Chokshi, S.
2003-01-01
The ongoing efforts at developing a web site for radiation analysis are expected to result in an increased usage of the High Charge and Energy Transport Code HZETRN. It would be desirable to perform the requested calculations quickly and efficiently. Therefore the question arose, "Could the implementation of parallel processing speed up the calculations required?" To answer this question two modifications of the HZETRN computer code were created. The first modification selected the shield material of Al(2219), then polyethylene and then Al(2219). The modified Fortran code was labeled 1SSTRN.F. The second modification considered the shield material of CO2 and Martian regolith. This modified Fortran code was labeled MARSTRN.F.
Advances in Parallel Electromagnetic Codes for Accelerator Science and Development
Ko, Kwok; Candel, Arno; Ge, Lixin; Kabel, Andreas; Lee, Rich; Li, Zenghai; Ng, Cho; Rawat, Vineet; Schussman, Greg; Xiao, Liling; /SLAC
2011-02-07
Over a decade of concerted effort in code development for accelerator applications has resulted in a new set of electromagnetic codes which are based on higher-order finite elements for superior geometry fidelity and better solution accuracy. SLAC's ACE3P code suite is designed to harness the power of massively parallel computers to tackle large complex problems with the increased memory and solve them at greater speed. The US DOE supports the computational science R&D under the SciDAC project to improve the scalability of ACE3P, and provides the high performance computing resources needed for the applications. This paper summarizes the advances in the ACE3P set of codes, explains the capabilities of the modules, and presents results from selected applications covering a range of problems in accelerator science and development important to the Office of Science.
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time of the application, T_par, is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
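The dependence of speedup on sequential code segments described above is commonly expressed as Amdahl's law. The abstract does not name that law, so the following is only an illustrative sketch: the function name and the serial fractions used in the demonstration are assumptions, not figures from the CSTEM or METCAN experiments.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the run time is inherently sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The sequential part is unchanged; only the parallel part is divided among processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for s in (0.05, 0.25, 0.50):
        print(s, [round(amdahl_speedup(s, p), 2) for p in (2, 8, 32, 128)])
```

With half of the run time sequential, the bound stays below 2 no matter how many processors are added, which is consistent with the low speedups the abstract reports for codes with large sequential segments.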
Parallelization of KENO-Va Monte Carlo code
NASA Astrophysics Data System (ADS)
Ramón, Javier; Peña, Jorge
1995-07-01
KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared memory machines and another for distributed memory systems using the message-passing interface PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. An FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
Parallelization of an unstructured grid, hydrodynamic-diffusion code
Milovich, J L; Shestakov, A
1998-05-20
We describe the parallelization of a three-dimensional, unstructured grid, finite element code which solves hyperbolic conservation laws for mass, momentum, and energy, and diffusion equations modeling heat conduction and radiation transport. Explicit temporal differencing advances the cell-based gasdynamic equations. Diffusion equations use fully implicit differencing of nodal variables which leads to large, sparse, symmetric, and positive definite matrices. Because of the unstructured grid, the off-diagonal non-zero elements appear in unpredictable locations. The linear systems are solved using parallelized conjugate gradients. The code is parallelized by domain decomposition of physical space into disjoint subdomains (SDs). Each processor receives its own SD plus a border of ghost cells. Results are presented on a problem coupling hydrodynamics to non-linear heat cond
Parallelization of the Legendre Transform for a Geodynamics Code
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.; Heien, E. M.
2014-12-01
Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of Earth. The code has been shown to scale well on computer clusters capable of computing at the order of millions of core hours. Depending on the resolution and time requirements, simulations may require weeks to years of clock time for specific target problems. A significant portion of the code execution time is spent transforming computed quantities between physical values and spherical harmonic coefficients, equivalent to a series of linear algebra operations. Intermixing C and Fortran code has opened the door to the parallel computing platform, Cuda and its associated libraries. We successfully implemented the parallelization of the scaling of the Legendre polynomials by both Schmidt Normalization coefficients, and a set of weighting coefficients; however, the expected speedup was not realized. Specifically, the original version of Calypso 1.1 computes the Legendre transform approximately four seconds faster than the Cuda-enabled modified version. By profiling the code, we determined that the time taken to transfer the data from host memory to GPU memory does not compare to the number of computations happening within the GPU. Nevertheless, by utilizing techniques such as memory coalescing, cached memory, pinned memory, dynamic parallelism, asynchronous calls, and overlapped memory transfers with computations, the likelihood of a speedup increases. Moreover, ideally the generation of the Legendre polynomial coefficients, Schmidt Normalization Coefficients, and the set of weights should not only be parallelized but be computed on-the-fly within the GPU. The end result is that we reduce the number of memory transfers from host to GPU, increase the number of parallelized computations on the GPU, and decrease the number of serial computations on the CPU. Also, the time taken to transform physical values to spherical
Advances in Parallelization for Large Scale Oct-Tree Mesh Generation
NASA Technical Reports Server (NTRS)
O'Connell, Matthew; Karman, Steve L.
2015-01-01
Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large scale off-body hierarchical meshes is presented here. This work enables large scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" oct-tree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results are shown including a one billion cell mesh generated around a Predator Unmanned Aerial Vehicle geometry, which was generated on 64 processors in under 45 minutes.
Parallelization of PANDA discrete ordinates code using spatial decomposition
Humbert, P.
2006-07-01
We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by pipelining of directions and octants. The implementation of the algorithm is straightforward using MPI blocking point-to-point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7 benchmark of the OECD/NEA. (authors)
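The idea behind a diagonal plane ordered sweep over a Cartesian domain topology can be sketched as follows. This is a generic illustration, not PANDA's implementation: it handles a single sweep direction (increasing i, j and k), ignores the pipelining of directions and octants, and the function name `diagonal_planes` is hypothetical.

```python
from itertools import product


def diagonal_planes(nx: int, ny: int, nz: int):
    """Group subdomain indices (i, j, k) by diagonal plane d = i + j + k.

    For a sweep in the (+, +, +) direction, a subdomain needs incoming fluxes
    from (i-1, j, k), (i, j-1, k) and (i, j, k-1), which all lie on plane d - 1.
    Subdomains on the same plane are therefore independent of one another.
    """
    planes = [[] for _ in range(nx + ny + nz - 2)]
    for i, j, k in product(range(nx), range(ny), range(nz)):
        planes[i + j + k].append((i, j, k))
    return planes


if __name__ == "__main__":
    for d, plane in enumerate(diagonal_planes(2, 2, 2)):
        print(d, plane)
```

In this sketch, the number of subdomains that can work concurrently grows and then shrinks as the wavefront crosses the domain, which is one reason pipelining several directions helps keep processors busy.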
A Very Fast and Angular Momentum Conserving Tree Code
NASA Astrophysics Data System (ADS)
Marcello, Dominic C.
2017-09-01
There are many methods used to compute the classical gravitational field in astrophysical simulation codes. With the exception of the typically impractical method of direct computation, none ensure conservation of angular momentum to machine precision. Under uniform time-stepping, the Cartesian fast multipole method of Dehnen (also known as the very fast tree code) conserves linear momentum to machine precision. We show that it is possible to modify this method in a way that conserves both angular and linear momenta.
A new tree code method for simulation of planetesimal dynamics
NASA Astrophysics Data System (ADS)
Richardson, D. C.
1993-03-01
A new tree code method for simulation of planetesimal dynamics is presented. A self-similarity argument is used to restrict the problem to a small patch of a ring of planetesimals at 1 AU from the sun. The code incorporates a sliding box model with periodic boundary conditions and surrounding ghost particles. The tree is self-repairing and exploits the flattened nature of Keplerian disks to maximize efficiency. The code uses a fourth-order force polynomial integration algorithm with individual particle time-steps. Collisions and mergers, which play an important role in planetesimal evolution, are treated in a comprehensive manner. In typical runs with a few hundred central particles, the tree code is approximately 2-3 times faster than a recent direct summation method and requires about 1 CPU day on a Sparc IPX workstation to simulate 100 yr of evolution. The average relative force error incurred in such runs is less than 0.2 per cent in magnitude. In general, the CPU time as a function of particle number varies in a way consistent with an O(N log N) algorithm. In order to take advantage of facilities available, the code was written in C in a Unix workstation environment. The unique aspects of the code are discussed in detail and the results of a number of performance tests - including a comparison with previous work - are presented.
Development of Parallel Code for the Alaska Tsunami Forecast Model
NASA Astrophysics Data System (ADS)
Bahng, B.; Knight, W. R.; Whitmore, P.
2014-12-01
The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation, the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results, with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs, and will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.
A Parallel Numerical Micromagnetic Code Using FEniCS
NASA Astrophysics Data System (ADS)
Nagy, L.; Williams, W.; Mitchell, L.
2013-12-01
Many problems in the geosciences depend on understanding the ability of magnetic minerals to provide stable paleomagnetic recordings. Numerical micromagnetic modelling allows us to calculate the domain structures found in naturally occurring magnetic materials. However the computational cost rises exceedingly quickly with respect to the size and complexity of the geometries that we wish to model. This problem is compounded by the fact that the modern processor design no longer focuses on the speed at which calculations are performed, but rather on the number of computational units amongst which we may distribute our calculations. Consequently to better exploit modern computational resources our micromagnetic simulations must "go parallel". We present a parallel and scalable micromagnetics code written using FEniCS. FEniCS is a multinational collaboration involving several institutions (University of Cambridge, University of Chicago, The Simula Research Laboratory, etc.) that aims to provide a set of tools for writing scientific software; in particular software that employs the finite element method. The advantages of this approach are the leveraging of pre-existing projects from the world of scientific computing (PETSc, Trilinos, Metis/Parmetis, etc.) and exposing these so that researchers may pose problems in a manner closer to the mathematical language of their domain. Our code provides a scriptable interface (in Python) that allows users to not only run micromagnetic models in parallel, but also to perform pre/post processing of data.
Parallelization of ICF3D, a Diffusion and Hydrodynamics Code
NASA Astrophysics Data System (ADS)
Shestakov, A. I.; Milovich, J. L.
1997-11-01
We describe the parallelization of the unstructured grid ICF3D code. The strategy divides physical space into a collection of disjoint subdomains, one per processing element (PE). The subdomains may be of arbitrary shape but, for efficiency, should have small surface-to-volume ratios. The strategy is ideally suited for distributed memory computers, but also works on shared memory architectures. The hydrodynamic module, which uses a cell-based algorithm using discontinuous finite elements, is parallelized by assigning cells to different PEs. This assignment is done by a separate program and constitutes input data for ICF3D. The diffusion module, a kernel of the heat conduction and radiation diffusion packages, advances continuous fields which are discretized using a nodal finite element method. This module is parallelized by assigning points to individual PEs. The assignment is done within ICF3D. The code is in C++. Special message passing objects (MPO) determine the connectivity of the subdomains and transfer data between them by calling MPI functions. Results are presented on a variety of computers: CRAY T3D and IBM SP2 at Livermore, and Intel's ASCI RED at Sandia, Albuquerque.
Composing Data Parallel Code for a SPARQL Graph Engine
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste; Haglin, David J.; Feo, John
2013-09-08
Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
NASA Technical Reports Server (NTRS)
Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad
1994-01-01
The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent transition in three-dimensional boundary-layer flows are computed with the PSDNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.
CHOLLA: A New Massively Parallel Hydrodynamics Code for Astrophysical Simulation
NASA Astrophysics Data System (ADS)
Schneider, Evan E.; Robertson, Brant E.
2015-04-01
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256^3) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
MPI parallelization of full PIC simulation code with Adaptive Mesh Refinement
NASA Astrophysics Data System (ADS)
Matsui, Tatsuki; Nunami, Masanori; Usui, Hideyuki; Moritaka, Toseo
2010-11-01
A new parallelization technique developed for the PIC method with adaptive mesh refinement (AMR) is introduced. In the AMR technique, the complicated cell arrangements are organized and managed as interconnected pointers with multiple resolution levels, forming a fully threaded tree structure as a whole. In order to retain this tree structure distributed over multiple processes, remote memory access, an extended feature of the MPI-2 standard, is employed. Another important feature of the present simulation technique is domain decomposition according to the modified Morton ordering. This algorithm groups equal numbers of particle calculation loops, which allows for better load balance. Using this advanced simulation code, preliminary results for basic physical problems are presented as a validity check, together with benchmarks to test performance and scalability.
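The role of Morton ordering in domain decomposition can be sketched with the standard (unmodified) Z-order curve. The abstract does not describe the authors' modification, so this is only an assumption-laden illustration: the function names are hypothetical, and the split balances cell counts, whereas the abstract balances particle calculation loops.

```python
def morton_encode_3d(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of three non-negative cell indices into one key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def split_by_morton(cells, n_ranks: int):
    """Sort cells along the Morton curve and cut it into contiguous, near-equal chunks."""
    ordered = sorted(cells, key=lambda c: morton_encode_3d(*c))
    base, extra = divmod(len(ordered), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks take one more cell
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    cells = [(i, j, k) for i in range(2) for j in range(2) for k in range(2)]
    for rank, chunk in enumerate(split_by_morton(cells, 3)):
        print(rank, chunk)
```

Because cells that are close on the Morton curve tend to be close in space, cutting the curve into contiguous pieces gives each process a reasonably compact subdomain. Weighting each cell by its particle count before cutting would move the sketch closer to the load balancing the abstract describes.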
Development of a massively parallel parachute performance prediction code
Peterson, C.W.; Strickland, J.H.; Wolfe, W.P.; Sundberg, W.D.; McBride, D.D.
1997-04-01
The Department of Energy has given Sandia full responsibility for the complete life cycle (cradle to grave) of all nuclear weapon parachutes. Sandia National Laboratories is initiating development of a complete numerical simulation of parachute performance, beginning with parachute deployment and continuing through inflation and steady state descent. The purpose of the parachute performance code is to predict the performance of stockpile weapon parachutes as these parachutes continue to age well beyond their intended service life. A new massively parallel computer will provide unprecedented speed and memory for solving this complex problem, and new software will be written to treat the coupled fluid, structure and trajectory calculations as part of a single code. Verification and validation experiments have been proposed to provide the necessary confidence in the computations.
Development of parallel DEM for the open source code MFIX
Gopalakrishnan, Pradeep; Tafti, Danesh
2013-02-01
The paper presents the development of a parallel Discrete Element Method (DEM) solver for the open source code, Multiphase Flow with Interphase eXchange (MFIX) based on the domain decomposition method. The performance of the code was evaluated by simulating a bubbling fluidized bed with 2.5 million particles. The DEM solver shows strong scalability up to 256 processors with an efficiency of 81%. Further, to analyze weak scaling, the static height of the fluidized bed was increased to hold 5 and 10 million particles. The results show that global communication cost increases with problem size while the computational cost remains constant. Further, the effects of static bed height on the bubble hydrodynamics and mixing characteristics are analyzed.
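The strong- and weak-scaling figures quoted above follow from simple ratios of run times. The sketch below shows those ratios under the usual definitions; the function names and the timings in the demonstration are hypothetical, and the abstract does not state the reference processor count behind its 81% figure.

```python
def strong_scaling_efficiency(t_ref: float, p_ref: int, t_p: float, p: int) -> float:
    """Fixed total problem size: measured speedup divided by the ideal speedup p / p_ref."""
    speedup = t_ref / t_p
    return speedup / (p / p_ref)


def weak_scaling_efficiency(t_ref: float, t_p: float) -> float:
    """Fixed work per processor: the ideal run time stays equal to the reference time."""
    return t_ref / t_p


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements from the MFIX study.
    print(strong_scaling_efficiency(t_ref=100.0, p_ref=1, t_p=0.5, p=256))
    print(weak_scaling_efficiency(t_ref=10.0, t_p=12.5))
```

In a weak-scaling test such as the 5 and 10 million particle runs, a growing global communication cost shows up as a run time that rises with processor count even though the computational work per processor is unchanged.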
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2012 CFR
2012-01-01
... Title 1, General Provisions: Parallel citations of Code and Federal Register. Section 21.23. ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2010 CFR
2010-01-01
... Title 1, General Provisions: Parallel citations of Code and Federal Register. Section 21.23. ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2014 CFR
2014-01-01
... Title 1, General Provisions: Parallel citations of Code and Federal Register. Section 21.23. ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2011 CFR
2011-01-01
... Title 1, General Provisions: Parallel citations of Code and Federal Register. Section 21.23. ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2013 CFR
2013-01-01
... 1 General Provisions 1 2013-01-01 2012-01-01 true Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
Adaptive zero-tree structure for curved wavelet image coding
NASA Astrophysics Data System (ADS)
Zhang, Liang; Wang, Demin; Vincent, André
2006-02-01
We investigate the issue of efficient data organization and representation of the curved wavelet coefficients [curved wavelet transform (WT)]. We present an adaptive zero-tree structure that exploits the cross-subband similarity of the curved wavelet transform. Whereas in the embedded zero-tree wavelet (EZW) and the set partitioning in hierarchical trees (SPIHT) the parent-child relationship is defined in such a way that a parent has four children, restricted to a square of 2×2 pixels, the parent-child relationship in the adaptive zero-tree structure varies according to the curves along which the curved WT is performed. Five child patterns were determined based on different combinations of curve orientation. A new image coder was then developed based on this adaptive zero-tree structure and the set-partitioning technique. Experimental results using synthetic and natural images showed the effectiveness of the proposed adaptive zero-tree structure for encoding of the curved wavelet coefficients. The coding gain of the proposed coder can be up to 1.2 dB in terms of peak SNR (PSNR) compared to the SPIHT coder. Subjective evaluation shows that the proposed coder preserves lines and edges better than the SPIHT coder.
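For reference, the conventional fixed parent-child relation that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the 2×2 rule only: it ignores the special handling of the coarsest subband in EZW and SPIHT, it does not reproduce the paper's five adaptive child patterns, and the function name is hypothetical.

```python
def square_children(row: int, col: int) -> list[tuple[int, int]]:
    """Four children of coefficient (row, col): the 2x2 block at the next finer scale."""
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]


print(square_children(1, 2))
```

In the adaptive structure described above, the offsets would instead depend on the local curve orientation.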
Parallel Monte Carlo Electron and Photon Transport Simulation Code (PMCEPT code)
NASA Astrophysics Data System (ADS)
Kum, Oyeon
2004-11-01
Simulations for customized cancer radiation treatment planning for each patient are very useful for both patient and doctor. These simulations can be used to find the most effective treatment with the least possible dose to the patient. This typical system, the so-called "Doctor by Information Technology", will be useful to provide high quality medical services everywhere. However, the large amount of computing time required by the well-known general purpose Monte Carlo (MC) codes has prevented their use for routine dose distribution calculations in customized radiation treatment planning. The optimal solution to provide an "accurate" dose distribution within an "acceptable" time limit is to develop a parallel simulation algorithm on a Beowulf PC cluster because it is the most accurate, efficient, and economic. I developed a parallel MC electron and photon transport simulation code based on the standard MPI message passing interface. This algorithm solved the main difficulty of the parallel MC simulation (overlapped random number series in the different processors) using multiple random number seeds. The parallel results agreed well with the serial ones. The parallel efficiency approached 100% as was expected.
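The idea of giving each processor its own random number seed can be sketched with a toy Monte Carlo estimate of pi. This is not the PMCEPT code: the workers run one after another as stand-ins for MPI ranks, the seeding rule (base seed plus rank) is an assumption, and distinct seeds alone do not strictly guarantee non-overlapping streams.

```python
import random


def mc_hits(seed: int, samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between workers
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_workers: int, samples_per_worker: int, base_seed: int = 12345) -> float:
    """Combine per-worker tallies, each produced from a different seed."""
    seeds = [base_seed + rank for rank in range(n_workers)]  # hypothetical seeding rule
    total_hits = sum(mc_hits(seed, samples_per_worker) for seed in seeds)
    return 4.0 * total_hits / (n_workers * samples_per_worker)


print(estimate_pi(4, 20000))
```

Because the workers only exchange their final tallies, this kind of calculation needs very little communication, which is consistent with the near-100% efficiency the abstract reports.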
Dependent video coding using a tree representation of pixel dependencies
NASA Astrophysics Data System (ADS)
Amati, Luca; Valenzise, Giuseppe; Ortega, Antonio; Tubaro, Stefano
2011-09-01
Motion-compensated prediction induces a chain of coding dependencies between pixels in video. In principle, an optimal selection of encoding parameters (motion vectors, quantization parameters, coding modes) should take into account the whole temporal horizon of a GOP. However, in practical coding schemes, these choices are made on a frame-by-frame basis, thus with a possible loss of performance. In this paper we describe a tree-based model for pixelwise coding dependencies: each pixel in a frame is the child of a pixel in a previous reference frame. We show that some tree structures are more favorable than others from a rate-distortion perspective, e.g., because they entail a large descendance of pixels which are well predicted from a common ancestor. In those cases, a higher quality has to be assigned to pixels at the top of such trees. We promote the creation of these structures by adding a special discount term to the conventional Lagrangian cost adopted at the encoder. The proposed model can be implemented through a double-pass encoding procedure. Specifically, we devise heuristic cost functions to drive the selection of quantization parameters and of motion vectors, which can be readily implemented into a state-of-the-art H.264/AVC encoder. Our experiments demonstrate that coding efficiency is improved for video sequences with low motion, while there are no apparent gains for more complex motion. We argue that this is due to both the presence of complex encoder features not captured by the model, and to the complexity of the source to be encoded.
Recent developments in DYNSUB: New models, code optimization and parallelization
Daeubler, M.; Trost, N.; Jimenez, J.; Sanchez, V.
2013-07-01
DYNSUB is a high-fidelity coupled code system consisting of the reactor simulator DYN3D and the sub-channel code SUBCHANFLOW. It describes nuclear reactor core behavior with pin-by-pin resolution for both steady-state and transient scenarios. In the course of the coupled code system's active development, super-homogenization (SPH) and generalized equivalence theory (GET) discontinuity factors may be computed with and employed in DYNSUB to compensate pin level homogenization errors. Because of the largely increased numerical problem size for pin-by-pin simulations, DYNSUB has benefited from HPC techniques to improve its numerical performance. DYNSUB's coupling scheme has been structurally revised. Computational bottlenecks have been identified and parallelized for shared memory systems using OpenMP. Comparing the elapsed time for simulating a PWR core with one-eighth symmetry under hot zero power conditions applying the original and the optimized DYNSUB using 8 cores, overall speedup factors greater than 10 have been observed. The corresponding reduction in execution time enables a routine application of DYNSUB to study pin level safety parameters for engineering sized cases in a scientific environment. (authors)
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, the resolution of hyperspectral images has increased greatly, and so has their data size. In order to reduce storage and transmission costs, lossless compression of hyperspectral images has become an important research topic. In recent years, large numbers of algorithms have been proposed to reduce the redundancy between different spectra. Among them, the most classical and extensible algorithm is the Clustered Differential Pulse Code Modulation (C-DPCM) algorithm. This algorithm contains three parts: first, it clusters all spectral lines and trains linear predictors for each band; second, it uses these predictors to predict pixels and obtains the residual image by subtracting the predicted image from the original image; finally, it encodes the residual image. However, the process of calculating predictors is time-consuming. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has been greatly developed. The capacity of GPUs improves rapidly by increasing the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to calculate the predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally get a decent speedup.
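The predict-and-subtract step described above can be sketched in a heavily simplified form. The assumptions are loud ones: a single cluster, a one-tap predictor that estimates each band from the previous band only, and no entropy coding of the residuals. None of this is the authors' implementation, and all names are hypothetical.

```python
def fit_coeff(prev: list[int], cur: list[int]) -> float:
    """Least-squares scale a that minimizes sum((cur - a * prev)^2)."""
    denom = sum(p * p for p in prev)
    return sum(p * c for p, c in zip(prev, cur)) / denom if denom else 0.0


def encode(bands: list[list[int]]):
    """Return per-band predictor coefficients and integer residuals."""
    coeffs, residuals = [], [list(bands[0])]  # the first band is stored as-is
    for prev, cur in zip(bands, bands[1:]):
        a = fit_coeff(prev, cur)
        coeffs.append(a)
        residuals.append([c - round(a * p) for p, c in zip(prev, cur)])
    return coeffs, residuals


def decode(coeffs: list[float], residuals: list[list[int]]) -> list[list[int]]:
    """Rebuild the bands exactly by repeating the same predictions."""
    bands = [list(residuals[0])]
    for a, res in zip(coeffs, residuals[1:]):
        prev = bands[-1]
        bands.append([r + round(a * p) for p, r in zip(prev, res)])
    return bands


bands = [[10, 20, 30, 40], [21, 39, 61, 80], [30, 62, 90, 121]]
coeffs, residuals = encode(bands)
print(residuals[1:])
```

The fits for different bands (and, in the clustered scheme, different clusters) do not depend on each other, which is the kind of independence a GPU implementation can exploit.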
Time-Dependent, Parallel Neutral Particle Transport Code System.
BAKER, RANDAL S.
2009-09-10
Version 00 PARTISN (PARallel, TIme-Dependent SN) is the evolutionary successor to CCC-547/DANTSYS. The PARTISN code package is a modular computer program package designed to solve the time-independent or dependent multigroup discrete ordinates form of the Boltzmann transport equation in several different geometries. The modular construction of the package separates the input processing, the transport equation solving, and the post processing (or edit) functions into distinct code modules: the Input Module, the Solver Module, and the Edit Module, respectively. PARTISN is the evolutionary successor to the DANTSYS(TM) code system package. The Input and Edit Modules in PARTISN are very similar to those in DANTSYS. However, unlike DANTSYS, the Solver Module in PARTISN contains one, two, and three-dimensional solvers in a single module. In addition to the diamond-differencing method, the Solver Module also has Adaptive Weighted Diamond-Differencing (AWDD), Linear Discontinuous (LD), and Exponential Discontinuous (ED) spatial differencing methods. The spatial mesh may consist of either a standard orthogonal mesh or a block adaptive orthogonal mesh. The Solver Module may be run in parallel for two and three dimensional problems. One can now run 1-D problems in parallel using Energy Domain Decomposition (triggered by Block 5 input keyword npeg>0). EDD can also be used in 2-D/3-D with or without our standard Spatial Domain Decomposition. Both the static (fixed source or eigenvalue) and time-dependent forms of the transport equation are solved in forward or adjoint mode. In addition, PARTISN now has a probabilistic mode for Probability of Initiation (static) and Probability of Survival (dynamic) calculations. Vacuum, reflective, periodic, white, or inhomogeneous boundary conditions are solved. General anisotropic scattering and inhomogeneous sources are permitted. PARTISN solves the transport equation on orthogonal (single level or block-structured AMR) grids in 1-D (slab, two
A GPU accelerated Barnes-Hut tree code for FLASH4
NASA Astrophysics Data System (ADS)
Lukat, Gunther; Banerjee, Robi
2016-05-01
We present a GPU accelerated CUDA-C implementation of the Barnes-Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and is therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. We use a MacLaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores with effective AMR to carry out performance tests, also in comparison with previous gravity solvers. Depending on the setup and the GPU/CPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison to the gravity solvers implemented in the FLASH4 code. We find an overall speedup for full simulations of at least a factor of 1.6 and up to a factor of 10.
A Parallel Decoding Algorithm for Short Polar Codes Based on Error Checking and Correcting
Pan, Xiaofei; Pan, Kegang; Ye, Zhan; Gong, Chao
2014-01-01
We propose a parallel decoding algorithm based on error checking and correcting to improve the performance of the short polar codes. In order to enhance the error-correcting capacity of the decoding algorithm, we first derive the error-checking equations generated on the basis of the frozen nodes, and then we introduce the method to check the errors in the input nodes of the decoder by the solutions of these equations. In order to further correct those checked errors, we adopt the method of modifying the probability messages of the error nodes with constant values according to the maximization principle. Due to the existence of multiple solutions of the error-checking equations, we formulate a CRC-aided optimization problem of finding the optimal solution with three different target functions, so as to improve the accuracy of error checking. Besides, in order to increase the throughput of decoding, we use a parallel method based on the decoding tree to calculate probability messages of all the nodes in the decoder. Numerical results show that the proposed decoding algorithm achieves better performance than that of some existing decoding algorithms with the same code length. PMID:25540813
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.
Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael
2017-03-30
Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about 20× better than the fastest sequential algorithm and speed-up goes up to 30-40 on 64 threads.
Parallel Formulations of Tree-Projection Based Sequence Mining Algorithms
2003-01-20
Keywords: frequent sequential patterns, database projection algorithms, data mining, parallel processing ∗This work was supported by NSF CCR-9972519... data mining research. These frequently occurring patterns can be used to find association rules and/or to extract prevalent patterns that exist in the...Distributed Computing (Special Issue on High Performance Data Mining), 2000. [2] R. Agrawal and J.C. Shafer. Parallel mining of association rules. IEEE
NASA Astrophysics Data System (ADS)
Destefano, Anthony; Heerikhuisen, Jacob
2015-04-01
Fully 3D particle simulations can be expensive in both computation and memory, especially when high resolution grid cells are required. The problem becomes further complicated when parallelization is needed. In this work we focus on computational methods to solve these difficulties. Hilbert curves are used to map the 3D particle space to the 1D contiguous memory space. This method of organization allows for minimized cache misses on the GPU as well as a sorted structure that is equivalent to an octal tree data structure. This type of sorted structure is attractive for use in adaptive mesh implementations due to the logarithmic search time. Implementations using the Message Passing Interface (MPI) library and NVIDIA's parallel computing platform CUDA will be compared, as MPI is commonly used on server nodes with many CPUs. We will also compare static grid structures with adaptive mesh structures. The physical test bed will be simulating heavy interstellar atoms interacting with a background plasma, the heliosphere, simulated from a fully consistent coupled MHD/kinetic particle code. It is known that charge exchange is an important factor in space plasmas; specifically, it modifies the structure of the heliosphere itself. We would like to thank the Alabama Supercomputer Authority for the use of their computational resources.
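The mapping from grid coordinates to a position along a Hilbert curve can be sketched in two dimensions, which keeps the example short; the work above uses the 3D analogue. The code follows the widely published iterative bit-manipulation scheme, and the function names and the particle layout are hypothetical.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position along the Hilbert curve of cell (x, y) in an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def sort_by_hilbert(cells: list[tuple[int, int]], n: int) -> list[tuple[int, int]]:
    """Order grid cells (or particles binned to cells) along the curve."""
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))


order = sort_by_hilbert([(x, y) for x in range(4) for y in range(4)], 4)
print(order[:4])
```

Cells that are consecutive in the sorted order are always neighbours in space, which is why storing particles in this order tends to improve memory locality.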
Breakdown of Spatial Parallel Coding in Children's Drawing
ERIC Educational Resources Information Center
De Bruyn, Bart; Davis, Alyson
2005-01-01
When drawing real scenes or copying simple geometric figures young children are highly sensitive to parallel cues and use them effectively. However, this sensitivity can break down in surprisingly simple tasks such as copying a single line where robust directional errors occur despite the presence of parallel cues. Before we can conclude that this…
Description of a parallel, 3D, finite element, hydrodynamics-diffusion code
Milovich, J L; Prasad, M K; Shestakov, A I
1999-04-11
We describe a parallel, 3D, unstructured grid finite element, hydrodynamic diffusion code for inertial confinement fusion (ICF) applications and the ancillary software used to run it. The code system is divided into two entities, a controller and a stand-alone physics code. The code system may reside on different computers; the controller on the user's workstation and the physics code on a supercomputer. The physics code is composed of separate hydrodynamic, equation-of-state, laser energy deposition, heat conduction, and radiation transport packages and is parallelized for distributed memory architectures. For parallelization, a SPMD model is adopted; the domain is decomposed into a disjoint collection of subdomains, one per processing element (PE). The PEs communicate using MPI. The code is used to simulate the hydrodynamic implosion of a spherical bubble.
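The "one disjoint subdomain per processing element" idea can be sketched with a simple contiguous block split of element indices. This is only an illustration under an assumption: an unstructured-grid code would normally rely on a graph or geometric partitioner rather than index ranges, and the function name is hypothetical.

```python
def block_range(n_items: int, n_pe: int, rank: int) -> range:
    """Contiguous slice of item indices owned by processing element `rank`."""
    base, extra = divmod(n_items, n_pe)
    start = rank * base + min(rank, extra)  # the first `extra` ranks get one more item
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


owned = [block_range(10, 4, rank) for rank in range(4)]
print([list(r) for r in owned])
```

In an SPMD program every PE would run the same code and call such a function with its own rank to find out which part of the domain it owns.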
An Optimal Parallel Algorithm for Constructing a Spanning Tree on Circular Permutation Graphs
NASA Astrophysics Data System (ADS)
Honma, Hirotoshi; Honma, Saki; Masuyama, Shigeru
The spanning tree problem is to find a tree that connects all the vertices of a graph G. This problem has many applications, such as electric power systems, computer network design and circuit analysis. Klein and Stein demonstrated that a spanning tree can be found in O(log n) time with O(n + m) processors on the CRCW PRAM. In general, it is known that more efficient parallel algorithms can be developed by restricting classes of graphs. Circular permutation graphs properly contain the set of permutation graphs as a subclass and were first introduced by Rotem and Urrutia. They provided an O(n^2.376) time recognition algorithm. Circular permutation graphs and their models find several applications in VLSI layout. In this paper, we propose an optimal parallel algorithm for constructing a spanning tree on circular permutation graphs. It runs in O(log n) time with O(n / log n) processors on the EREW PRAM.
Application of the hypercube parallel processor to a large-scale moment method code
NASA Technical Reports Server (NTRS)
Manshadi, Farzin; Liewer, Paulet C.; Patterson, Jean E.
1988-01-01
The applicability of a parallel computing architecture to the solution of a large-scale moment-method code is investigated. Specifically, the NEC (Numerical Electromagnetics Code) method-of-moments scattering program is implemented on a hypercube parallel processor. The accuracy and the increase in the speed of execution on this parallel architecture are demonstrated. The results show a very large reduction in execution time for large problems. The great potential of this parallel processor is shown for interactive solution of large NEC problems as well as other moment-method techniques such as the finite-element method.
Implementation of a 3D mixing layer code on parallel computers
Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.
1995-09-01
This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.
Punctured Parallel and Serial Concatenated Convolutional Codes for BPSK/QPSK Channels
NASA Technical Reports Server (NTRS)
Acikel, Omer Fatih
1999-01-01
As available bandwidth for communication applications becomes scarce, bandwidth-efficient modulation and coding schemes become ever more important. Since their discovery in 1993, turbo codes (parallel concatenated convolutional codes) have been the center of attention in the coding community because of their bit error rate performance near the Shannon limit. Serial concatenated convolutional codes have also been shown to be as powerful as turbo codes. In this dissertation, we introduce algorithms for designing bandwidth-efficient rate r = k/(k + 1), k = 2, 3,..., 16, parallel and rate 3/4, 7/8, and 15/16 serial concatenated convolutional codes via puncturing for BPSK/QPSK (Binary Phase Shift Keying/Quadrature Phase Shift Keying) channels. Both parallel and serial concatenated convolutional codes initially have a steep bit error rate versus signal-to-noise ratio slope (called the "cliff region"). However, this steep slope changes to a moderate slope with increasing signal-to-noise ratio, where the slope is characterized by the weight spectrum of the code. The region after the cliff region is called the "error rate floor", which dominates the behavior of these codes at moderate to high signal-to-noise ratios. Our goal is to design high rate parallel and serial concatenated convolutional codes while minimizing the error rate floor effect. The design algorithm includes an interleaver enhancement procedure and finds the polynomial sets (only for parallel concatenated convolutional codes) and the puncturing schemes that achieve the lowest bit error rate performance around the floor for the code rates of interest.
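How puncturing raises the code rate to k/(k + 1) can be sketched with a periodic keep/drop mask. The assumptions here are a rate-1/2 mother code whose output alternates systematic and parity bits, and a mask that keeps every systematic bit plus one parity bit per period. The dissertation's actual puncturing schemes are the result of a search and are not reproduced; all names are hypothetical.

```python
from fractions import Fraction
from itertools import cycle


def puncture(coded_bits: list[int], mask: list[int]) -> list[int]:
    """Drop the coded bits whose position in the repeating mask holds a 0."""
    return [bit for bit, keep in zip(coded_bits, cycle(mask)) if keep]


def punctured_rate(mother_rate: Fraction, mask: list[int]) -> Fraction:
    """Code rate after puncturing: mother rate scaled by period / kept bits."""
    return mother_rate * len(mask) / sum(mask)


def example_mask(k: int) -> list[int]:
    """Hypothetical mask over 2k coded bits: keep all k systematic bits and the first parity bit."""
    mask = []
    for i in range(k):
        mask += [1, 1 if i == 0 else 0]  # [systematic, parity] pair
    return mask


print(punctured_rate(Fraction(1, 2), example_mask(3)))
```

For k = 3 the mask keeps 4 of every 6 coded bits, so the rate rises from 1/2 to 3/4.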
Parallelization of a Tight-Binding Molecular Dynamics Code by Using the Hpf Environment
NASA Astrophysics Data System (ADS)
Celino, M.; Rosato, V.; di Martino, B.
Molecular Dynamics simulations in the Tight-Binding approach allow the study of the ionic and electronic structures of semiconductors. The Tight-Binding codes are characterized by inhomogeneous data distribution and require the repeated diagonalization of a large sparse matrix to compute the whole body of its eigenvalues and eigenvectors. The code parallelization, by using the High Performance Fortran (HPF) environment, and the integration of optimized parallel mathematical routines is described.
Adaptive Mesh Refinement Algorithms for Parallel Unstructured Finite Element Codes
Parsons, I D; Solberg, J M
2006-02-03
This project produced algorithms for and software implementations of adaptive mesh refinement (AMR) methods for solving practical solid and thermal mechanics problems on multiprocessor parallel computers using unstructured finite element meshes. The overall goal is to provide computational solutions that are accurate to some prescribed tolerance, and adaptivity is the correct path toward this goal. These new tools will enable analysts to conduct more reliable simulations at reduced cost, both in terms of analyst and computer time. Previous academic research in the field of adaptive mesh refinement has produced a voluminous literature focused on error estimators and demonstration problems; relatively little progress has been made on producing efficient implementations suitable for large-scale problem solving on state-of-the-art computer systems. Research issues that were considered include: effective error estimators for nonlinear structural mechanics; local meshing at irregular geometric boundaries; and constructing efficient software for parallel computing environments.
Bit-parallel ASCII code artificial numeric keypad
Hale, G.M.
1981-03-01
Seven integrated circuits and a voltage regulator are combined with twelve reed relays to allow the ASCII encoded numerals 0 through 9 and characters "." and R or S to momentarily close switches to an applications device, simulating keypad switch closures. This invention may be used as a parallel TTL (Transistor-Transistor Logic) data acquisition interface to a standard Hewlett-Packard HP-97 Calculator modified with a cable.
Fully Parallel Electrical Impedance Tomography Using Code Division Multiplexing.
Tšoeu, M S; Inggs, M R
2016-06-01
Electrical Impedance Tomography (EIT) has been dominated by the use of Time Division Multiplexing (TDM) and Frequency Division Multiplexing (FDM) as methods of achieving orthogonal injection of excitation signals. Code Division Multiplexing (CDM), presented in this paper, is an alternative that eliminates temporal data inconsistencies of TDM for fast changing systems. Furthermore, this approach eliminates data inconsistencies that arise in FDM when frequency bands of current injecting electrodes are chosen over frequencies that have large changes in the imaged object's impedance. To the authors' knowledge, no fully functional wideband system or simulation platform using simultaneous injection of Gold code currents has been reported. In this paper, we formulate, simulate and develop a fully functional pseudo-random (Gold) code driven EIT system with 15 excitation currents and 16 separate voltage measurement electrodes. In this work we verify the use of CDM as a multiplexing modality in simultaneous injection EIT, using a prototype system with an overall bandwidth of 15 kHz and an attainable speed of 462 frames/s using codes with a period of 31 chips. Simulations and experiments are performed using the Electrical Impedance and Diffuse Optics Reconstruction Software (EIDORS). We also propose the use of image processing on reconstructed images to establish their quality quantitatively without access to raw reconstruction data. The results of this study show that CDM can be successfully used in EIT, and gives results of similar visual quality to TDM and FDM. Achieved performance shows an average position error of 3.5% and a size error of 6.2%.
NASA Technical Reports Server (NTRS)
OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
1998-01-01
This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
PLIMPTON,STEVEN J.; SEIDEL,DAVID B.; PASIK,MICHAEL F.; COATS,REBECCA S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER.
BHARDWAJ, MANLJ K.; REESE,GARTH M.; DRIESSEN,BRIAN; ALVIN,KENNETH F.; DAY,DAVID M.
2000-04-06
As computational needs for structural finite element analysis increase, a robust implicit structural dynamics code is needed which can handle millions of degrees of freedom in the model and produce results with quick turnaround time. A parallel code is needed to avoid limitations of serial platforms. Salinas is an implicit structural dynamics code specifically designed for massively parallel platforms. It computes the structural response of very large complex structures and provides solutions faster than any existing serial machine. This paper gives the current status of Salinas and uses demonstration problems to show Salinas' performance.
Nyx: A MASSIVELY PARALLEL AMR CODE FOR COMPUTATIONAL COSMOLOGY
Almgren, Ann S.; Bell, John B.; Lijewski, Mike J.; Lukic, Zarija; Van Andel, Ethan
2013-03-01
We present a new N-body and gas dynamics code, called Nyx, for large-scale cosmological simulations. Nyx follows the temporal evolution of a system of discrete dark matter particles gravitationally coupled to an inviscid ideal fluid in an expanding universe. The gas is advanced in an Eulerian framework with block-structured adaptive mesh refinement; a particle-mesh scheme using the same grid hierarchy is used to solve for self-gravity and advance the particles. Computational results demonstrating the validation of Nyx on standard cosmological test problems, and the scaling behavior of Nyx to 50,000 cores, are presented.
ANNarchy: a code generation approach to neural simulations on parallel hardware
Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.
2015-01-01
Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which makes it easy to define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into efficient C++ code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
A parallelization approach to the COBRA-TF thermal-hydraulic subchannel code
NASA Astrophysics Data System (ADS)
Ramos, Enrique; Abarca, Agustín; Roman, Jose E.; Miró, Rafael
2014-06-01
In order to reduce the response time when simulating large reactors in detail, we have developed a parallel version of the thermal-hydraulic subchannel code COBRA-TF, with standard message passing technology (MPI). The parallelization is oriented to reactor cells, so it is best suited for models consisting of many cells. The generation of the Jacobian is parallelized, in such a way that each processor is in charge of generating the data associated to a subset of cells. Also, the solution of the linear system of equations is done in parallel, using the PETSc toolkit.
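The cell-oriented distribution described above can be illustrated with a minimal sketch. It assumes a simple contiguous block partition of cell indices across processes; the abstract does not say how COBRA-TF actually assigns cells, and the name `partition_cells` is hypothetical.

```python
def partition_cells(n_cells, n_procs):
    """Split cell indices 0..n_cells-1 into n_procs contiguous, near-equal ranges.

    Hypothetical helper: each range is the subset of cells for which one
    process would generate its share of the Jacobian data.
    """
    base, extra = divmod(n_cells, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional cell each.
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for rank, cells in enumerate(partition_cells(10, 4)):
        print(rank, list(cells))
```

With many more cells than processes the ranges differ in size by at most one cell, which is consistent with the remark that the approach suits models consisting of many cells.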
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
Azad, Ariful; Buluç, Aydın; Pothen, Alex
2016-03-24
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.
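As background for the augmenting-path idea, the following is a minimal serial sketch of maximum cardinality bipartite matching using one BFS per unmatched left vertex. It is the plain single-source variant, not the multiple-source, tree-grafting, or direction-optimizing algorithm of the paper, and the function name and graph representation are assumptions made for illustration.

```python
from collections import deque


def max_bipartite_matching(adj, n_right):
    """adj[u] lists the right vertices adjacent to left vertex u.

    Returns (matching size, match_left), where match_left[u] is the right
    vertex matched to u, or -1 if u is unmatched.
    """
    n_left = len(adj)
    match_left = [-1] * n_left
    match_right = [-1] * n_right

    for root in range(n_left):
        # BFS over alternating paths that start at the unmatched vertex `root`.
        # parent[w] = (u, v): left vertex w was reached from u via right vertex v,
        # where v is currently matched to w.
        parent = {root: None}
        visited_right = set()
        queue = deque([root])
        end = None
        while queue and end is None:
            u = queue.popleft()
            for v in adj[u]:
                if v in visited_right:
                    continue
                visited_right.add(v)
                w = match_right[v]
                if w == -1:
                    end = (u, v)  # free right vertex: augmenting path found
                    break
                parent[w] = (u, v)
                queue.append(w)
        if end is None:
            continue
        # Flip matched and unmatched edges along the path back to the root.
        u, v = end
        while True:
            match_left[u] = v
            match_right[v] = u
            if parent[u] is None:
                break
            u, v = parent[u]

    size = sum(1 for v in match_left if v != -1)
    return size, match_left


if __name__ == "__main__":
    print(max_bipartite_matching([[0, 1], [0], [1, 2]], 3))
```

Each search here starts from a single source, so a failed search tree can simply be discarded; the abstract explains that multiple-source searches lose this property, which is the redundancy tree-grafting is meant to address.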
Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators
Qiang, J.; Ryne, R.D.; Habib, S.; Decyk, V.
1999-11-13
In this paper, we present an object-oriented three-dimensional parallel particle-in-cell code for beam dynamics simulation in linear accelerators. A two-dimensional parallel domain decomposition approach is employed within a message passing programming paradigm along with a dynamic load balancing. Implementing object-oriented software design provides the code with better maintainability, reusability, and extensibility compared with conventional structure based code. This also helps to encapsulate the details of communications syntax. Performance tests on SGI/Cray T3E-900 and SGI Origin 2000 machines show good scalability of the object-oriented code. Some important features of this code also include employing symplectic integration with linear maps of external focusing elements and using z as the independent variable, typical in accelerators. A successful application was done to simulate beam transport through three superconducting sections in the APT linac design.
Image coding using parallel implementations of the embedded zerotree wavelet algorithm
NASA Astrophysics Data System (ADS)
Creusere, Charles D.
1996-03-01
We explore here the implementation of Shapiro's embedded zerotree wavelet (EZW) image coding algorithms on an array of parallel processors. To this end, we first consider the problem of parallelizing the basic wavelet transform, discussing past work in this area and the compatibility of that work with the zerotree coding process. From this discussion, we present a parallel partitioning of the transform which is computationally efficient and which allows the wavelet coefficients to be coded with little or no additional inter-processor communication. The key to achieving low data dependence between the processors is to ensure that each processor contains only entire zerotrees of wavelet coefficients after the decomposition is complete. We next quantify the rate-distortion tradeoffs associated with different levels of parallelization for a few variations of the basic coding algorithm. Studying these results, we conclude that the quality of the coder decreases as the number of parallel processors used to implement it increases. Noting that the performance of the parallel algorithm might be unacceptably poor for large processor arrays, we also develop an alternate algorithm which always achieves the same rate-distortion performance as the original sequential EZW algorithm at the cost of higher complexity and reduced scalability.
Parallelization of the MAAP-A code neutronics/thermal hydraulics coupling
Froehle, P.H.; Wei, T.Y.C.; Weber, D.P.; Henry, R.E.
1998-12-31
A major new feature, one-dimensional space-time kinetics, has been added to a developmental version of the MAAP code through the introduction of the DIF3D-K module. This code is referred to as MAAP-A. To reduce the overall job time required, a capability has been provided to run the MAAP-A code in parallel. The parallel version of MAAP-A utilizes two machines running in parallel, with the DIF3D-K module executing on one machine and the rest of the MAAP-A code executing on the other machine. Timing results obtained during the development of the capability indicate that reductions in time of 30-40% are possible. The parallel version can be run on two SPARC 20 (SUN OS 5.5) workstations connected via Ethernet. An implementation of MPI (the Message Passing Interface standard) must be installed on the machines. If necessary, the parallel version can also be run on only one machine. The results obtained in this one-machine mode are identical to those obtained from the serial version of the code.
PARALLEL IMPLEMENTATION OF THE TOPAZ OPACITY CODE: ISSUES IN LOAD-BALANCING
Sonnad, V; Iglesias, C A
2008-05-12
The TOPAZ opacity code explicitly includes configuration term structure in the calculation of bound-bound radiative transitions. This approach involves myriad spectral lines and requires the large computational capabilities of parallel processing computers. It is important, however, to make use of these resources efficiently. For example, an increase in the number of processors should yield a comparable reduction in computational time. This proportional 'speedup' indicates that very large problems can be addressed with massively parallel computers. Opacity codes can readily take advantage of parallel architecture since many intermediate calculations are independent. On the other hand, since the different tasks entail significantly disparate computational effort, load-balancing issues emerge so that parallel efficiency does not occur naturally. Several schemes to distribute the labor among processors are discussed.
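To make the load-balancing issue concrete, here is a minimal sketch of one common static heuristic, longest-processing-time-first (LPT) assignment. The abstract does not name the schemes the authors discuss, so this is an illustrative assumption rather than the TOPAZ method; `lpt_schedule` and the task costs are hypothetical.

```python
import heapq


def lpt_schedule(task_costs, n_procs):
    """Assign tasks to processors, largest cost first, always to the least-loaded one.

    Returns (assignment, loads): assignment[p] lists the task indices given to
    processor p, and loads[p] is the total cost assigned to it.
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    order = sorted(enumerate(task_costs), key=lambda tc: tc[1], reverse=True)
    for task, cost in order:
        load, p = heapq.heappop(heap)  # least-loaded processor so far
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    loads = [0.0] * n_procs
    for load, p in heap:
        loads[p] = load
    return assignment, loads


if __name__ == "__main__":
    assignment, loads = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
    print(assignment, loads)
```

Parallel efficiency in this static picture is limited by the most heavily loaded processor, so the quantity to watch is `max(loads)` relative to `sum(loads) / n_procs`.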
A portable, parallel, object-oriented Monte Carlo neutron transport code in C++
Lee, S.R.; Cummings, J.C.; Nolen, S.D.
1997-05-01
We have developed a multi-group Monte Carlo neutron transport code using C++ and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α-eigenvalues and is portable to and runs parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities of MC++ are discussed, along with physics and performance results on a variety of hardware, including all Accelerated Strategic Computing Initiative (ASCI) hardware. Current parallel performance indicates the ability to compute α-eigenvalues in seconds to minutes rather than hours to days. Future plans and the implementation of a general transport physics framework are also discussed.
Parallelization of the LEMan Code with MPI and OpenMP
NASA Astrophysics Data System (ADS)
Mellet, N.; Cooper, W. A.
The low-frequency wave propagation code LEMan has been parallelized. Due to large memory requirement but fast computation with the cold model, the parallelization is limited to a low number of processors. The specific block-tridiagonal structure of the matrix to be solved has been taken into account for the MPI implementation. It has then been compared with the performance of OpenMP in order to determine the optimal method depending on the case studied.
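For readers unfamiliar with tridiagonal elimination, the sketch below shows the serial Thomas algorithm for a scalar tridiagonal system. It is only an illustration of the matrix structure mentioned above: in a block-tridiagonal solver each scalar division would become a small dense solve, and nothing here reflects the actual LEMan implementation or its MPI layout. It also assumes no pivoting is needed (for example, a diagonally dominant matrix).

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] the diagonal entry, and upper[i] the super-diagonal entry of
    row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # Rows: [2 -1 0], [-1 2 -1], [0 -1 2]; the exact solution is [1, 2, 3].
    print(solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                            [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0]))
```

The forward sweep is inherently sequential along the block index, which is one reason a structure-aware parallelization has to be designed around the blocks themselves.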
Omega3P: A Parallel Finite-Element Eigenmode Analysis Code for Accelerator Cavities
Lee, Lie-Quan; Li, Zenghai; Ng, Cho; Ko, Kwok; /SLAC
2009-03-04
Omega3P is a parallel eigenmode calculation code for accelerator cavities in frequency domain analysis using finite-element methods. In this report, we will present detailed finite-element formulations and resulting eigenvalue problems for lossless cavities, cavities with lossy materials, cavities with imperfectly conducting surfaces, and cavities with waveguide coupling. We will discuss the parallel algorithms for solving those eigenvalue problems and demonstrate modeling of accelerator cavities through different examples.
The CERBERUS code: Experiments with parallel processing using RELAP5/MOD3
Makowitz, H.
1989-01-01
CERBERUS, a six-equation parallel thermal-hydraulic system simulation code, is being developed at the Idaho National Engineering Laboratory (INEL). CERBERUS Ver.00 performs parallel computations only for the heat transfer model. It is projected that CERBERUS Ver.01 will have a parallel heat transfer and hydraulic module, excluding the matrix solver, and CERBERUS Ver.02 will contain Ver.01 plus the solver. Three implementations of the CERBERUS Ver.00 code with constructs of varying overhead have been developed using a META language. These implementations are under study on shared-memory Cray-like computer architectures. Results for the hybrid code version, which utilizes all three construct sets simultaneously (i.e., CRAY AUTO, MICRO, and MULTI TASKING) on 2- and 8-CPU Cray machines, indicate the importance of load balancing for overhead reduction, and indicate that greater speedup factors may be achievable than previously believed with a RELAP-based parallel code. Extrapolations based on Y-MP/832 overhead measurements indicate that a speedup factor of > 10 may be obtainable with the CERBERUS Ver.02 code on a 16-CPU machine.
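One way to put a speedup target such as "> 10 on a 16-CPU machine" in perspective is Amdahl's law. The sketch below is an idealized model chosen for illustration: it ignores the tasking overheads that the abstract emphasizes, and it is not the extrapolation method used by the author.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Ideal speedup when a fraction of the serial run time parallelizes perfectly."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)


def required_parallel_fraction(target_speedup, n_procs):
    """Invert Amdahl's law: 1/S = 1 - f * (1 - 1/p), solved for f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_procs)


if __name__ == "__main__":
    f = required_parallel_fraction(10.0, 16)
    print(f, amdahl_speedup(f, 16))
```

Under this model, a speedup of 10 on 16 processors requires about 96% of the serial run time to be parallelized, which suggests why bringing the hydraulic module and the matrix solver into the parallel region matters for the later versions.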
Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1
NASA Technical Reports Server (NTRS)
Wright, Michael J.; White, Todd; Mangini, Nancy
2009-01-01
Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamic (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
Assessing the performance of a parallel MATLAB-based 3D convection code
NASA Astrophysics Data System (ADS)
Kirkpatrick, G. J.; Hasenclever, J.; Phipps Morgan, J.; Shi, C.
2008-12-01
We are currently building 2D and 3D MATLAB-based parallel finite element codes for mantle convection and melting. The codes use the MATLAB implementation of core MPI commands (e.g., Send, Receive, Broadcast) for message passing between computational subdomains. We have found that code development and algorithm testing are much faster in MATLAB than in our previous work coding in C or FORTRAN; this code was built from scratch with only 12 man-months of effort. The one extra cost relative to C coding on a Beowulf cluster is the cost of the parallel MATLAB license for a cluster with more than 4 cores. Here we present some preliminary results on the efficiency of MPI messaging in MATLAB on a small 4-machine, 16-core, 32 GB RAM cluster based on Intel Q6600 processors. Our code implements fully parallelized preconditioned conjugate gradients with a multigrid preconditioner. Our parallel viscous flow solver is currently 20% slower than the direct-solve MILAMIN MATLAB viscous flow solver for a 1,000,000 DOF problem on a single core in 2D. We have tested both continuous and discontinuous pressure formulations. We test with various configurations of network hardware, CPU speeds, and memory using our own and MATLAB's built-in cluster profiler. So far we have only explored relatively small (up to 1.6 GB RAM) test problems. We find that with our current code and Intel memory controller bandwidth limitations, 4 cores on one machine deliver only ~2.3 times the performance of 1 core. Even for these small problems the code runs faster with message passing between 4 machines with one core each than on 1 machine with 4 cores and internal messaging (1.29x slower), or on 1 core (2.15x slower). It surprised us that for 2D ~1 GB-sized problems with only 3 multigrid levels, the direct solve on the coarsest mesh consumes comparable time to the iterative solve on the finest mesh, a penalty that is greatly reduced either by using a 4th multigrid level or by using an iterative solve at the coarsest grid level. We plan to
NASA Technical Reports Server (NTRS)
Campbell, David; Wysong, Ingrid; Kaplan, Carolyn; Mott, David; Wadsworth, Dean; VanGilder, Douglas
2000-01-01
An AFRL/NRL team has recently been selected to develop a scalable, parallel, reacting, multidimensional (SUPREM) Direct Simulation Monte Carlo (DSMC) code for the DoD user community under the High Performance Computing Modernization Office (HPCMO) Common High Performance Computing Software Support Initiative (CHSSI). This paper will introduce the JANNAF Exhaust Plume community to this three-year development effort and present the overall goals, schedule, and current status of this new code.
Code Optimization and Parallelization on the Origins: Looking from Users' Perspective
NASA Technical Reports Server (NTRS)
Chang, Yan-Tyng Sherry; Thigpen, William W. (Technical Monitor)
2002-01-01
Parallel machines are becoming the main compute engines for high performance computing. Despite their increasing popularity, it is still a challenge for most users to learn the basic techniques to optimize/parallelize their codes on such platforms. In this paper, we present some experiences on learning these techniques for the Origin systems at the NASA Advanced Supercomputing Division. Emphasis of this paper will be on a few essential issues (with examples) that general users should master when they work with the Origins as well as other parallel systems.
CODE BLUE: Three dimensional massively-parallel simulation of multi-scale configurations
NASA Astrophysics Data System (ADS)
Juric, Damir; Kahouadji, Lyes; Chergui, Jalel; Shin, Seungwon; Craster, Richard; Matar, Omar
2016-11-01
We present recent progress on BLUE, a solver for massively parallel simulations of fully three-dimensional multiphase flows which runs on a variety of computer architectures from laptops to supercomputers and on 131072 threads or more (limited only by the availability to us of more threads). The code is wholly written in Fortran 2003 and uses a domain decomposition strategy for parallelization with MPI. The fluid interface solver is based on a parallel implementation of a hybrid Front Tracking/Level Set method designed to handle highly deforming interfaces with complex topology changes. We developed parallel GMRES and multigrid iterative solvers suited to the linear systems arising from the implicit solution for the fluid velocities and pressure in the presence of strong density and viscosity discontinuities across fluid phases. Particular attention is drawn to the details and performance of the parallel Multigrid solver. EPSRC UK Programme Grant MEMPHIS (EP/K003976/1).
Understanding Performance of Parallel Scientific Simulation Codes using Open|SpeedShop
Ghosh, K K
2011-11-07
Conclusions of this presentation are: (1) Open|SpeedShop (OSS) is convenient to use for large, parallel, scientific simulation codes; (2) large codes benefit from uninstrumented execution; (3) many experiments can be run in a short time, although multiple runs might be needed, e.g. usertime for caller-callee and hwcsamp for HW counters; (4) a decent idea of a code's performance is easily obtained; (5) statistical sampling calls for a decent number of samples; and (6) HWC data is very useful for micro-analysis but can be tricky to analyze.
Recent development for the ITS code system: Parallel processing and visualization
Fan, W.C.; Turner, C.D.; Halbleib, J.A. Sr.; Kensek, R.P.
1996-03-01
A brief overview is given for two software developments related to the ITS code system. These developments provide parallel processing and visualization capabilities and thus allow users to perform ITS calculations more efficiently. Timing results and a graphical example are presented to demonstrate these capabilities.
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code
Earth Sciences Division; Zhang, Keni; Wu, Yu-Shu; Pruess, Karsten
2008-05-27
TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one-, two-, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on TOUGH2 Version 1.4 with the EOS3, EOS9, and T2R3D modules, software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4. This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version of the TOUGH2 code. The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Poole, G.; Heroux, M.
1994-12-31
This paper will focus on recent work in two widely used industrial applications codes with iterative methods. The ANSYS program, a general purpose finite element code widely used in structural analysis applications, has now added an iterative solver option. Some results are given from real applications comparing performance with the traditional parallel/vector frontal solver used in ANSYS. Discussion of the applicability of iterative solvers as a general purpose solver will include the topics of robustness, as well as memory requirements and CPU performance. The FIDAP program is a widely used CFD code which uses iterative solvers routinely. A brief description of preconditioners used and some performance enhancements for CRAY parallel/vector systems is given. The solution of large-scale applications in structures and CFD includes examples from industry problems solved on CRAY systems.
Parallel Grand Canonical Monte Carlo (ParaGrandMC) Simulation Code
NASA Technical Reports Server (NTRS)
Yamakov, Vesselin I.
2016-01-01
This report provides an overview of the Parallel Grand Canonical Monte Carlo (ParaGrandMC) simulation code. This is a highly scalable parallel FORTRAN code for simulating the thermodynamic evolution of metal alloy systems at the atomic level, and predicting the thermodynamic state, phase diagram, chemical composition and mechanical properties. The code is designed to simulate multi-component alloy systems, predict solid-state phase transformations such as austenite-martensite transformations, precipitate formation, recrystallization, capillary effects at interfaces, surface absorption, etc., which can aid the design of novel metallic alloys. While the software is mainly tailored for modeling metal alloys, it can also be used for other types of solid-state systems, and to some degree for liquid or gaseous systems, including multiphase systems forming solid-liquid-gas interfaces.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].
Furuta, Takuya; Sato, Tatsuhiko
2015-01-01
Time-consuming Monte Carlo dose calculations have become feasible owing to advances in computer technology. However, the recent gains are due to the emergence of multi-core high-performance computers, so parallel computing is key to achieving good software performance. The Monte Carlo simulation code PHITS provides two parallel computing functions: distributed-memory parallelization using the Message Passing Interface (MPI) protocol and shared-memory parallelization using Open Multi-Processing (OpenMP) directives. Users can choose between the two functions according to their needs. This paper explains the two functions along with their advantages and disadvantages. Some test applications are also provided to show their performance on a typical multi-core high-performance workstation.
NASA Astrophysics Data System (ADS)
Moon, Hongsik
What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited from the increased computing power in the same way that increases in clock speed had helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit; it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization, and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared using benchmark software, and the metric was FLoating-point Operations Per Second (FLOPS), which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore systems? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to the type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPS, and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster, consisting of 800 MHz Intel Pentium III processors, shows an almost linear speedup up to 32 processors for simulating 1 x 10^8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability), which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors, on increasing the problem size up to 8 x 10^8 histories. For a smaller number of histories (1 x 10^8) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10^8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture, Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
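The speedup and efficiency figures quoted above follow the usual definitions, speedup S = T_1 / T_p and efficiency E = S / p. The sketch below computes them and adds a toy timing model with a fixed communication cost; the model and all timing values are hypothetical illustrations, not measurements from the DPM study.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (speedup, efficiency) from serial and parallel wall-clock times."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_procs


def modeled_parallel_time(t_serial, n_procs, t_comm):
    """Toy model (an assumption): perfectly divisible work plus a fixed communication cost."""
    return t_serial / n_procs + t_comm


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to show the trend.
    for t_serial in (240.0, 1920.0):
        t_par = modeled_parallel_time(t_serial, 24, 2.0)
        print(t_serial, speedup_and_efficiency(t_serial, t_par, 24))
```

In this toy model the fixed communication term is a larger share of the run time for the smaller problem, so efficiency rises as the number of histories grows, which matches the qualitative trend reported for the Athlon cluster.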
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Rizk, Yehia M.
1999-01-01
The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each
Parallelization issues of a code for physically-based simulation of fabrics
NASA Astrophysics Data System (ADS)
Romero, Sergio; Gutiérrez, Eladio; Romero, Luis F.; Plata, Oscar; Zapata, Emilio L.
2004-10-01
The simulation of fabrics, clothes, and flexible materials is an essential topic in computer animation of realistic virtual humans and dynamic sceneries. New emerging technologies, such as interactive digital TV and multimedia products, make it necessary to develop powerful tools for real-time simulation. Parallelism is one such tool. When analyzing fabric simulations computationally, we found that these codes belong to the complex class of irregular applications. Codes of this kind frequently include reduction operations in their core, so that an important fraction of the computational time is spent on such operations. In fabric simulators these operations appear when evaluating forces, giving rise to the equation system to be solved. For this reason, this paper discusses only this phase of the simulation. This paper analyzes and evaluates different irregular reduction parallelization techniques on ccNUMA shared memory machines, applied to a real, physically-based fabric simulator we have developed. Several issues are taken into account in order to achieve high code performance, such as exploitation of data access locality and parallelism, as well as careful use of memory resources (memory overhead). In this paper we use the concept of data affinity to develop various efficient algorithms for reduction parallelization exploiting data locality.
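To make the term "irregular reduction" concrete, here is a minimal sketch of force accumulation over an edge list, together with one generic parallelization strategy (array replication, where each worker fills a private copy that is summed afterwards). The 1D spring model, the function names, and the choice of replication are illustrative assumptions; the abstract does not say which techniques the authors evaluated, and the workers run sequentially here.

```python
def spring_forces_serial(n_nodes, edges, x, k=1.0):
    """Accumulate 1D spring forces on nodes from an edge list (hypothetical toy model)."""
    f = [0.0] * n_nodes
    for a, b in edges:
        s = k * (x[b] - x[a])
        f[a] += s  # writes go through the index arrays a, b:
        f[b] -= s  # this indirect update is the irregular reduction
    return f


def spring_forces_replicated(n_nodes, edges, x, n_workers=2, k=1.0):
    """Array replication: each worker fills a private force array, then the copies are summed."""
    chunks = [edges[w::n_workers] for w in range(n_workers)]
    # Each chunk could be handled by its own thread; the loop below runs them one after another.
    partial = [spring_forces_serial(n_nodes, chunk, x, k) for chunk in chunks]
    return [sum(p[i] for p in partial) for i in range(n_nodes)]


edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
x = [0.0, 1.0, 3.0, 6.0]
serial = spring_forces_serial(4, edges, x)
replicated = spring_forces_replicated(4, edges, x, n_workers=3)
```

Replication avoids write conflicts but costs one full array per worker, which is the kind of memory overhead the abstract mentions as a concern.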
Performance of a parallel code for the Euler equations on hypercube computers
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Chan, Tony F.; Jesperson, Dennis C.; Tuminaro, Raymond S.
1990-01-01
The performance of hypercubes was evaluated on a computational fluid dynamics problem, and the parallel environment issues that must be addressed were considered, such as algorithm changes, implementation choices, programming effort, and programming environment. The evaluation focuses on a widely used fluid dynamics code, FLO52, which solves the two-dimensional steady Euler equations describing flow around an airfoil. The code development experience is described, including interacting with the operating system, utilizing the message-passing communication system, and code modifications necessary to increase parallel efficiency. Results from two hypercube parallel computers (a 16-node iPSC/2, and a 512-node NCUBE/ten) are discussed and compared. In addition, a mathematical model of the execution time was developed as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained and is used to explore the performance of the code in interesting but yet physically realizable regions of the parameter space. Based on this model, predictions about future hypercubes are made.
A visual parallel-BCI speller based on the time-frequency coding strategy
NASA Astrophysics Data System (ADS)
Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong
2014-04-01
Objective. Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper aims to develop a visual parallel-BCI speller system based on the time-frequency coding strategy, in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. Approach. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, so that each character had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character by jointly using canonical correlation analysis and stepwise linear discriminant analysis. Main results. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min^-1, with an average of 54.0 bit min^-1 and 43.0 bit min^-1 in the three rounds and five rounds, respectively. Significance. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.
NASA Astrophysics Data System (ADS)
Sandalski, Stou
Smooth particle hydrodynamics is an efficient method for modeling the dynamics of fluids. It is commonly used to simulate astrophysical processes such as binary mergers. We present a newly developed GPU accelerated smooth particle hydrodynamics code for astrophysical simulations. The code is named
Software tools for developing parallel applications. Part 1: Code development and debugging
Brown, J.; Geist, A.; Pancake, C.; Rover, D.
1997-04-01
Developing an application for parallel computers can be a lengthy and frustrating process, making it a perfect candidate for software tool support. Yet application programmers are often the last to hear about new tools emerging from R&D efforts. This paper provides an overview of two focuses of tool support: code development and debugging. Each is discussed in terms of the programmer needs addressed, the extent to which representative current tools meet those needs, and what new levels of tool support are important if parallel computing is to become more widespread.
Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement
NASA Astrophysics Data System (ADS)
Yan, Yonghong; Zhao, Jisheng; Guo, Yi; Sarkar, Vivek
Modern computer systems feature multiple homogeneous or heterogeneous computing units with deep memory hierarchies, and expect a high degree of thread-level parallelism from the software. Exploitation of data locality is critical to achieving scalable parallelism, but adds a significant dimension of complexity to performance optimization of parallel programs. This is especially true for programming models where locality is implicit and opaque to programmers. In this paper, we introduce the hierarchical place tree (HPT) model as a portable abstraction for task parallelism and data movement. The HPT model supports co-allocation of data and computation at multiple levels of a memory hierarchy. It can be viewed as a generalization of concepts from the Sequoia and X10 programming models, resulting in capabilities that are not supported by either. Compared to Sequoia, HPT supports three kinds of data movement in a memory hierarchy rather than just explicit data transfer between adjacent levels, as well as dynamic task scheduling rather than static task assignment. Compared to X10, HPT provides a hierarchical notion of places for both computation and data mapping. We describe our work-in-progress on implementing the HPT model in the Habanero-Java (HJ) compiler and runtime system. Preliminary results on general-purpose multicore processors and GPU accelerators indicate that the HPT model can be a promising portable abstraction for future multicore processors.
Parallel level-set methods on adaptive tree-based grids
NASA Astrophysics Data System (ADS)
Mirzadeh, Mohammad; Guittet, Arthur; Burstedde, Carsten; Gibou, Frederic
2016-10-01
We present scalable algorithms for the level-set method on dynamic, adaptive Quadtree and Octree Cartesian grids. The algorithms are fully parallelized and implemented using the MPI standard and the open-source p4est library. We solve the level set equation with a semi-Lagrangian method which, similar to its serial implementation, is free of any time-step restrictions. This is achieved by introducing a scalable global interpolation scheme on adaptive tree-based grids. Moreover, we present a simple parallel reinitialization scheme using the pseudo-time transient formulation. Both parallel algorithms scale on the Stampede supercomputer, where we are currently using up to 4096 CPU cores, the limit of our current account. Finally, a relevant application of the algorithms is presented in modeling a crystallization phenomenon by solving a Stefan problem, illustrating a level of detail that would be impossible to achieve without a parallel adaptive strategy. We believe that the algorithms presented in this article will be of interest and useful to researchers working with the level-set framework and modeling multi-scale physics in general.
NASA Astrophysics Data System (ADS)
Li, Xujing; Zheng, Weiying
2016-10-01
A new parallel code based on the discontinuous Galerkin (DG) method for hyperbolic conservation laws on three-dimensional unstructured meshes has recently been developed. This code can be used for simulations of MHD equations, which are very important in magnetically confined plasma research. The main challenges in MHD simulations in fusion include the complex geometry of the configurations, such as plasma in tokamaks, the possibly discontinuous solutions, and large-scale computing. Our newly developed code is based on three-dimensional unstructured meshes, i.e. tetrahedra. This makes the code flexible to arbitrary geometries. Second-order polynomials are used on each element and an HWENO-type limiter is applied. The accuracy tests show that our scheme reaches the desired third-order accuracy, and the nonlinear shock test demonstrates that our code can capture sharp shock transitions. Moreover, one of the advantages of DG compared with the classical finite element methods is that the matrices solved are localized on each element, making it easy to parallelize. Several simulations, including the kink instabilities in toroidal geometry, will be presented here. Chinese National Magnetic Confinement Fusion Science Program 2015GB110003.
Casewell, Nicholas R; Wagstaff, Simon C; Harrison, Robert A; Wüster, Wolfgang
2011-03-01
The proliferation of gene data from multiple loci of large multigene families has been greatly facilitated by considerable recent advances in sequence generation. The evolution of such gene families, which often undergo complex histories and different rates of change, combined with increases in sequence data, pose complex problems for traditional phylogenetic analyses, and in particular, those that aim to successfully recover species relationships from gene trees. Here, we implement gene tree parsimony analyses on multicopy gene family data sets of snake venom proteins for two separate groups of taxa, incorporating Bayesian posterior distributions as a rigorous strategy to account for the uncertainty present in gene trees. Gene tree parsimony largely failed to infer species trees congruent with each other or with species phylogenies derived from mitochondrial and single-copy nuclear sequences. Analysis of four toxin gene families from a large expressed sequence tag data set from the viper genus Echis failed to produce a consistent topology, and reanalysis of a previously published gene tree parsimony data set, from the family Elapidae, suggested that species tree topologies were predominantly unsupported. We suggest that gene tree parsimony failure in the family Elapidae is likely the result of unequal and/or incomplete sampling of paralogous genes and demonstrate that multiple parallel gene losses are likely responsible for the significant species tree conflict observed in the genus Echis. These results highlight the potential for gene tree parsimony analyses to be undermined by rapidly evolving multilocus gene families under strong natural selection.
NASA Technical Reports Server (NTRS)
Otto, John C.
1993-01-01
This paper describes the parallel version of the three-dimensional, chemically reacting, computational fluid dynamics (CFD) code, SPARK. This work was performed on the Intel iPSC/860-based parallel computers. The SPARK code utilizes relatively simple explicit numerical algorithms, but models complex chemical reactions. The code solves the equations over a regular structured mesh, so a simple domain decomposition is used to assign work to the individual processors. The explicit nature of the algorithm, combined with the computational intensity of the chemistry calculations, results in a very low communication-to-computation ratio when compared to typical CFD codes. The efficiency of the parallel code is examined and shown to be about 65 percent when the problem size is scaled with the number of processors. Two low-angle wall-jet injection cases are solved to demonstrate the capability of the parallel code for solving large problems efficiently.
Quinlan, D; Barany, G; Panas, T
2007-08-30
Many forms of security analysis on large scale applications can be substantially automated but the size and complexity can exceed the time and memory available on conventional desktop computers. Most commercial tools are understandably focused on such conventional desktop resources. This paper presents research work on the parallelization of security analysis of both source code and binaries within our Compass tool, which is implemented using the ROSE source-to-source open compiler infrastructure. We have focused on both shared and distributed memory parallelization of the evaluation of rules implemented as checkers for a wide range of secure programming rules, applicable to desktop machines, networks of workstations and dedicated clusters. While Compass as a tool focuses on source code analysis and reports violations of an extensible set of rules, the binary analysis work uses the exact same infrastructure but is less well developed into an equivalent final tool.
Spatial Parallelism of a 3D Finite Difference, Velocity-Stress Elastic Wave Propagation Code
MINKOFF,SUSAN E.
1999-12-09
Finite difference methods for solving the wave equation more accurately capture the physics of waves propagating through the earth than asymptotic solution methods. Unfortunately, finite difference simulations for 3D elastic wave propagation are expensive. We model waves in a 3D isotropic elastic earth. The wave equation solution consists of three velocity components and six stresses. The partial derivatives are discretized using 2nd-order in time and 4th-order in space staggered finite difference operators. Staggered schemes allow one to obtain additional accuracy (via centered finite differences) without requiring additional storage. The serial code is unique in its ability to model a number of different types of seismic sources. The parallel implementation uses the MPI library, thus allowing for portability between platforms. Spatial parallelism provides a highly efficient strategy for parallelizing finite difference simulations. In this implementation, one can decompose the global problem domain into one-, two-, and three-dimensional processor decompositions, with 3D decompositions generally producing the best parallel speedup. Because I/O is handled largely outside of the time-step loop (the most expensive part of the simulation), we have opted for straightforward broadcast and reduce operations to handle I/O. The majority of the communication in the code consists of passing subdomain face information to neighboring processors for use as "ghost cells". When this communication is balanced against computation by allocating subdomains of reasonable size, we observe excellent scaled speedup. Allocating subdomains of size 25 x 25 x 25 on each node, we achieve efficiencies of 94% on 128 processors. Numerical examples for both a layered earth model and a homogeneous medium with a high-velocity blocky inclusion illustrate the accuracy of the parallel code.
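The role of ghost cells in a decomposition like this can be illustrated with a minimal 1D sketch. It applies the standard 4th-order staggered first-derivative stencil (coefficients 9/8 and -1/24) on each subdomain after padding it with neighbor values, and the stitched result matches the single-domain result. The function names are hypothetical, the grid is 1D rather than 3D, and list slicing stands in for the MPI face exchange; none of this is taken from the code described in the abstract.

```python
def stencil(f, j, h):
    """4th-order staggered first derivative at the midpoint between f[j] and f[j + 1]."""
    return (9.0 / 8.0 * (f[j + 1] - f[j]) - 1.0 / 24.0 * (f[j + 2] - f[j - 1])) / h


def local_dx(f, start, stop, h):
    """Derivative for the owned cells [start, stop), using ghost cells on both sides."""
    lo = max(start - 1, 0)       # one ghost cell on the left
    hi = min(stop + 2, len(f))   # two ghost cells on the right
    local = f[lo:hi]             # stands in for data received from neighboring processors
    first = max(start, 1)        # skip points whose stencil would leave the global grid
    last = min(stop, len(f) - 2)
    return [stencil(local, i - lo, h) for i in range(first, last)]


def decomposed_dx(f, h, n_sub):
    """Split the grid into n_sub contiguous subdomains and stitch the local results."""
    n = len(f)
    bounds = [n * p // n_sub for p in range(n_sub + 1)]
    out = []
    for p in range(n_sub):
        out.extend(local_dx(f, bounds[p], bounds[p + 1], h))
    return out


h = 0.1
f = [(k * h) ** 3 for k in range(20)]
single = local_dx(f, 0, len(f), h)   # whole grid treated as one subdomain
stitched = decomposed_dx(f, h, 4)    # four subdomains with ghost cells
```

Each subdomain only needs a thin layer of neighbor data per step, so communication scales with the subdomain surface while computation scales with its volume.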
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
NASA Astrophysics Data System (ADS)
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-02-01
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ~ 10^7 particles. Our code is based on the Hénon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10^5 to 10^7. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within ≲ 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 10^5, 128 for N = 10^6 and 256 for N = 10^7. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60×, 100×, and 220×, respectively.
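As a rough reading of these figures, the quoted speedups can be converted to parallel efficiencies with the textbook definition E = S/p. This assumes that each maximum speedup is reached near the processor count at which scaling stops being near-linear (64, 128, and 256), which the abstract implies but does not state.

```latex
E(p) = \frac{S(p)}{p}, \qquad
E(64) \approx \frac{60}{64} \approx 0.94, \quad
E(128) \approx \frac{100}{128} \approx 0.78, \quad
E(256) \approx \frac{220}{256} \approx 0.86 .
```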
Suarez, J.; Stamatiadis, S.; Farantos, S. C.; Lathouwers, L.
2009-08-13
Reproducing molecular dynamics is at the root of the basic principles of chemical change and physical properties of the matter. New insight on molecular encounters can be gained by solving the Schroedinger equation in cartesian coordinates, provided one can overcome the massive calculations that it implies. We have developed a parallel code for solving the molecular Time Dependent Schroedinger Equation (TDSE) in cartesian coordinates. Variable order Finite Difference methods result in sparse Hamiltonian matrices which can make the large scale problem solving feasible.
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-02-15
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ~ 10^7 particles. Our code is based on the Hénon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10^5 to 10^7. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within ≲ 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 10^5, 128 for N = 10^6 and 256 for N = 10^7. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60×, 100×, and 220×, respectively.
Short Communication: A Parallel Newton-Krylov Method for Navier-Stokes Rotorcraft Codes
NASA Astrophysics Data System (ADS)
Ekici, Kivanc; Lyrintzis, Anastasios S.
2003-05-01
The application of Krylov subspace iterative methods to unsteady three-dimensional Navier-Stokes codes on massively parallel and distributed computing environments is investigated. Previously, the Euler mode of the Navier-Stokes flow solver Transonic Unsteady Rotor Navier-Stokes (TURNS) has been coupled with a Newton-Krylov scheme which uses two Conjugate-Gradient-like (CG) iterative methods. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. Parallel implicit operators are used and compared as preconditioners. Results are presented for two-dimensional and three-dimensional viscous cases. The Message Passing Interface (MPI) protocol is used, because of its portability to various parallel architectures.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80
NASA Technical Reports Server (NTRS)
Kamat, Manohar P.; Watson, Brian C.
1992-01-01
The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern-day computers with a single processing unit was estimated at 3 billion floating-point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization, as on the Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special-purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built up from flat thin shell elements.
Chin, George; Choudhury, Sutanay; Kangas, Lars J.; McFarlane, Sally A.; Marquez, Andres
2011-09-01
Long viewed as a strong statistical inference technique, Bayesian networks have emerged as an important class of applications for high-performance computing. We have applied an architecture-conscious approach to parallelizing the Lauritzen-Spiegelhalter Junction Tree algorithm for exact inferencing in Bayesian networks. In optimizing the Junction Tree algorithm, we have implemented both in-clique and topological parallelism strategies to best leverage the fine-grained synchronization and massive-scale multithreading of the Cray XMT architecture. Two topological techniques were developed to parallelize the evidence propagation process through the Bayesian network. One technique involves performing intelligent scheduling of junction tree nodes based on its topology and relative size. The second technique involves decomposing the junction tree into a much finer tree-like representation to offer many more opportunities for parallelism. We evaluate these optimizations on five different Bayesian networks and report our findings and observations. Another important contribution of this paper is to demonstrate the application of massive-scale multithreading for load balancing and use of implicit parallelism-based compiler optimizations in designing scalable inferencing algorithms.
Self-Scheduling Parallel Methods for Multiple Serial Codes with Application to WOPWOP
NASA Technical Reports Server (NTRS)
Long, Lyle N.; Brentner, Kenneth S.
2000-01-01
This paper presents a scheme for efficiently running a large number of serial jobs on parallel computers. Two examples are given of computer programs that run relatively quickly, but often they must be run numerous times to obtain all the results needed. It is very common in science and engineering to have codes that are not massive computing challenges in themselves, but due to the number of instances that must be run, they do become large-scale computing problems. The two examples given here represent common problems in aerospace engineering: aerodynamic panel methods and aeroacoustic integral methods. The first example simply solves many systems of linear equations. This is representative of an aerodynamic panel code where someone would like to solve for numerous angles of attack. The complete code for this first example is included in the appendix so that it can be readily used by others as a template. The second example is an aeroacoustics code (WOPWOP) that solves the Ffowcs Williams Hawkings equation to predict the far-field sound due to rotating blades. In this example, one quite often needs to compute the sound at numerous observer locations, hence parallelization is utilized to automate the noise computation for a large number of observers.
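The self-scheduling idea (idle workers pull the next pending job until none remain) can be sketched with Python's standard thread pool. This is a minimal illustration only: the 2x2 system, the angle list, and the function names are hypothetical stand-ins, and threads replace the parallel-computer processes used in the paper, whose appendix code is not reproduced here.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return (x0, x1)


def run_case(alpha_deg):
    """One independent serial job: a toy system whose right-hand side depends on the angle."""
    alpha = math.radians(alpha_deg)
    a = [[2.0, 1.0], [1.0, 3.0]]
    b = [math.cos(alpha), math.sin(alpha)]
    return alpha_deg, solve_2x2(a, b)


def run_all(cases, workers=4):
    """Self-scheduling: each idle worker takes the next pending case from the pool's queue."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_case, cases))


results = run_all(range(0, 10, 2))  # hypothetical angles of attack, in degrees
```

Because jobs are handed out one at a time rather than split up front, a slow case delays only the worker running it, which keeps the load balanced when run times vary.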
Recent Improvements to the IMPACT-T Parallel Particle Tracking Code
Qiang, J.; Pogorelov, I.V.; Ryne, R.
2006-11-16
The IMPACT-T code is a parallel three-dimensional quasi-static beam dynamics code for modeling high brightness beams in photoinjectors and RF linacs. Developed under the US DOE Scientific Discovery through Advanced Computing (SciDAC) program, it includes several key features including a self-consistent calculation of 3D space-charge forces using a shifted and integrated Green function method, multiple energy bins for beams with large energy spread, and models for treating RF standing wave and traveling wave structures. In this paper, we report on recent improvements to the IMPACT-T code including modeling traveling wave structures, short-range transverse and longitudinal wakefields, and longitudinal coherent synchrotron radiation through bending magnets.
Performance analysis of parallel gravitational N-body codes on large GPU clusters
NASA Astrophysics Data System (ADS)
Huang, Si-Yi; Spurzem, Rainer; Berczik, Peter
2016-01-01
We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 - 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 - 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly.
TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs.
Grosset, A V Pascal; Prasad, Manasa; Christensen, Cameron; Knoll, Aaron; Hansen, Charles
2017-06-01
Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Last, we introduce a workflow in which both rendering and compositing are done on the GPU.
NASA Astrophysics Data System (ADS)
Chen, Jian; Matuttis, Hans-Georg
2013-02-01
We report our experiences with the optimization and parallelization of a discrete element code for convex polyhedra on multi-core machines and introduce a novel variant of the sort-and-sweep neighborhood algorithm. While in theory the whole code in itself parallelizes ideally, in practice the results on different architectures with different compilers and performance measurement tools depend very much on the particle number and optimization of the code. After difficulties with the interpretation of the data for speedup and efficiency are overcome, respectable parallelization speedups could be obtained.
NBSymple, a double parallel, symplectic N-body code running on graphic processing units
NASA Astrophysics Data System (ADS)
Capuzzo-Dolcetta, R.; Mastrobuono-Battisti, A.; Maschietti, D.
2011-07-01
We present and discuss the characteristics and performance, both in terms of computational speed and precision, of a numerical code which integrates the equations of motion of N 'particles' that interact via Newtonian gravitation and move in an external smooth galactic field. The force on every particle is evaluated by means of direct summation of the contributions of all the other particles in the system, avoiding truncation error. The time integration is done with second-order and sixth-order symplectic schemes. The code, NBSymple, has been parallelized twice: by means of the Compute Unified Device Architecture (CUDA), to make the all-pair force evaluation as fast as possible on high-performance NVIDIA TESLA C1060 Graphic Processing Units, while the O(N) computations are distributed over several CPUs by means of the OpenMP Application Program Interface. The code works in either single-precision or double-precision floating-point arithmetic. Single precision makes the best use of the GPU performance but, of course, limits the precision of the simulation in some critical situations. We find a good compromise in using a software reconstruction of double precision for those variables that are most critical for the overall precision of the code. The code is available on the web site astrowww.phys.uniroma1.it/dolcetta/nbsymple.html.
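As an illustration of the two ingredients named in this abstract, direct all-pair summation and a second-order symplectic integrator, here is a minimal sketch in plain Python. It is not the NBSymple implementation: the kick-drift-kick leapfrog is one common second-order symplectic scheme (an assumption, since the abstract does not name the scheme), the external galactic field is omitted, and all function names are hypothetical.

```python
import math

def accelerations(pos, mass, G=1.0):
    """All-pair (direct summation) gravitational accelerations, O(N^2)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc

def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick step; pos and vel are updated in place."""
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
            pos[i][k] += dt * vel[i][k]        # full drift
    acc = accelerations(pos, mass)             # forces at the new positions
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick

def total_energy(pos, vel, mass, G=1.0):
    """Kinetic plus pairwise potential energy, used as a conservation check."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            potential -= G * mass[i] * mass[j] / math.dist(pos[i], pos[j])
    return kinetic + potential
```

The double loop in `accelerations` is the O(N^2) part that a GPU implementation would offload, while the per-particle updates in `leapfrog_step` are the O(N) part.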
A new technique for a parallel dealiased pseudospectral Navier-Stokes code
NASA Astrophysics Data System (ADS)
Iovieno, Michele; Cavazzoni, Carlo; Tordella, Daniela
2001-12-01
A novel aspect of a parallel procedure for the numerical simulation of the solution of the Navier-Stokes equations through the Fourier-Galerkin pseudospectral method is presented. It consists of a dealiased ("3/2" rule) transposition of the data that organizes the computations in the distributed direction in such a way that whenever a Fast Fourier Transform must be calculated, the algorithm employs data stored solely in the local memory of the processor that is computing it. This allows standard routines to be used for the computation of the Fourier transforms. The aliasing removal procedure has been directly inserted into the transposition algorithm. The code is written for distributed memory computers, but not specifically for a particular architecture. Its use on a variety of machines is made possible by the adoption of the Message Passing Interface library. The portability of the code is demonstrated by the similar performance, in particular the high efficiency, that all the machines tested show up to a number of parallel processors equal to 1/2 the truncation parameter N/2. Explicit time integration is used. The present code organization is relevant to physical and mathematical problems which require a three-dimensional spectral treatment.
Xu, Lijun; Chen, Jianjun; Cao, Zhang; Liu, Xingbin; Hu, Jinhai
2014-07-01
In this paper, a quasi-parallel inductive-capacitive (LC) resonance method is proposed to improve the recovery of MIL-STD-1553 Manchester code with several frequency components from attenuated, distorted, and drifted signals for data telemetry in well logging, and a corresponding telemetry system is developed. The required resonant frequency and quality factor are derived, and the quasi-parallel LC resonant circuit is established at the receiving end of the logging cable to suppress the low-pass filtering effect caused by the distributed capacitance of the cable and to provide a balanced pass for all three frequency components of the Manchester code. The performance of the method for various encoding frequencies and cable lengths at different bit energy to noise density ratios (Eb/No) has been evaluated in simulation. A 5 km single-core cable used in on-site well logging and various encoding frequencies were employed to verify the proposed telemetry system experimentally. The results demonstrate that the telemetry system is feasible and effective: it improves code recovery in terms of anti-attenuation, anti-distortion, and anti-drift performance, decreases the bit error rate, and greatly increases the reachable transmission rate and distance.
NASA Astrophysics Data System (ADS)
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2016-03-01
This work discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package authored at Oak Ridge National Laboratory. Shift has been developed to scale well from laptops to small computing clusters to advanced supercomputers and includes features such as support for multiple geometry and physics engines, hybrid capabilities for variance reduction methods such as the Consistent Adjoint-Driven Importance Sampling methodology, advanced parallel decompositions, and tally methods optimized for scalability on supercomputing architectures. The scaling studies presented in this paper demonstrate good weak and strong scaling behavior for the implemented algorithms. Shift has also been validated and verified against various reactor physics benchmarks, including the Consortium for Advanced Simulation of Light Water Reactors' Virtual Environment for Reactor Analysis criticality test suite and several Westinghouse AP1000® problems presented in this paper. These benchmark results compare well to those from other contemporary Monte Carlo codes such as MCNP5 and KENO.
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Some specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000® problems. These benchmark and scaling studies show promising results.
Shared Memory Parallelization of an Implicit ADI-type CFD Code
NASA Technical Reports Server (NTRS)
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of Re_τ = 180 has shown good agreement with existing data.
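For context, ADI-type implicit schemes typically reduce each directional sweep to many independent banded solves along grid lines, often tridiagonal ones. The sketch below shows the Thomas algorithm for a single such line. It is an illustration under that assumption, not the solver used in LESTool, and the function name is hypothetical.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    No pivoting is done, so the matrix is assumed diagonally dominant.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each solve is sequential along its own line, but the lines of one sweep do not depend on each other, so they can be shared among threads. The order in which those lines are traversed in memory is where cache behavior of the kind discussed in the abstract comes into play.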
CPIC: A Parallel Particle-In-Cell Code for Studying Spacecraft Charging
NASA Astrophysics Data System (ADS)
Meierbachtol, Collin; Delzanno, Gian Luca; Moulton, David; Vernon, Louis
2015-11-01
CPIC is a three-dimensional electrostatic particle-in-cell code designed for use with curvilinear meshes. One of its primary objectives is to aid in studying spacecraft charging in the magnetosphere. CPIC maintains near-optimal computational performance and scaling thanks to a mapped logical mesh field solver, and a hybrid physical-logical space particle mover (avoiding the need to track particles). CPIC is written for parallel execution, utilizing a combination of both OpenMP threading and MPI distributed memory. New capabilities are being actively developed and added to CPIC, including the ability to handle multi-block curvilinear mesh structures. Verification results comparing CPIC to analytic test problems will be provided. Particular emphasis will be placed on the charging and shielding of a sphere-in-plasma system. Simulated charging results of representative spacecraft geometries will also be presented. Finally, its performance capabilities will be demonstrated through parallel scaling data.
Korall, Petra; Pryer, Kathleen M; Metzgar, Jordan S; Schneider, Harald; Conant, David S
2006-06-01
Tree ferns are a well-established clade within leptosporangiate ferns. Most of the 600 species (in seven families and 13 genera) are arborescent, but considerable morphological variability exists, spanning the giant scaly tree ferns (Cyatheaceae), the low, erect plants (Plagiogyriaceae), and the diminutive endemics of the Guayana Highlands (Hymenophyllopsidaceae). In this study, we investigate phylogenetic relationships within tree ferns based on analyses of four protein-coding, plastid loci (atpA, atpB, rbcL, and rps4). Our results reveal four well-supported clades, with genera of Dicksoniaceae (sensu ) interspersed among them: (A) (Loxomataceae, (Culcita, Plagiogyriaceae)), (B) (Calochlaena, (Dicksonia, Lophosoriaceae)), (C) Cibotium, and (D) Cyatheaceae, with Hymenophyllopsidaceae nested within. How these four groups are related to one another, to Thyrsopteris, or to Metaxyaceae is weakly supported. Our results show that Dicksoniaceae and Cyatheaceae, as currently recognised, are not monophyletic and new circumscriptions for these families are needed.
Multi-Zone Liquid Thrust Chamber Performance Code with Domain Decomposition for Parallel Processing
NASA Technical Reports Server (NTRS)
Navaz, Homayun K.
2002-01-01
-equation turbulence model, and two-phase flow. To overcome these limitations, the LTCP code is rewritten to include the multi-zone capability with domain decomposition that makes it suitable for parallel processing, i.e., enabling the code to run every zone or sub-domain on a separate processor. This can reduce the run time by a factor of 6 to 8, depending on the problem.
A Parallel Two-fluid Code for Global Magnetic Reconnection Studies
J.A. Breslau; S.C. Jardin
2001-08-09
This paper describes a new algorithm for the computation of two-dimensional resistive magnetohydrodynamic (MHD) and two-fluid studies of magnetic reconnection in plasmas. It has been implemented on several parallel platforms and shows good scalability up to 32 CPUs for reasonable problem sizes. A fixed, nonuniform rectangular mesh is used to resolve the different spatial scales in the reconnection problem. The resistive MHD version of the code uses an implicit/explicit hybrid method, while the two-fluid version uses an alternating-direction implicit (ADI) method. The technique has proven useful for comparing several different theories of collisional and collisionless reconnection.
Grid-based Parallel Data Streaming Implemented for the Gyrokinetic Toroidal Code
S. Klasky; S. Ethier; Z. Lin; K. Martins; D. McCune; R. Samtaney
2003-09-15
We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study microturbulence in magnetic confinement fusion from first principles plasma theory.
A distributed coding approach for stereo sequences in the tree structured Haar transform domain
NASA Astrophysics Data System (ADS)
Cancellaro, M.; Carli, M.; Neri, A.
2009-02-01
In this contribution, a novel method for distributed video coding for stereo sequences is proposed. The system encodes independently the left and right frames of the stereoscopic sequence. The decoder exploits the side information to achieve the best reconstruction of the correlated video streams. In particular, a syndrome coder approach based on a lifted Tree Structured Haar wavelet scheme has been adopted. The experimental results show the effectiveness of the proposed scheme.
Coding for Parallel Links to Maximize the Expected Value of Decodable Messages
NASA Technical Reports Server (NTRS)
Klimesh, Matthew A.; Chang, Christopher S.
2011-01-01
When multiple parallel communication links are available, it is useful to consider link-utilization strategies that provide tradeoffs between reliability and throughput. Interesting cases arise when there are three or more available links. Under the model considered, the links have known probabilities of being in working order, and each link has a known capacity. The sender has a number of messages to send to the receiver. Each message has a size and a value (i.e., a worth or priority). Messages may be divided into pieces arbitrarily, and the value of each piece is proportional to its size. The goal is to choose combinations of messages to send on the links so that the expected value of the messages decodable by the receiver is maximized. There are three parts to the innovation: (1) Applying coding to parallel links under the model; (2) Linear programming formulation for finding the optimal combinations of messages to send on the links; and (3) Algorithms for assisting in finding feasible combinations of messages, as support for the linear programming formulation. There are similarities between this innovation and methods developed in the field of network coding. However, network coding has generally been concerned with either maximizing throughput in a fixed network, or robust communication of a fixed volume of data. In contrast, under this model, the throughput is expected to vary depending on the state of the network. Examples of error-correcting codes that are useful under this model but which are not needed under previous models have been found. This model can represent either a one-shot communication attempt, or a stream of communications. Under the one-shot model, message sizes and link capacities are quantities of information (e.g., measured in bits), while under the communications stream model, message sizes and link capacities are information rates (e.g., measured in bits/second). This work has the potential to increase the value of data returned from
GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling
NASA Astrophysics Data System (ADS)
Miki, Yohei; Umemura, Masayuki
2017-04-01
The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics and is well suited to GPUs. Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on-the-fly and minimizes the time-to-solution by balancing the measured time of multiple functions. Results of performance measurements with realistic particle distributions performed on NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, which are representative GPUs of the Fermi, Kepler, and Maxwell generations of GPUs, show that the hierarchical time step achieves a speedup of around 3-5 times compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 2^24 = 16,777,216 particles. The averaged performance of the code corresponds to 10-30% of the theoretical single precision peak performance of the GPU.
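The idea of a hierarchical (block) time step can be sketched as follows. This is a generic illustration, not GOTHIC's implementation: it assumes the common convention that each particle's step is the largest power-of-two fraction of a maximum step that does not exceed its desired step, and all names are hypothetical.

```python
def block_level(dt_desired, dt_max, max_level=20):
    """Smallest level k such that dt_max / 2**k <= dt_desired."""
    k = 0
    dt = dt_max
    while dt > dt_desired and k < max_level:
        dt *= 0.5
        k += 1
    return k

def active_particles(levels, substep):
    """Indices of particles updated at a given substep.

    Substeps are counted in units of the finest step, from 0 to
    2**max(levels) - 1 within one dt_max interval.
    """
    k_max = max(levels)
    return [i for i, k in enumerate(levels)
            if substep % (2 ** (k_max - k)) == 0]

def update_counts(levels):
    """Particle updates per dt_max interval: (hierarchical, shared)."""
    k_max = max(levels)
    hierarchical = sum(2 ** k for k in levels)
    shared = len(levels) * 2 ** k_max  # everyone takes the finest step
    return hierarchical, shared
```

With a shared time step every particle is advanced with the smallest step in the system; with block steps only the particles that need short steps pay for them, which is the source of the saving the abstract measures.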
ALEGRA -- A massively parallel h-adaptive code for solid dynamics
Summers, R.M.; Wong, M.K.; Boucheron, E.A.; Weatherby, J.R.
1997-12-31
ALEGRA is a multi-material, arbitrary-Lagrangian-Eulerian (ALE) code for solid dynamics designed to run on massively parallel (MP) computers. It combines the features of modern Eulerian shock codes, such as CTH, with modern Lagrangian structural analysis codes using an unstructured grid. ALEGRA is being developed for use on the teraflop supercomputers to conduct advanced three-dimensional (3D) simulations of shock phenomena important to a variety of systems. ALEGRA was designed with the Single Program Multiple Data (SPMD) paradigm, in which the mesh is decomposed into sub-meshes so that each processor gets a single sub-mesh with approximately the same number of elements. Using this approach the authors have been able to produce a single code that can scale from one processor to thousands of processors. A current major effort is to develop efficient, high precision simulation capabilities for ALEGRA, without the computational cost of using a global highly resolved mesh, through flexible, robust h-adaptivity of finite elements. H-adaptivity is the dynamic refinement of the mesh by subdividing elements, thus changing the characteristic element size and reducing numerical error. The authors are working on several major technical challenges that must be met to make effective use of HAMMER on MP computers.
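The SPMD decomposition described here, in which each processor receives one sub-mesh with approximately the same number of elements, can be illustrated by a deliberately simplified sketch. It assumes elements are numbered 0..n-1 and split into contiguous ranges; a real unstructured-mesh code would use a graph or geometric partitioner instead, and the function name is hypothetical.

```python
def block_range(n_elements, n_procs, rank):
    """Half-open range [start, stop) of elements owned by `rank`.

    The first (n_elements % n_procs) ranks get one extra element,
    so sub-mesh sizes differ by at most one.
    """
    base, extra = divmod(n_elements, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size
```

Every process runs the same program and calls `block_range` with its own rank, which is the essence of the SPMD model: one code, many data partitions.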
Moryakov, A. V.
2016-12-15
An algorithm for solving the time-dependent transport equation in the P_mS_n group approximation with the use of parallel computations is presented. The algorithm is implemented in the LUCKY-TD code for supercomputers employing the MPI standard for the data exchange between parallel processes.
NASA Astrophysics Data System (ADS)
Stepšys, A.; Mickevicius, S.; Germanas, D.; Kalinauskas, R. K.
2014-11-01
This new version of the HOTB program for calculation of the three and four particle harmonic oscillator transformation brackets provides some enhancements and corrections to the earlier version (Germanas et al., 2010) [1]. In particular, the new version allows calculations of harmonic oscillator transformation brackets to be performed in parallel using the MPI parallel communication standard. Moreover, intermediate calculations are carried out at higher precision using GNU Quadruple Precision and the arbitrary-precision library FMLib [2]. A package of Fortran code is presented. The calculation time for large matrices can be significantly reduced by using efficient parallel code. The use of higher-precision methods in intermediate calculations increases the stability of the algorithms and extends their validity to larger input values. Catalogue identifier: AEFQ_v4_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFQ_v4_0.html Program obtainable from: CPC Program Library, Queen’s University of Belfast, N. Ireland Licensing provisions: GNU General Public License, version 3 Number of lines in programs, including test data, etc.: 1711 Number of bytes in distributed programs, including test data, etc.: 11667 Distribution format: tar.gz Program language used: FORTRAN 90 with MPI extensions for parallelism Computer: Any computer with FORTRAN 90 compiler Operating system: Windows, Linux, FreeBSD, Tru64 Unix Has the code been vectorized or parallelized?: Yes, parallelism using MPI extensions. Number of CPUs used: up to 999 RAM (per CPU core): Depending on allocated binomial and trinomial matrices and use of precision; at least 500 MB Catalogue identifier of previous version: AEFQ_v1_0 Journal reference of previous version: Comput. Phys. Comm. 181, Issue 2, (2010) 420-425 Does the new version supersede the previous version? Yes Nature of problem: Calculation of matrices of three-particle harmonic oscillator brackets (3HOB) and four-particle harmonic oscillator brackets (4HOB) in a more
Context Tree-Based Image Contour Coding Using a Geometric Prior
NASA Astrophysics Data System (ADS)
Zheng, Amin; Cheung, Gene; Florencio, Dinei
2017-02-01
If object contours in images are coded efficiently as side information, then they can facilitate advanced image/video coding techniques, such as graph Fourier transform coding or motion prediction of arbitrarily shaped pixel blocks. In this paper, we study the problem of lossless and lossy compression of detected contours in images. Specifically, we first convert a detected object contour composed of contiguous between-pixel edges to a sequence of directional symbols drawn from a small alphabet. To encode the symbol sequence using arithmetic coding, we compute an optimal variable-length context tree (VCT) $\mathcal{T}$ via a maximum a posteriori (MAP) formulation to estimate symbols' conditional probabilities. MAP prevents us from overfitting given a small training set $\mathcal{X}$ of past symbol sequences by identifying a VCT $\mathcal{T}$ that achieves a high likelihood $P(\mathcal{X}|\mathcal{T})$ of observing $\mathcal{X}$ given $\mathcal{T}$, and a large geometric prior $P(\mathcal{T})$ stating that image contours are more often straight than curvy. For the lossy case, we design efficient dynamic programming (DP) algorithms that optimally trade off coding rate of an approximate contour $\hat{\mathbf{x}}$ given a VCT $\mathcal{T}$ with two notions of distortion of $\hat{\mathbf{x}}$ with respect to the original contour $\mathbf{x}$. To reduce the size of the DP tables, a total suffix tree is derived from a given VCT $\mathcal{T}$ for compact table entry indexing, reducing complexity. Experimental results show that for lossless contour coding, our proposed algorithm outperforms state-of-the-art context-based schemes consistently for both small and large training datasets. For lossy contour coding, our algorithms outperform comparable schemes in the literature in rate-distortion performance.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX-80
NASA Technical Reports Server (NTRS)
Kamat, Manohar P.; Watson, Brian C.
1992-01-01
The finite element method has proven to be an invaluable tool for analysis and design of complex, high performance systems, such as bladed-disk assemblies in aircraft turbofan engines. However, as the problem size increases, the computation time required by conventional computers can be prohibitively high. Parallel processing computers provide the means to overcome these computation time limits. This report summarizes the results of a research activity aimed at providing a finite element capability for analyzing turbomachinery bladed-disk assemblies in a vector/parallel processing environment. A special purpose code, named with the acronym SAPNEW, has been developed to perform static and eigen analysis of multi-degree-of-freedom blade models built up from flat thin shell elements. SAPNEW provides a stand-alone capability for static and eigen analysis on the Alliant FX/80, a parallel processing computer. A preprocessor, named with the acronym NTOS, has been developed to accept NASTRAN input decks and convert them to the SAPNEW format to make SAPNEW more readily used by researchers at NASA Lewis Research Center.
SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX-80
NASA Astrophysics Data System (ADS)
Kamat, Manohar P.; Watson, Brian C.
1992-11-01
The finite element method has proven to be an invaluable tool for analysis and design of complex, high performance systems, such as bladed-disk assemblies in aircraft turbofan engines. However, as the problem size increases, the computation time required by conventional computers can be prohibitively high. Parallel processing computers provide the means to overcome these computation time limits. This report summarizes the results of a research activity aimed at providing a finite element capability for analyzing turbomachinery bladed-disk assemblies in a vector/parallel processing environment. A special purpose code, named with the acronym SAPNEW, has been developed to perform static and eigen analysis of multi-degree-of-freedom blade models built up from flat thin shell elements. SAPNEW provides a stand-alone capability for static and eigen analysis on the Alliant FX/80, a parallel processing computer. A preprocessor, named with the acronym NTOS, has been developed to accept NASTRAN input decks and convert them to the SAPNEW format to make SAPNEW more readily used by researchers at NASA Lewis Research Center.
An ecological and evolutionary perspective on the parallel invasion of two cross-compatible trees
Besnard, Guillaume; Cuneo, Peter
2016-01-01
Invasive trees are generally seen as ecosystem-transforming plants that can have significant impacts on native vegetation, and often require management and control. Understanding their history and biology is essential to guide actions of land managers. Here, we present a summary of recent research into the ecology, phylogeography and management of invasive olives, which are now established outside of their native range as high ecological impact invasive trees. The parallel invasion of European and African olive in different climatic zones of Australia provides an interesting case study of invasion, characterized by early genetic admixture between domesticated and wild taxa. Today, the impact of the invasive olives on native vegetation and ecosystem function is of conservation concern, with European olive a declared weed in areas of South Australia, and African olive a declared weed in New South Wales and Pacific islands. Population genetics was used to trace the origins and invasion of both subspecies in Australia, indicating that both olive subspecies have hybridized early after introduction. Research also indicates that African olive populations can establish from a low number of founder individuals even after successive bottlenecks. Modelling based on distributional data from the native and invasive range identified a shift of the realized ecological niche in the Australian invasive range for both olive subspecies, which was particularly marked for African olive. As highly successful and long-lived invaders, olives offer further opportunities to understand the genetic basis of invasion, and we propose that future research examines the history of introduction and admixture, the genetic basis of adaptability and the role of biotic interactions during invasion. Advances on these questions will ultimately improve predictions on the future olive expansion and provide a solid basis for better management of invasive populations. PMID:27519914
Application of a 3D, Adaptive, Parallel, MHD Code to Supernova Remnant Simulations
NASA Astrophysics Data System (ADS)
Kominsky, P.; Drake, R. P.; Powell, K. G.
2001-05-01
We at Michigan have a computational model, BATS-R-US, which incorporates several modern features that make it suitable for calculations of supernova remnant evolution. In particular, it is a three-dimensional MHD model, using a method called the Multiscale Adaptive Upwind Scheme for MagnetoHydroDynamics (MAUS-MHD). It incorporates a data structure that allows for adaptive refinement of the mesh, even in massively parallel calculations. Its advanced Godunov method, a solution-adaptive, upwind, high-resolution scheme, incorporates a new, flux-based approach to the Riemann solver with improved numerical properties. This code has been successfully applied to several problems, including the simulation of comets and of planetary magnetospheres, in the 3D context of the Heliosphere. The code was developed under a NASA computational grand challenge grant to run very rapidly on parallel platforms. It is also now being used to study time-dependent systems such as the transport of particles and energy from solar coronal mass ejections to the Earth. We are in the process of modifying this code so that it can accommodate the very strong shocks present in supernova remnants. Our test case simulates the explosion of a star of 1.4 solar masses with an energy of 1 foe, in a uniform background medium. We have performed runs of 250,000 to 1 million cells on 8 nodes of an Origin 2000. These relatively coarse grids do not allow fine details of instabilities to become visible. Nevertheless, the macroscopic evolution of the shock is simulated well, with the forward and reverse shocks visible in velocity profiles. We will show our work to date. This work was supported by NASA through its GSRP program.
FISH: A THREE-DIMENSIONAL PARALLEL MAGNETOHYDRODYNAMICS CODE FOR ASTROPHYSICAL APPLICATIONS
Kaeppeli, R.; Whitehouse, S. C.; Scheidegger, S.; Liebendoerfer, M.; Pen, U.-L.
2011-08-01
FISH is a fast and simple ideal magnetohydrodynamics code that scales to ~10,000 processes for a Cartesian computational domain of ~1000³ cells. The simplicity of FISH has been achieved by the rigorous application of the operator splitting technique, while second-order accuracy is maintained by the symmetric ordering of the operators. Between directional sweeps, the three-dimensional data are rotated in memory so that the sweep is always performed in a cache-efficient way along the direction of contiguous memory. Hence, the code only requires a one-dimensional description of the conservation equations to be solved. This approach also enables an elegant novel parallelization of the code that is based on persistent communications with MPI for cubic domain decomposition on machines with distributed memory. This scheme is then combined with an additional OpenMP parallelization of different sweeps that can take advantage of clusters of shared memory. We document the detailed implementation of a second-order total variation diminishing advection scheme based on flux reconstruction. The magnetic fields are evolved by a constrained transport scheme. We show that the subtraction of a simple estimate of the hydrostatic gradient from the total gradients can significantly reduce the dissipation of the advection scheme in simulations of gravitationally bound hydrostatic objects. Through its simplicity and efficiency, FISH is as well suited for hydrodynamics classes as for large-scale astrophysical simulations on high-performance computer clusters. In preparation for the release of a public version, we demonstrate the performance of FISH in a suite of astrophysically orientated test cases.
FISH: A Three-dimensional Parallel Magnetohydrodynamics Code for Astrophysical Applications
NASA Astrophysics Data System (ADS)
Käppeli, R.; Whitehouse, S. C.; Scheidegger, S.; Pen, U.-L.; Liebendörfer, M.
2011-08-01
FISH is a fast and simple ideal magnetohydrodynamics code that scales to ~10,000 processes for a Cartesian computational domain of ~1000³ cells. The simplicity of FISH has been achieved by the rigorous application of the operator splitting technique, while second-order accuracy is maintained by the symmetric ordering of the operators. Between directional sweeps, the three-dimensional data are rotated in memory so that the sweep is always performed in a cache-efficient way along the direction of contiguous memory. Hence, the code only requires a one-dimensional description of the conservation equations to be solved. This approach also enables an elegant novel parallelization of the code that is based on persistent communications with MPI for cubic domain decomposition on machines with distributed memory. This scheme is then combined with an additional OpenMP parallelization of different sweeps that can take advantage of clusters of shared memory. We document the detailed implementation of a second-order total variation diminishing advection scheme based on flux reconstruction. The magnetic fields are evolved by a constrained transport scheme. We show that the subtraction of a simple estimate of the hydrostatic gradient from the total gradients can significantly reduce the dissipation of the advection scheme in simulations of gravitationally bound hydrostatic objects. Through its simplicity and efficiency, FISH is as well suited for hydrodynamics classes as for large-scale astrophysical simulations on high-performance computer clusters. In preparation for the release of a public version, we demonstrate the performance of FISH in a suite of astrophysically orientated test cases.
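The symmetric sweep ordering and the in-memory rotation described in this abstract can be illustrated with a minimal Python/NumPy sketch. It is not taken from the FISH source: the names `sweep_1d`, `rotate_forward`, `rotate_backward` and `double_step` are hypothetical, and a first-order periodic upwind update stands in for the second-order TVD scheme of the paper. The point is only that one 1D solver, always applied along the contiguous axis, suffices once the data are rotated between sweeps.

```python
import numpy as np

def sweep_1d(u, courant):
    """Stand-in 1D solver: periodic first-order upwind advection
    along the last (memory-contiguous) axis."""
    return u - courant * (u - np.roll(u, 1, axis=-1))

def rotate_forward(u):
    """Permute axes (a, b, c) -> (c, a, b) and copy, so that the
    new last axis is contiguous in memory."""
    return np.ascontiguousarray(np.transpose(u, (2, 0, 1)))

def rotate_backward(u):
    """Inverse permutation of rotate_forward: (c, a, b) -> (a, b, c)."""
    return np.ascontiguousarray(np.transpose(u, (1, 2, 0)))

def double_step(u, courant):
    """Apply sweeps in the symmetric order z, y, x, x, y, z to data
    stored as (x, y, z); the result is returned in the same layout."""
    u = sweep_1d(u, courant)                   # layout (x, y, z): z sweep
    u = sweep_1d(rotate_forward(u), courant)   # layout (z, x, y): y sweep
    u = sweep_1d(rotate_forward(u), courant)   # layout (y, z, x): x sweep
    u = sweep_1d(u, courant)                   # x sweep again
    u = sweep_1d(rotate_backward(u), courant)  # layout (z, x, y): y sweep
    u = sweep_1d(rotate_backward(u), courant)  # layout (x, y, z): z sweep
    return u
```

The copy made by `np.ascontiguousarray` is the cost paid for cache-friendly sweeps; whether that trade-off pays off depends on the grid size and the hardware.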
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro
2016-08-01
We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N²) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 10⁷) to 300 ms (N = 10⁹). These are currently limited by the time for the calculation of the domain decomposition and communication
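The "simple, sequential and unoptimized" pairwise interaction kernel that the abstract refers to can be sketched as a direct sum over all particle pairs. FDPS itself is a C++ template framework and none of its API is used here; the function name `direct_accel` and the softening parameter `eps` are assumptions made for illustration only.

```python
import numpy as np

def direct_accel(pos, mass, G=1.0, eps=0.0):
    """Gravitational acceleration on every particle by direct summation
    over all pairs (quadratic cost in the number of particles).

    pos  : (N, 3) array of positions
    mass : (N,) array of masses
    eps  : softening length (hypothetical parameter, 0 means none)
    """
    d = pos[None, :, :] - pos[:, None, :]      # d[i, j] = r_j - r_i
    r2 = (d ** 2).sum(axis=-1) + eps ** 2      # squared pair distances
    np.fill_diagonal(r2, np.inf)               # exclude self-interaction
    inv_r3 = r2 ** -1.5                        # diagonal becomes exactly 0
    return G * (d * (mass[None, :, None] * inv_r3[:, :, None])).sum(axis=1)
```

A framework of the kind described would take a kernel like this from the user and supply the domain decomposition, particle exchange and tree construction around it.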
NASA Astrophysics Data System (ADS)
Chang, Yang-Lang; Chen, Zhi-Ming; Liu, Jin-Nan; Chang, Lena; Fang, Jyh Perng
2010-08-01
Satellite remote sensing images can be interpreted to provide important information about large-scale natural resources, such as lands, oceans, mountains, rivers, forests and minerals, for Earth observations. Recent advances in remote sensing technologies have improved the availability of satellite imagery in a wide range of applications, including high dimensional remote sensing data sets (e.g. high spectral and high spatial resolution images). The information in high dimensional remote sensing images obtained by state-of-the-art sensor technologies can be identified more accurately than in images acquired by conventional remote sensing techniques. However, because of the large volume of image data, such imagery requires a huge amount of storage and computing time, and the computational complexity of data processing for high dimensional remote sensing data analysis increases accordingly. Consequently, this paper proposes a novel classification algorithm based on semi-matroid structure, known as the parallel k-dimensional tree semi-matroid (PKTSM) classification, which adopts a new hybrid parallel approach to deal with high dimensional data sets. It is implemented by combining the message passing interface (MPI) library, the open multi-processing (OpenMP) application programming interface and the compute unified device architecture (CUDA) of graphics processing units (GPU) in a hybrid mode. The effectiveness of the proposed PKTSM is evaluated by using MODIS/ASTER airborne simulator (MASTER) images and airborne synthetic aperture radar (AIRSAR) images for land cover classification during the Pacrim II campaign. The experimental results demonstrated that the proposed hybrid PKTSM can significantly improve the performance in terms of both computational speed-up and classification accuracy.
Delta: An object-oriented finite element code architecture for massively parallel computers
Weatherby, J.R.; Schutt, J.A.; Peery, J.S.; Hogan, R.E.
1996-02-01
Delta is an object-oriented code architecture based on the finite element method which enables simulation of a wide range of engineering mechanics problems in a parallel processing environment. Written in C++, Delta is a natural framework for algorithm development and for research involving coupling of mechanics from different Engineering Science disciplines. To enhance flexibility and encourage code reuse, the architecture provides a clean separation of the major aspects of finite element programming. Spatial discretization, temporal discretization, and the solution of linear and nonlinear systems of equations are each implemented separately, independent from the governing field equations. Other attractive features of the Delta architecture include support for constitutive models with internal variables, reusable "matrix-free" equation solvers, and support for region-to-region variations in the governing equations and the active degrees of freedom. A demonstration code built from the Delta architecture has been used in two-dimensional and three-dimensional simulations involving dynamic and quasi-static solid mechanics, transient and steady heat transport, and flow in porous media.
NBSymple: A Double Parallel, Symplectic N-body Code Running on Graphic Processing Units
NASA Astrophysics Data System (ADS)
Capuzzo-Dolcetta, R.; Mastrobuono-Battisti, A.
2010-10-01
NBSymple is a code that numerically integrates the equations of motion of N 'particles' that interact via Newtonian gravitation and move in a smooth external galactic field. The force on every particle is evaluated by direct summation of the contributions of all the other particles in the system, avoiding truncation error. The time integration is done with second-order and sixth-order symplectic schemes. NBSymple has been parallelized twice: by means of the Compute Unified Device Architecture (CUDA), to make the all-pair force evaluation as fast as possible on high-performance NVIDIA TESLA C 1060 Graphic Processing Units, while the O(N) computations are distributed over several CPUs by means of OpenMP. The code works in either single-precision or double-precision floating-point arithmetic. Single precision makes the best use of GPU performance but, of course, limits the precision of the simulation in some critical situations. We find a good compromise in using a software reconstruction of double precision for those variables that are most critical for the overall precision of the code.
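As an illustration of what a second-order symplectic scheme looks like, the sketch below implements the standard kick-drift-kick leapfrog in Python. The abstract does not say which second-order scheme NBSymple uses, so the choice of leapfrog is an assumption, and the function name `leapfrog` and the callable `accel` are hypothetical. In a code of this kind, `accel` would be the direct-summation force evaluation plus the external field.

```python
import numpy as np

def leapfrog(pos, vel, accel, dt, n_steps):
    """Kick-drift-kick leapfrog: second-order, symplectic, time-reversible.

    accel : callable mapping a position array to an acceleration array
            (hypothetical interface, supplied by the caller)
    """
    pos = np.array(pos, dtype=float)   # work on copies
    vel = np.array(vel, dtype=float)
    a = accel(pos)
    for _ in range(n_steps):
        vel += 0.5 * dt * a            # half kick
        pos += dt * vel                # full drift
        a = accel(pos)                 # one force evaluation per step
        vel += 0.5 * dt * a            # half kick
    return pos, vel
```

Only one force evaluation is needed per step, which matters when that evaluation is an all-pairs sum.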
NASA Astrophysics Data System (ADS)
Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.
2015-12-01
The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported to bring this package to the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, where tasks are compute- and memory-intensive applications. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are added, decreasing as much as possible the elapsed execution time of the studied routines. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes. Four case studies are presented: experimental variogram calculation, kriging estimation, sequential Gaussian simulation and sequential indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU-cores and a multi-node server with 128 CPU-nodes. Elapsed times, speedup and efficiency results are shown.
Parallel changes in mate-attracting calls and female preferences in autotriploid tree frogs
Tucker, Mitch A.; Gerhardt, H. C.
2012-01-01
For polyploid species to persist, they must be reproductively isolated from their diploid parental species, which coexist at the same time and place at least initially. In a complex of biparentally reproducing tetraploid and diploid tree frogs in North America, selective phonotaxis—mediated by differences in the pulse-repetition (pulse rate) of their mate-attracting vocalizations—ensures assortative mating. We show that artificially produced autotriploid females of the diploid species (Hyla chrysoscelis) show a shift in pulse-rate preference in the direction of the pulse rate produced by males of the tetraploid species (Hyla versicolor). The estimated preference function is centred near the mean pulse rate of the calls of artificially produced male autotriploids. Such a parallel shift, which is caused by polyploidy per se and whose magnitude is expected to be greater in autotetraploids, may have facilitated sympatric speciation by promoting reproductive isolation of the initially formed polyploids from their diploid parental forms. This process also helps to explain why tetraploid lineages with different origins have similar advertisement calls and freely interbreed. PMID:22113033
Parallel changes in mate-attracting calls and female preferences in autotriploid tree frogs.
Tucker, Mitch A; Gerhardt, H C
2012-04-22
For polyploid species to persist, they must be reproductively isolated from their diploid parental species, which coexist at the same time and place at least initially. In a complex of biparentally reproducing tetraploid and diploid tree frogs in North America, selective phonotaxis--mediated by differences in the pulse-repetition (pulse rate) of their mate-attracting vocalizations--ensures assortative mating. We show that artificially produced autotriploid females of the diploid species (Hyla chrysoscelis) show a shift in pulse-rate preference in the direction of the pulse rate produced by males of the tetraploid species (Hyla versicolor). The estimated preference function is centred near the mean pulse rate of the calls of artificially produced male autotriploids. Such a parallel shift, which is caused by polyploidy per se and whose magnitude is expected to be greater in autotetraploids, may have facilitated sympatric speciation by promoting reproductive isolation of the initially formed polyploids from their diploid parental forms. This process also helps to explain why tetraploid lineages with different origins have similar advertisement calls and freely interbreed.
Fortran code for SU(3) lattice gauge theory with and without MPI checkerboard parallelization
NASA Astrophysics Data System (ADS)
Berg, Bernd A.; Wu, Hao
2012-10-01
We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions (depending on the parameters set). For updating, the Cabibbo-Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors. Program summary Program title: STMC2LSU3MPI Catalogue identifier: AEMJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 26666 No. of bytes in distributed program, including test data, etc.: 233126 Distribution format: tar.gz Programming language: Fortran 77 compatible with the use of Fortran 90/95 compilers, in part with MPI extensions. Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions. Operating system: Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0, Centos 5.3 with OpenMPI + gfortran 4.1.2, Cray XT4 with MPICH2 + pgf90 11.2-0. Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions. Number of processors used: 2 to 11664 RAM: 200 Mega bytes per process. Classification: 11
Context Tree-Based Image Contour Coding Using a Geometric Prior.
Zheng, Amin; Cheung, Gene; Florencio, Dinei
2017-02-01
Efficient encoding of object contours in images can facilitate advanced image/video compression techniques, such as shape-adaptive transform coding or motion prediction of arbitrarily shaped pixel blocks. We study the problem of lossless and lossy compression of detected contours in images. Specifically, we first convert a detected object contour into a sequence of directional symbols drawn from a small alphabet. To encode the symbol sequence using arithmetic coding, we compute an optimal variable-length context tree (VCT) T via a maximum a posteriori (MAP) formulation to estimate symbols' conditional probabilities. MAP can avoid overfitting given a small training set X of past symbol sequences by identifying a VCT T with high likelihood P(X|T) of observing X given T, using a geometric prior P(T) stating that image contours are more often straight than curvy. For the lossy case, we design fast dynamic programming (DP) algorithms that optimally trade off coding rate of an approximate contour [Formula: see text] given a VCT T with two notions of distortion of [Formula: see text] with respect to the original contour x. To reduce the size of the DP tables, a total suffix tree is derived from a given VCT T for compact table entry indexing, reducing complexity. Experimental results show that for lossless contour coding, our proposed algorithm outperforms state-of-the-art context-based schemes consistently for both small and large training datasets. For lossy contour coding, our algorithms outperform comparable schemes in the literature in rate-distortion performance.
An object-oriented implementation of a parallel Monte Carlo code for radiation transport
NASA Astrophysics Data System (ADS)
Santos, Pedro Duarte; Lani, Andrea
2016-05-01
This paper describes the main features of a state-of-the-art Monte Carlo solver for radiation transport which has been implemented within COOLFluiD, a world-class open source object-oriented platform for scientific simulations. The Monte Carlo code makes use of efficient ray tracing algorithms (for 2D, axisymmetric and 3D arbitrary unstructured meshes) which are described in detail. The solver accuracy is first verified in testcases for which analytical solutions are available, then validated for a space re-entry flight experiment (i.e. FIRE II) for which comparisons against both experiments and reference numerical solutions are provided. Through the flexible design of the physical models, ray tracing and parallelization strategy (fully reusing the mesh decomposition inherited by the fluid simulator), the implementation was made efficient and reusable.
NASA Astrophysics Data System (ADS)
Sosedkin, A. P.; Lotov, K. V.
2016-09-01
LCODE is a freely distributed quasistatic 2D3V code for simulating plasma wakefield acceleration, specialized mainly for resource-efficient studies of long-term propagation of ultrarelativistic particle beams in plasmas. The beam is modeled with fully relativistic macro-particles in a simulation window co-propagating at the speed of light; the plasma can be simulated with either a kinetic or a fluid model. Several techniques are used to obtain exceptional numerical stability and precision while maintaining high resource efficiency, enabling LCODE to simulate the evolution of long particle beams over long propagation distances even on a laptop. A recent upgrade enabled LCODE to perform the calculations in parallel. A pipeline of several LCODE processes communicating via MPI (Message-Passing Interface) is capable of executing multiple consecutive time steps of the simulation in a single pass. This approach can speed up the calculations by hundreds of times.
On distributed memory MPI-based parallelization of SPH codes in massive HPC context
NASA Astrophysics Data System (ADS)
Oger, G.; Le Touzé, D.; Guibert, D.; de Leffe, M.; Biddiscombe, J.; Soumagne, J.; Piccinali, J.-G.
2016-03-01
Most particle methods share the problem of high computational cost, and in order to satisfy the demands of solvers, currently available hardware technologies must be fully exploited. Two complementary technologies are now accessible. On the one hand, there are CPUs, which can be structured into a multi-node framework, allowing massive data exchanges through a high speed network. In this case, each node usually comprises several cores available to perform multithreaded computations. On the other hand, there are GPUs, which are derived from graphics computing technologies and are able to perform highly multi-threaded calculations with hundreds of independent threads connected together through a common shared memory. This paper is primarily dedicated to the distributed memory parallelization of particle methods, targeting several thousands of CPU cores. The experience gained clearly shows that parallelizing a particle-based code on moderate numbers of cores can easily lead to an acceptable scalability, whilst a scalable speedup on thousands of cores is much more difficult to obtain. The discussion revolves around speeding up particle methods as a whole, in a massive HPC context, by making use of the MPI library. We focus on one particular particle method, Smoothed Particle Hydrodynamics (SPH), which is one of the most widespread today in the literature as well as in engineering.
MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions
NASA Astrophysics Data System (ADS)
Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.
2003-10-01
The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas requires nonlinear kinetic models and massive parallelization. We use Message Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov-Poisson system of equations on an 8-node dual-processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involve global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)
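The building block of a semi-Lagrangian time-splitting method is a 1D advection in which each grid value is obtained by following the characteristic backwards and interpolating the old solution there. The Python sketch below shows that step for a periodic grid. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and linear interpolation via `numpy.interp` replaces the higher-order interpolation normally used in such solvers.

```python
import numpy as np

def advect_semi_lagrangian(f, x, velocity, dt, period):
    """One semi-Lagrangian step for df/dt + velocity * df/dx = 0:
    f_new(x_i) = f_old(x_i - velocity * dt), evaluated by periodic
    linear interpolation of the old grid values."""
    return np.interp(x - velocity * dt, x, f, period=period)

def advect_x(f, x, v, dt, period):
    """Spatial advection of a phase-space array f[i, j] = f(x_i, v_j):
    each velocity column is shifted independently with its own speed,
    so the columns can be distributed over processors."""
    out = np.empty_like(f)
    for j, vj in enumerate(v):
        out[:, j] = advect_semi_lagrangian(f[:, j], x, vj, dt, period)
    return out
```

Because every column in `advect_x` is independent, the loop over `j` is the natural place to split the work; the velocity advection then needs the transposed layout, which is where the row-column exchanges mentioned in the abstract come in.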
A massively parallel method of characteristic neutral particle transport code for GPUs
Boyd, W. R.; Smith, K.; Forget, B.
2013-07-01
Over the past 20 years, parallel computing has enabled computers to grow ever larger and more powerful while scientific applications have advanced in sophistication and resolution. This trend is being challenged, however, as the power consumption for conventional parallel computing architectures has risen to unsustainable levels and memory limitations have come to dominate compute performance. Heterogeneous computing platforms, such as Graphics Processing Units (GPUs), are an increasingly popular paradigm for solving these issues. This paper explores the applicability of GPUs for deterministic neutron transport. A 2D method of characteristics (MOC) code - OpenMOC - has been developed with solvers for both shared memory multi-core platforms as well as GPUs. The multi-threading and memory locality methodologies for the GPU solver are presented. Performance results for the 2D C5G7 benchmark demonstrate 25-35 x speedup for MOC on the GPU. The lessons learned from this case study will provide the basis for further exploration of MOC on GPUs as well as design decisions for hardware vendors exploring technologies for the next generation of machines for scientific computing. (authors)
Kostin, Mikhail; Mokhov, Nikolai; Niita, Koji
2013-09-25
A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
OASIS4: An Efficient Parallel Code Coupler for Earth System Modelling
NASA Astrophysics Data System (ADS)
Coquart, L.; Valcke, S.; Redler, R.; Ritzdorf, H.
2009-04-01
As a new development step of the OASIS coupler family, we present OASIS4 in its latest version. OASIS4 is software that allows synchronized exchanges of coupling information between numerical codes representing different components of the climate system. The concepts of portability, flexibility, parallelism and efficiency are the main drivers for the OASIS4 development, with which we target the needs of Earth system modelling in its full complexity. The development and maintenance of OASIS4 has been supported by EU and institutional funding within the PRISM Support Initiative for the past seven years. Here we present the latest version of the OASIS4 coupling software, which now includes the commonly known point-based 2D and 3D interpolation schemes (bilinear, trilinear, bicubic, nearest neighbour) and 2D conservative remapping. Furthermore, the new version of the software now provides a complete parallel search that takes into account specific requirements at process boundaries in order to provide identical search results independently of the domain partitioning. The parallel "multi-grid" search keeps the CPU cost of the neighbourhood search low while showing good scalability when applied to grid partitioned domains. OASIS4 is currently used in a few climate applications, such as in the FP6 European GEMS project for the 3D coupling between atmosphere and atmosphere chemistry, by the Swedish Meteorological and Hydrological Institute (SMHI) for regional models covering the Arctic Sea or the Baltic area, and by the Calcul Intensif pour le CLimat et l'Environment (CICLE) project funded by the French "Agence Nationale de la Recherche".
NASA Astrophysics Data System (ADS)
Karl, Simon J.; Aarseth, Sverre J.; Naab, Thorsten; Haehnelt, Martin G.; Spurzem, Rainer
2015-09-01
We present a hybrid code combining the OpenMP-parallel tree code VINE with an algorithmic chain regularization scheme. The new code, called 'rVINE', aims to significantly improve the accuracy of close encounters of massive bodies with supermassive black holes (SMBHs) in galaxy-scale numerical simulations. We demonstrate the capabilities of the code by studying two test problems, the sinking of a single massive black hole to the centre of a gas-free galaxy due to dynamical friction and the hardening of an SMBH binary due to close stellar encounters. We show that results obtained with rVINE compare well with NBODY7 for problems with particle numbers that can be simulated with NBODY7. In particular, in both NBODY7 and rVINE we find a clear N-dependence of the binary hardening rate, a low binary eccentricity and moderate eccentricity evolution, as well as the conversion of the galaxy's inner density profile from a cusp to a core via the ejection of stars at high velocity. The much larger number of particles that can be handled by rVINE will open up exciting opportunities to model stellar dynamics close to SMBHs much more accurately in a realistic galactic context. This will help to remedy the inherent limitations of commonly used tree solvers to follow the correct dynamical evolution of black holes in galaxy-scale simulations.
Overview of development and design of MPACT: Michigan parallel characteristics transport code
Kochunas, B.; Collins, B.; Jabaay, D.; Downar, T. J.; Martin, W. R.
2013-07-01
MPACT (Michigan Parallel Characteristics Transport Code) is a new reactor analysis tool. It is being developed by students and research staff at the University of Michigan to be used for an advanced pin-resolved transport capability within VERA (Virtual Environment for Reactor Analysis). VERA is the end-user reactor simulation tool being produced by the Consortium for the Advanced Simulation of Light Water Reactors (CASL). The MPACT development project is itself unique for the way it is changing how students do research to achieve the instructional and research goals of an academic institution, while providing immediate value to industry. The MPACT code makes use of modern lean/agile software processes and extensive testing to maintain a level of productivity and quality required by CASL. MPACT's design relies heavily on object-oriented programming concepts and design patterns and is programmed in Fortran 2003. These designs are explained and illustrated as to how they can be readily extended to incorporate new capabilities and research ideas in support of academic research objectives. The transport methods currently implemented in MPACT include the 2-D and 3-D method of characteristics (MOC) and 2-D and 3-D method of collision direction probabilities (CDP). For the cross section resonance treatment, presently the subgroup method and the new embedded self-shielding method (ESSM) are implemented within MPACT. (authors)
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and the magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations. In simulations using the spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters with on the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To optimize further, we investigate three different algorithms of the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. Thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
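The first approach in this abstract, computing the Legendre polynomials once and reusing them inside the time loop, can be sketched in Python with NumPy. This is a simplified illustration and not Calypso code: it handles only the ordinary Legendre polynomials (harmonic order zero) on Gauss-Legendre nodes, runs on the CPU, and the function names `precompute`, `forward` and `backward` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def precompute(n_nodes):
    """Done once, before the time loop: Gauss-Legendre nodes and weights
    and the table P[k, l] = P_l(x_k) for l = 0 .. n_nodes - 1."""
    x, w = legendre.leggauss(n_nodes)
    P = legendre.legvander(x, n_nodes - 1)
    return x, w, P

def forward(f_vals, w, P):
    """Grid values -> Legendre coefficients,
    c_l = (2l + 1)/2 * sum_k w_k f(x_k) P_l(x_k)."""
    l = np.arange(P.shape[1])
    return (2 * l + 1) / 2 * (P.T @ (w * f_vals))

def backward(coeffs, P):
    """Legendre coefficients -> grid values, f(x_k) = sum_l c_l P_l(x_k)."""
    return P @ coeffs
```

Storing the table `P` trades memory for time, which is the space-versus-time trade-off the abstract mentions; recomputing the polynomials on the fly would avoid the storage at the price of extra arithmetic in every step.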
NeuCode labels with parallel reaction monitoring for multiplexed, absolute protein quantification
Potts, Gregory K.; Voigt, Emily A.; Bailey, Derek J.; Westphall, Michael S.; Hebert, Alexander S.; Yin, John; Coon, Joshua J.
2016-01-01
We introduce a new method to multiplex the throughput of samples for targeted mass spectrometry analysis. The current paradigm for obtaining absolute quantification from biological samples requires spiking isotopically heavy peptide standards into light biological lysates. Because each lysate must be run individually, this method places limitations on sample throughput and high demands on instrument time. When cell lines are first metabolically labeled with various neutron-encoded (NeuCode) lysine isotopologues possessing mDa mass differences from each other, heavy cell lysates may be mixed and spiked with an additional heavy peptide as an internal standard. We demonstrate that these NeuCode lysate peptides may be co-isolated with their internal standards, fragmented, and analyzed together using high resolving power parallel reaction monitoring (PRM). Instead of running each sample individually, these methods allow samples to be multiplexed to obtain absolute concentrations of target peptides in 5, 15, and even 25 biological samples at a time during single mass spectrometry experiments. PMID:26882330
Brock, R Scott; Hu, Xin-Hua; Yang, Ping; Lu, Jun
2005-07-11
A parallel Finite-Difference-Time-Domain (FDTD) code has been developed to numerically model the elastic light scattering by biological cells. Extensive validation and evaluation on various computing clusters demonstrated the high performance of the parallel code and its significant potential of reducing the computational cost of the FDTD method with low cost computer clusters. The parallel FDTD code has been used to study the problem of light scattering by a human red blood cell (RBC) of a deformed shape in terms of the angular distributions of the Mueller matrix elements. The dependence of the Mueller matrix elements on the shape and orientation of the deformed RBC has been investigated. Analysis of these data provides valuable insight on determination of the RBC shapes using the method of elastic light scattering measurements.
Implementation of a tree-code for numerical simulations of stellar systems
NASA Astrophysics Data System (ADS)
Marinho, Eraldo Pereira
1991-10-01
An implementation of a tree code for the force calculation in gravitational N-body systems simulations is presented. The technique consists of virtualizing the entire system in a tree data-structure, which reduces the computational effort to Θ(N log N) instead of the Θ(N²) typical of direct summation. The adopted time integrator is the simple leap-frog with second-order accuracy. A brief discussion about the truncation-error effects on the morphology of the system shows them to be essentially negligible. However, these errors do propagate in a Markovian way if a potential-adaptive time-step is used in order to maintain the expected truncation-error approximately constant in the entire system. The tests show that, even with totally arbitrary distributions, the total computation time obeys Θ(N log N). As an application of the code, we evolved an initially cold and homogeneous sphere of point masses to simulate a primordial process of galaxy formation. The evolution of the global entropy of the system suggests that a quasi-equilibrium configuration is achieved after approximately 2 × 10⁹ years. It is shown that the final configuration displays a close resemblance to the well observed giant elliptical galaxies, in both kinematical and luminosity distribution properties. A discussion is given on the evolution of the important dynamic quantities characterizing the model. During all the computations, the energy is conserved to better than 0.1 percent.
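The second-order leap-frog integrator named in the abstract above can be sketched as follows. This is an illustration only: it uses the kick-drift-kick form on a one-dimensional harmonic oscillator instead of a tree-based gravitational force, and the function names, time step and step count are assumptions, not values from the paper.

```python
def leapfrog_step(x, v, accel, dt):
    """Advance position x and velocity v by one kick-drift-kick leap-frog step."""
    v_half = v + 0.5 * dt * accel(x)           # half kick with the old force
    x_new = x + dt * v_half                    # full drift
    v_new = v_half + 0.5 * dt * accel(x_new)   # half kick with the new force
    return x_new, v_new


def oscillator_energy(x, v):
    """Total energy of a unit-mass, unit-frequency harmonic oscillator."""
    return 0.5 * v * v + 0.5 * x * x


# Toy stand-in for the N-body force: a(x) = -x.
x, v = 1.0, 0.0
e0 = oscillator_energy(x, v)
for _ in range(10_000):
    x, v = leapfrog_step(x, v, lambda q: -q, 0.01)
relative_energy_error = abs(oscillator_energy(x, v) - e0) / e0
```

In a tree code, `accel` would be replaced by the force evaluated through the tree; the stepping pattern itself would stay the same.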
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A.
2016-01-01
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for real-world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantage of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications. PMID:27515908
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding.
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A
2016-08-12
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for real-world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantage of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications.
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding
NASA Astrophysics Data System (ADS)
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A.
2016-08-01
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for real-world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantage of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications.
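The classical ingredient mentioned in the abstract above, generating threshold shares with Lagrange interpolation polynomials, can be sketched with the textbook Shamir construction. This is not the paper's scheme: the m-bonacci sequences, the OAM pump and the reverse Huffman-Fibonacci-tree coding are omitted, and the field modulus, function names and parameters are assumptions chosen for illustration.

```python
import random

PRIME = 2**61 - 1  # field modulus (a Mersenne prime), assumed for this sketch


def make_shares(secret, threshold, n_shares, rng):
    """Hide `secret` as the constant term of a random polynomial of degree
    threshold - 1 and return the points (x, f(x)) for x = 1..n_shares."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 from the given shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = make_shares(123456789, threshold=3, n_shares=5, rng=random.Random(0))
recovered = recover_secret(shares[:3])
```

Any subset of at least `threshold` shares recovers the same value, which corresponds to the "no less than threshold-value number of participants" property described above. The modular inverse via `pow(den, -1, PRIME)` needs Python 3.8 or later.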
Stankovski, Z.
1995-12-31
The collision probability method in neutron transport, as applied to 2D geometries, consumes a great amount of computer time: for a typical 2D assembly calculation, about 90% of the computing time is spent in the collision probability evaluations. Consequently, RZ or 3D calculations become prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message-passing host/node programming model. Parallelization was applied to the energy group treatment. Such an approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for an industrial code. Sequential performance is also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SPI and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SPI. Because of the heterogeneity of the workstation network, the author did not expect high performance from this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors.
Parallel Adaptive Mesh Refinement Library
NASA Technical Reports Server (NTRS)
Mac-Neice, Peter; Olson, Kevin
2005-01-01
Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.
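The tree of sub-grid blocks described in the abstract above can be illustrated with a small two-dimensional quad-tree. This is a sketch under stated assumptions, not PARAMESH's Fortran 90 interface: the `Block` class, the refinement criterion (refine any block containing the origin) and the level limit are hypothetical.

```python
class Block:
    """A square grid block; each block is a node in a quad-tree of sub-grids."""

    def __init__(self, x, y, size, level=0):
        self.x, self.y, self.size, self.level = x, y, size, level
        self.children = []

    def refine(self, needs_refinement, max_level):
        """Split this block into four children wherever the criterion asks for it."""
        if self.level < max_level and needs_refinement(self):
            half = self.size / 2
            self.children = [
                Block(self.x + dx * half, self.y + dy * half, half, self.level + 1)
                for dy in (0, 1)
                for dx in (0, 1)
            ]
            for child in self.children:
                child.refine(needs_refinement, max_level)

    def leaves(self):
        """Return the unrefined blocks, which together tile the whole domain."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def contains_origin(block):
    """Hypothetical criterion: refine blocks whose half-open extent holds (0, 0)."""
    return (block.x <= 0.0 < block.x + block.size
            and block.y <= 0.0 < block.y + block.size)


root = Block(-1.0, -1.0, 2.0)
root.refine(contains_origin, max_level=3)
leaves = root.leaves()
```

Here the resolution increases only near the origin, so the leaf blocks have different sizes while still covering the domain exactly once.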
SPILADY: A parallel CPU and GPU code for spin-lattice magnetic molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Ma, Pui-Wai; Dudarev, S. L.; Woo, C. H.
2016-10-01
Spin-lattice dynamics generalizes molecular dynamics to magnetic materials, where dynamic variables describing an evolving atomic system include not only coordinates and velocities of atoms but also directions and magnitudes of atomic magnetic moments (spins). Spin-lattice dynamics simulates the collective time evolution of spins and atoms, taking into account the effect of non-collinear magnetism on interatomic forces. Applications of the method include atomistic models for defects, dislocations and surfaces in magnetic materials, thermally activated diffusion of defects, magnetic phase transitions, and various magnetic and lattice relaxation phenomena. Spin-lattice dynamics retains all the capabilities of molecular dynamics, adding to them the treatment of non-collinear magnetic degrees of freedom. The spin-lattice dynamics time integration algorithm uses symplectic Suzuki-Trotter decomposition of atomic coordinate, velocity and spin evolution operators, and delivers highly accurate numerical solutions of dynamic evolution equations over extended intervals of time. The code is parallelized in coordinate and spin spaces, and is written in OpenMP C/C++ for CPU and in CUDA C/C++ for Nvidia GPU implementations. Temperatures of atoms and spins are controlled by Langevin thermostats. Conduction electrons are treated by coupling the discrete spin-lattice dynamics equations for atoms and spins to the heat transfer equation for the electrons. Worked examples include simulations of thermalization of ferromagnetic bcc iron, the dynamics of laser pulse demagnetization, and collision cascades.
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes
NASA Astrophysics Data System (ADS)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
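The partitioning idea in the abstract above, equal shares of photons per processor with a separate random stream for each, can be sketched serially. The sketch assumes that a loop over ranks stands in for MPI processes and that Python's `random.Random` with a per-rank seed stands in for SPRNG; such seeding does not give SPRNG's guarantees of uncorrelated streams. The detection probability of 0.25 and all names are hypothetical.

```python
import random


def partition_photons(n_photons, n_procs):
    """Split n_photons as evenly as possible; the first ranks absorb any remainder."""
    base, extra = divmod(n_photons, n_procs)
    return [base + (1 if rank < extra else 0) for rank in range(n_procs)]


def run_rank(rank, n_local, base_seed=2002):
    """Simulate n_local photon histories on one rank with its own random stream.
    The 'physics' is a toy: each photon is detected with probability 0.25."""
    rng = random.Random(f"{base_seed}-{rank}")
    return sum(1 for _ in range(n_local) if rng.random() < 0.25)


counts = partition_photons(100_000, 32)
detected = sum(run_rank(rank, n) for rank, n in enumerate(counts))
fraction = detected / sum(counts)
```

Because each rank works only on its own photons and its own stream, the only communication needed is the final sum, which is consistent with the near-linear speed-up the abstract reports.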
Seeing the forest for the trees: Networked workstations as a parallel processing computer
NASA Technical Reports Server (NTRS)
Breen, J. O.; Meleedy, D. M.
1992-01-01
Unlike traditional 'serial' processing computers, in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units and thereby perform several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system cannot rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems
1993-01-01
Procassini, R.J.
1997-12-31
The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution of particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.
Impact of parallel heterogeneity on a continuum model of the pulmonary arterial tree.
Krenz, G S; Lin, J; Dawson, C A; Linehan, J H
1994-08-01
Model arterial trees were constructed following rules consistent with morphometric data, $N_j = (D_j/D_a)^{-\beta_1}$ and $L_j = L_a (D_j/D_a)^{\beta_2}$, where $N_j$, $D_j$, and $L_j$ are number, diameter, and length, respectively, of vessels in the jth level; $D_a$ and $L_a$ are diameter and length, respectively, of the inlet artery, and $-\beta_1$ and $\beta_2$ are power law slopes relating vessel number and length, respectively, to vessel diameter. Simulated heterogeneous trees approximating these rules were constructed by assigning vessel diameters $D_m = D_a [2/(m + 1)]^{1/\beta_1}$, such that $m - 1$ vessels were larger than $D_m$ (vessel length proportional to diameter). Vessels were connected, forming random bifurcating trees. Longitudinal intravascular pressure $P(Q_{cum})$ with respect to cumulative vascular volume $Q_{cum}$ was computed for Poiseuille flow. Strahler-ordered tree morphometry yielded estimates of $L_a$, $D_a$, $\beta_1$, $\beta_2$, and mean number ratio ($B$); $B$ is defined by $N_{k+1} = B^k$, where $k$ is total number of Strahler orders minus Strahler order number. The parameters were used in $P(Q_{cum}) = P_a$ [formula: see text] and the resulting $P(Q_{cum})$ relationship was compared with that of the simulated tree, where $P_a$ is total arterial pressure drop, $Q$ is flow rate, $R_a = 128 \mu L_a / (\pi D_a^4)$ (where $\mu$ is blood viscosity), and $Q_a$ (volume of inlet artery) $= \frac{1}{4} \pi D_a^2 L_a$. Results indicate that the equation, originally developed for homogeneous trees (J. Appl. Physiol. 72: 2225-2237, 1992), provides a good approximation to the heterogeneous tree $P(Q_{cum})$.
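The diameter assignment rule in the abstract above, D_m = D_a [2/(m + 1)]^(1/beta_1), can be checked numerically with a short sketch. The inlet diameter of 1.0 and the slope beta_1 = 2.5 are hypothetical values picked for illustration; they are not taken from the paper's morphometric data.

```python
def vessel_diameters(d_inlet, beta1, n_vessels):
    """Diameters D_m = D_a * (2 / (m + 1)) ** (1 / beta1) for m = 1..n_vessels."""
    return [d_inlet * (2.0 / (m + 1)) ** (1.0 / beta1)
            for m in range(1, n_vessels + 1)]


diameters = vessel_diameters(d_inlet=1.0, beta1=2.5, n_vessels=8)

# For each vessel m, count how many vessels are strictly larger than D_m.
larger_than = [sum(1 for d in diameters if d > dm) for dm in diameters]
```

Since the diameters decrease strictly with m, exactly m - 1 vessels are larger than D_m, as the abstract states, and the first vessel (m = 1) has the inlet diameter.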
CMAD: A Self-consistent Parallel Code to Simulate the Electron Cloud Build-up and Instabilities
Pivi, M.T.F.; /SLAC
2007-11-07
We present the features of CMAD, a newly developed self-consistent code which simulates both the electron cloud build-up and related beam instabilities. By means of parallel (Message Passing Interface - MPI) computation, the code tracks the beam in an existing (MAD-type) lattice and continuously resolves the interaction between the beam and the cloud at each element location, with different cloud distributions at each magnet location. The goal of CMAD is to simulate single- and coupled-bunch instability, allowing tune shift, dynamic aperture and frequency map analysis and the determination of the secondary electron yield instability threshold. The code is in its phase of development and benchmarking with existing codes. Preliminary results on benchmarking are presented in this paper.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P
Candel, A; Kabel, A.; Lee, L.; Li, Z.; Ng, C.; Schussman, G.; Ko, K.; Syratchev, I.; /CERN
2009-06-19
In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
NASA Astrophysics Data System (ADS)
Prashantha Kumar, H.; Sripati, U.; Shetty, K. Rajesh
2012-05-01
In this article, we propose a high-speed decoding algorithm for binary BCH codes that can correct up to 7 bits in error. Evaluation of the error-locator polynomial is the most complicated and time-consuming step in the decoding of a BCH code. We have derived equations for specifying the coefficients of the error-locator polynomial, which can form the basis for the development of a parallel architecture for the decoder. This approach has the advantage that all the coefficients of the error locator polynomial are computed in parallel (in one step). The roots of error-locator polynomial can be obtained by Chien's search and inverting these roots gives the error locations. This algorithm can be employed in any application where high-speed decoding of data encoded by a binary BCH code is required. One important application is in Flash memories where data integrity is preserved using a long, high-rate binary BCH code. We have synthesized generator polynomials for binary BCH codes (error-correcting capability, s ? ) that can be employed in Flash memory devices to improve the integrity of information storage. The proposed decoding algorithm can be used as an efficient, high-speed decoder in this important application.
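The last decoding step named in the abstract above, finding the roots of the error-locator polynomial by Chien's search and inverting them to get the error locations, can be sketched over a small field. The sketch assumes GF(2^4) with primitive polynomial x^4 + x + 1 and code length 15, and it builds the locator directly from known error positions for demonstration; the paper's equations for deriving the locator coefficients from the syndromes are not reproduced. All names are hypothetical.

```python
N = 15  # code length 2**4 - 1 for this toy field


def build_gf16():
    """Antilog/log tables for GF(2^4) generated by x^4 + x + 1 (0x13)."""
    exp, log = [0] * (2 * N), [0] * (N + 1)
    x = 1
    for i in range(N):
        exp[i], log[x] = x, i
        x <<= 1
        if x & 0x10:
            x ^= 0x13
    for i in range(N, 2 * N):
        exp[i] = exp[i - N]
    return exp, log


EXP, LOG = build_gf16()


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, x):
    """Evaluate a polynomial (coeffs[k] multiplies x^k) by Horner's rule."""
    y = 0
    for c in reversed(coeffs):
        y = gf_mul(y, x) ^ c
    return y


def locator_from_positions(positions):
    """Build sigma(x) = product over error positions p of (1 + alpha^p x)."""
    sigma = [1]
    for p in positions:
        a = EXP[p % N]
        nxt = sigma + [0]
        for k, c in enumerate(sigma):
            nxt[k + 1] ^= gf_mul(c, a)
        sigma = nxt
    return sigma


def chien_search(sigma):
    """Try every alpha^(-i); position i is in error when sigma(alpha^(-i)) == 0."""
    return [i for i in range(N) if poly_eval(sigma, EXP[(N - i) % N]) == 0]


found = chien_search(locator_from_positions([2, 7, 11]))
```

Each candidate position is tested independently of the others, which is why this step lends itself to the parallel hardware evaluation the abstract has in mind.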
Candel, A.E.; Kabel, A.C.; Ko, Yong-kyu; Lee, L.; Li, Z.; Limborg-Deprey, C.; Ng, C.K.; Prudencio, E.E.; Schussman, G.L.; Uplenchwar, R.; /SLAC
2007-11-07
Over the past years, SLAC's Advanced Computations Department (ACD) has developed the parallel finite element (FE) particle-in-cell code Pic3P (Pic2P) for simulations of beam-cavity interactions dominated by space-charge effects. As opposed to standard space-charge dominated beam transport codes, which are based on the electrostatic approximation, Pic3P (Pic2P) includes space-charge, retardation and boundary effects as it self-consistently solves the complete set of Maxwell-Lorentz equations using higher-order FE methods on conformal meshes. Use of efficient, large-scale parallel processing allows for the modeling of photoinjectors with unprecedented accuracy, aiding the design and operation of the next-generation of accelerator facilities. Applications to the Linac Coherent Light Source (LCLS) RF gun are presented.
Seedling establishment in a masting desert shrub parallels the pattern for forest trees
Susan E. Meyer; Burton K. Pendleton
2015-01-01
The masting phenomenon along with its accompanying suite of seedling adaptive traits has been well studied in forest trees but has rarely been examined in desert shrubs. Blackbrush (Coleogyne ramosissima) is a regionally dominant North American desert shrub whose seeds are produced in mast events and scatter-hoarded by rodents. We followed the fate of seedlings in...
Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H. Fatih; Goren, Sezer
2016-01-01
The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed. PMID:27733925
Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H Fatih; Goren, Sezer; Aydin, Nizamettin
2016-09-01
The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed.
Parallel Subspace Subcodes of Reed-Solomon Codes for Magnetic Recording Channels
ERIC Educational Resources Information Center
Wang, Han
2010-01-01
Read channel architectures based on a single low-density parity-check (LDPC) code are being considered for the next generation of hard disk drives. However, LDPC-only solutions suffer from the error floor problem, which may compromise reliability, if not handled properly. Concatenated architectures using an LDPC code plus a Reed-Solomon (RS) code…
A Multiple Sphere T-Matrix Fortran Code for Use on Parallel Computer Clusters
NASA Technical Reports Server (NTRS)
Mackowski, D. W.; Mishchenko, M. I.
2011-01-01
A general-purpose Fortran-90 code for calculation of the electromagnetic scattering and absorption properties of multiple sphere clusters is described. The code can calculate the efficiency factors and scattering matrix elements of the cluster for either fixed or random orientation with respect to the incident beam and for plane wave or localized-approximation Gaussian incident fields. In addition, the code can calculate maps of the electric field both interior and exterior to the spheres. The code is written with message passing interface instructions to enable the use on distributed memory compute clusters, and for such platforms the code can make feasible the calculation of absorption, scattering, and general EM characteristics of systems containing several thousand spheres.
Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; Sato, Mitsuhisa; Tang, William; Wang, Bei
2016-06-01
Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
ERIC Educational Resources Information Center
Al-Khaja, Nawal
2007-01-01
This is a thematic lesson plan for young learners about palm trees and the importance of taking care of them. The two part lesson teaches listening, reading and speaking skills. The lesson includes parts of a tree; the modal auxiliary, can; dialogues and a role play activity.
NASA Astrophysics Data System (ADS)
Marx, Alain; Lütjens, Hinrich
2017-03-01
A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows simulations to be performed at higher resolutions, which were previously ruled out by memory limitations.
Liu, Shifeng; Zhu, Dan; Wei, Zhengwu; Pan, Shilong
2014-07-01
A photonic approach for the generation of a widely tunable arbitrarily phase-coded microwave signal based on a dual-parallel polarization modulator (DP-PolM) is proposed and demonstrated without using any optical or electrical filter. Two orthogonally polarized ± first-order optical sidebands with suppressed carrier are generated based on the DP-PolM, and their polarization directions are aligned with the two principal axes of the following PolM. Phase coding is implemented at a following PolM driven by an electrical coding signal. The inherent frequency-doubling operation can make the system work at a frequency beyond the operation bandwidth of the DP-PolM and the 90° hybrid. Because no optical or electrical filter is applied, good frequency tunability is realized. An experiment is performed. The generation of phase-coded signals tuning from 10 to 40 GHz with up to 10 Gbit/s coding rates is verified.
NASA Astrophysics Data System (ADS)
Thonhofer, Stefan; Rubio, Luis R. Bellot; Utz, Dominik; Hanslmeier, Arnold; Jurčák, Jan
2015-10-01
Magnetic fields are one of the most important drivers of the highly dynamic processes that occur in the lower solar atmosphere. They span a broad range of sizes, from large- and intermediate-scale structures such as sunspots, pores and magnetic knots, down to the smallest magnetic elements observable with current telescopes. On small scales, magnetic flux tubes are often visible as Magnetic Bright Points (MBPs). Apart from simple V/I magnetograms, the most common method to deduce their magnetic properties is the inversion of spectropolarimetric data. Here we employ the SIR code for that purpose. SIR is a well-established tool that can derive not only the magnetic field vector and other atmospheric parameters (e.g., temperature, line-of-sight velocity), but also their stratifications with height, effectively producing 3-dimensional models of the lower solar atmosphere. In order to enhance the runtime performance and the usability of SIR we parallelized the existing code and standardized the input and output formats. This and other improvements make it feasible to invert extensive high-resolution data sets within a reasonable amount of computing time. An evaluation of the speedup of the parallel SIR code shows a substantial improvement in runtime.
NASA Astrophysics Data System (ADS)
Cattania, C.; Khalid, F.
2016-09-01
The estimation of space- and time-dependent earthquake probabilities, including aftershock sequences, has received increased attention in recent years, and Operational Earthquake Forecasting systems are currently being implemented in various countries. Physics-based earthquake forecasting models compute time-dependent earthquake rates based on Coulomb stress changes, coupled with seismicity evolution laws derived from rate-state friction. While early implementations of such models typically performed poorly compared to statistical models, recent studies indicate that significant performance improvements can be achieved by considering the spatial heterogeneity of the stress field and secondary sources of stress. However, the major drawback of these methods is a rapid increase in computational costs. Here we present a code to calculate seismicity induced by time-dependent stress changes. An important feature of the code is the possibility to include aleatoric uncertainties due to the existence of multiple receiver faults and to the finite grid size, as well as epistemic uncertainties due to the choice of input slip model. To compensate for the growth in computational requirements, we have parallelized the code for shared memory systems (using OpenMP) and distributed memory systems (using MPI). Performance tests indicate that these parallelization strategies lead to a significant speedup for problems with different degrees of complexity, ranging from those which can be solved on standard multicore desktop computers, to those requiring a small cluster, to a large simulation that can be run using up to 1500 cores.
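The abstract above does not state which seismicity evolution law the code uses. As a rough illustration only, the sketch below evaluates one widely used rate-state result, the response of the seismicity rate to a single Coulomb stress step (Dieterich, 1994). The function name and the parameter names (`delta_cfs`, `a_sigma`, `t_a`) are assumptions made for this example and are not taken from the code described in the record.

```python
import math


def seismicity_rate_ratio(delta_cfs, a_sigma, t, t_a):
    """Seismicity rate relative to the background rate, a time t after a stress step.

    delta_cfs : Coulomb stress change (same units as a_sigma)
    a_sigma   : rate-state constitutive parameter A times normal stress
    t         : time since the stress step
    t_a       : aftershock relaxation time (same units as t)
    """
    gamma = (math.exp(-delta_cfs / a_sigma) - 1.0) * math.exp(-t / t_a) + 1.0
    return 1.0 / gamma


# Immediately after a positive stress step the rate jumps by exp(delta_cfs / a_sigma),
# then decays back towards the background rate (ratio 1) over a few t_a.
for t in (0.0, 1.0, 10.0, 100.0):
    print(t, seismicity_rate_ratio(0.1, 0.05, t, 10.0))
```

Each grid cell can be evaluated independently of the others, which is one reason such calculations suit both OpenMP and MPI parallelization.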
Gwo, Jin-Ping; Yeh, Gour-Tsyh
1997-02-01
The objectives of this study are (1) to parallelize a 3-dimensional hydrogeochemistry code and (2) to apply the parallel code to a proposed waste disposal site at the Oak Ridge National Laboratory (ORNL). The 2-dimensional hydrogeochemistry code HYDROGEOCHEM, developed at the Pennsylvania State University for coupled subsurface solute transport and chemical equilibrium processes, was first modified to accommodate 3-dimensional problem domains. A bi-conjugate gradient stabilized linear matrix solver was then incorporated to solve the matrix equation. We chose to parallelize the 3-dimensional code on the Intel Paragons at ORNL by using an HPF (high performance FORTRAN) compiler developed at PGI. The data- and task-parallel algorithms available in the HPF compiler proved to be highly efficient for the geochemistry calculation. This calculation can be easily implemented in HPF formats and is perfectly parallel because the chemical speciation on one finite-element node is virtually independent of those on the others. The parallel code was applied to a subwatershed of the Melton Branch at ORNL. Chemical heterogeneity, in addition to physical heterogeneities of the geological formations, has been identified as one of the major factors that affect the fate and transport of contaminants at ORNL. This study demonstrated an application of the 3-dimensional hydrogeochemistry code on the Melton Branch site. A uranium tailing problem that involved aqueous complexation and precipitation-dissolution was tested. Performance statistics were collected on the Intel Paragons at ORNL. Implications of these results on the further optimization of the code were discussed.
NASA Astrophysics Data System (ADS)
Germanas, D.; Stepšys, A.; Mickevičius, S.; Kalinauskas, R. K.
2017-06-01
This is a new version of the HOTB code designed to calculate three and four particle harmonic oscillator (HO) transformation brackets and their matrices. The new version uses the OpenMP parallel communication standard for calculations of harmonic oscillator transformation brackets. A package of Fortran code is presented. Calculation time of large matrices, orthogonality conditions and array of coefficients can be significantly reduced using effective parallel code. Other functionalities of the original code (for example calculation of single harmonic oscillator brackets) have not been modified.
2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation
Warren, Michael S.
2014-01-01
We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k (2^18) processors. We present error analysis and scientific application results from a series of more than ten 69 billion (4096^3) particle cosmological simulations, accounting for 4×10^20 floating point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracy and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods.
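The abstract above does not describe how tree cells are keyed. As a hedged illustration of the general idea behind a hashed oct-tree, the sketch below builds a Morton (Z-order) key by interleaving the bits of three integer cell coordinates, with a leading placeholder bit so that keys at different tree levels stay distinct. The function names, the bit ordering and the default of 21 bits per axis are assumptions for this example, not details confirmed by the record.

```python
def morton_key(ix, iy, iz, bits=21):
    """Interleave the low `bits` bits of (ix, iy, iz) into a single integer key.

    The key starts from a single 1 bit (a placeholder), so the number of
    bits in the key encodes the depth of the cell in the tree.
    """
    key = 1  # placeholder bit marking the top of the key
    for b in range(bits - 1, -1, -1):
        octant = (((ix >> b) & 1) << 2) | (((iy >> b) & 1) << 1) | ((iz >> b) & 1)
        key = (key << 3) | octant
    return key


def parent_key(key):
    """Key of the parent cell: drop the three lowest bits (one octant digit)."""
    return key >> 3


# A dictionary stands in for the hash table that maps keys to cell data.
cells = {morton_key(3, 2, 1, bits=2): "leaf cell"}
print(bin(morton_key(3, 2, 1, bits=2)), bin(parent_key(morton_key(3, 2, 1, bits=2))))
```

With keys of this form, parent and child lookups reduce to bit shifts plus a hash-table access, without pointer chasing.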
NASA Astrophysics Data System (ADS)
Reuter, K.; Jenko, F.; Forest, C. B.; Bayliss, R. A.
2008-08-01
A parallel implementation of a nonlinear pseudo-spectral MHD code for the simulation of turbulent dynamos in spherical geometry is reported. It employs a dual domain decomposition technique in both real and spectral space. This method is shown to scale nearly ideally up to 128 CPUs on Beowulf-type clusters with fast interconnect. Furthermore, the potential of exploiting single precision arithmetic on standard x86 processors is examined. The MHD code thereby achieves a maximum speedup of 1.7, while the computations remain valid. The combination of both measures will allow for the direct numerical simulation of highly turbulent cases ( 1500
Lima, Thálitta Hetamaro Ayala; Buttura, Renato Vidal; Donadi, Eduardo Antônio; Veiga-Castelli, Luciana Caricati; Mendes-Junior, Celso Teixeira; Castelli, Erick C
2016-10-01
Human Leucocyte Antigen F (HLA-F) is a non-classical HLA class I gene distinguished from its classical counterparts by low allelic polymorphism and distinctive expression patterns. Its exact function remains unknown. It is believed that HLA-F has tolerogenic and immune modulatory properties. Currently, there is little information regarding the HLA-F allelic variation among human populations and the available studies have evaluated only a fraction of the HLA-F gene segment and/or have searched for known alleles only. Here we present a strategy to evaluate the complete HLA-F variability including its 5' upstream, coding and 3' downstream segments by using massively parallel sequencing procedures. HLA-F variability was surveyed on 196 individuals from the Brazilian Southeast. The results indicate that the HLA-F gene is indeed conserved at the protein level, where thirty coding haplotypes or coding alleles were detected, encoding only four different HLA-F full-length protein molecules. Moreover, the same protein molecule is encoded by 82.45% of all coding alleles detected in this Brazilian population sample. However, the HLA-F nucleotide and haplotype variability is much higher than our current knowledge both in Brazilians and considering the 1000 Genomes Project data. This protein conservation is probably a consequence of the key role of HLA-F in the immune system physiology.
Full Wave Parallel Code for Modeling RF Fields in Hot Plasmas
NASA Astrophysics Data System (ADS)
Spencer, Joseph; Svidzinski, Vladimir; Evstatiev, Evstati; Galkin, Sergei; Kim, Jin-Soo
2015-11-01
FAR-TECH, Inc. is developing a suite of full wave RF codes in hot plasmas. It is based on a formulation in configuration space with grid adaptation capability. The conductivity kernel (which includes a nonlocal dielectric response) is calculated by integrating the linearized Vlasov equation along unperturbed test particle orbits. For Tokamak applications a 2-D version of the code is being developed. Progress of this work will be reported. This suite of codes has the following advantages over existing spectral codes: 1) It utilizes the localized nature of plasma dielectric response to the RF field and calculates this response numerically without approximations. 2) It uses an adaptive grid to better resolve resonances in plasma and antenna structures. 3) It uses an efficient sparse matrix solver to solve the formulated linear equations. The linear wave equation is formulated using two approaches: for cold plasmas the local cold plasma dielectric tensor is used (resolving resonances by particle collisions), while for hot plasmas the conductivity kernel is calculated. Work is supported by the U.S. DOE SBIR program.
Kauer, J S
1991-02-01
Odor information appears to be encoded by activity distributed across many neurons at each level in the olfactory pathway. Thus olfactory circuits function as parallel distributed processors. New methods for observing distributed activity in such systems permit computer simulations to be constructed that are constrained by patterns of activity observed in the real system. Analysis of the system using a combination of physiological measurements and computational approaches might elucidate the principles by which odors are discriminated.
A user's guide for BREAKUP: A computer code for parallelizing the overset grid approach
Barnette, D.W.
1998-04-01
In this user's guide, details for running BREAKUP are discussed. BREAKUP allows the widely used overset grid method to be run in a parallel computer environment to achieve faster run times for computational field simulations over complex geometries. The overset grid method permits complex geometries to be divided into separate components. Each component is then gridded independently. The grids are computationally rejoined in a solver via interpolation coefficients used for grid-to-grid communications of boundary data. Overset grids have been in widespread use for many years on serial computers, and several well-known Navier-Stokes flow solvers have been extensively developed and validated to support their use. One drawback of serial overset grid methods has been the extensive compute time required to update flow solutions one grid at a time. Parallelizing the overset grid method overcomes this limitation by updating each grid or subgrid simultaneously. BREAKUP prepares overset grids for parallel processing by subdividing each overset grid into statically load-balanced subgrids. Two-dimensional examples with sample solutions, and three-dimensional examples, are presented.
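The record above does not say how BREAKUP chooses its subgrids. As a minimal sketch of what static load balancing means, the example below splits the index range of a one-dimensional grid into nearly equal contiguous pieces, one per processor. The function name `split_grid` is hypothetical, and the sketch ignores the overlap regions and interpolation data that a real overset-grid decomposition must also handle.

```python
def split_grid(n_points, n_parts):
    """Split n_points grid indices into n_parts contiguous (start, stop) ranges.

    Piece sizes differ by at most one point, so every processor receives
    roughly the same amount of work before the run starts (static balancing).
    """
    base, extra = divmod(n_points, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)  # first `extra` parts get one more point
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_grid(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

The split is computed once, before the solver runs, which is what distinguishes a static scheme from one that rebalances during the simulation.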
NASA Astrophysics Data System (ADS)
Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark
2014-10-01
Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
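The abstract above mentions parallel tempering without spelling out the exchange step. As a hedged illustration, the sketch below gives the standard Metropolis acceptance rule for swapping the configurations of two replicas held at different inverse temperatures. The function names are assumptions for this example, and the code is plain Python rather than the CUDA implementation described in the record.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Acceptance probability min(1, exp((beta_i - beta_j) * (E_i - E_j)))."""
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    # Avoid calling exp() on a large positive number: the probability is 1 anyway.
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_swap(beta_i, beta_j, energy_i, energy_j, rng=random):
    """Return True if the two replicas should exchange configurations."""
    return rng.random() < swap_probability(beta_i, beta_j, energy_i, energy_j)


# The colder replica (larger beta) holding the higher energy always swaps.
print(swap_probability(1.0, 0.5, -10.0, -12.0))
# The reverse situation is accepted only with probability exp(-1).
print(swap_probability(1.0, 0.5, -12.0, -10.0))
```

Exchanges of this kind let configurations stuck in a low-temperature replica escape through the hotter replicas, which helps with the long relaxation times noted above.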
Zhang, S.; Yuen, D.A.; Zhu, A.; Song, S.; George, D.L.
2011-01-01
We parallelized the GeoClaw code on a one-level grid using OpenMP in March 2011 to meet the urgent need of simulating near-shore tsunami waves from Tohoku 2011, and achieved over 75% of the potential speed-up on an eight-core Dell Precision T7500 workstation [1]. After submitting that work to SC11 - the International Conference for High Performance Computing, we obtained an unreleased OpenMP version of GeoClaw from David George, who developed the GeoClaw code as part of his Ph.D. thesis. In this paper, we will show the complementary characteristics of the two approaches used in parallelizing GeoClaw and the speed-up obtained by combining the advantage of each of the two individual approaches with adaptive mesh refinement (AMR), demonstrating the capabilities of running GeoClaw efficiently on many-core systems. We will also show a novel simulation of the Tohoku 2011 Tsunami waves inundating the Sendai airport and Fukushima Nuclear Power Plants, over which the finest grid distance of 20 meters is achieved through a 4-level AMR. This simulation yields quite good predictions about the wave-heights and travel time of the tsunami waves. © 2011 IEEE.
Seedling establishment in a masting desert shrub parallels the pattern for forest trees
NASA Astrophysics Data System (ADS)
Meyer, Susan E.; Pendleton, Burton K.
2015-05-01
The masting phenomenon along with its accompanying suite of seedling adaptive traits has been well studied in forest trees but has rarely been examined in desert shrubs. Blackbrush (Coleogyne ramosissima) is a regionally dominant North American desert shrub whose seeds are produced in mast events and scatter-hoarded by rodents. We followed the fate of seedlings in intact stands vs. small-scale disturbances at four contrasting sites for nine growing seasons following emergence after a mast year. The primary cause of first-year mortality was post-emergence cache excavation and seedling predation, with contrasting impacts at sites with different heteromyid rodent seed predators. Long-term establishment patterns were strongly affected by rodent activity in the weeks following emergence. Survivorship curves generally showed decreased mortality risk with age but differed among sites even after the first year. There were no detectable effects of inter-annual precipitation variability or site climatic differences on survival. Intraspecific competition from conspecific adults had strong impacts on survival and growth, both of which were higher on small-scale disturbances, but similar in openings and under shrub crowns in intact stands. This suggests that adult plants preempted soil resources in the interspaces. Aside from effects on seedling predation, there was little evidence for facilitation or interference beneath adult plant crowns. Plants in intact stands were still small and clearly juvenile after nine years, showing that blackbrush forms cohorts of suppressed plants similar to the seedling banks of closed forests. Seedling banks function in the absence of a persistent seed bank in replacement after adult plant death (gap formation), which is temporally uncoupled from masting and associated recruitment events. This study demonstrates that the seedling establishment syndrome associated with masting has evolved in desert shrublands as well as in forests.
F100(3) parallel compressor computer code and user's manual
NASA Technical Reports Server (NTRS)
Mazzawy, R. S.; Fulkerson, D. A.; Haddad, D. E.; Clark, T. A.
1978-01-01
The Pratt & Whitney Aircraft multiple segment parallel compressor model has been modified to include the influence of variable compressor vane geometry on the sensitivity to circumferential flow distortion. Further, performance characteristics of the F100 (3) compression system have been incorporated into the model on a blade row basis. In this modified form, the distortion's circumferential location is referenced relative to the variable vane controlling sensors of the F100 (3) engine so that the proper solution can be obtained regardless of distortion orientation. This feature is particularly important for the analysis of inlet temperature distortion. Compatibility with fixed geometry compressor applications has been maintained in the model.
NASA Astrophysics Data System (ADS)
Lu, C.; Lichtner, P. C.; Tsimpanogiannis, I. N.
2005-12-01
Uncontrolled release of CO2 to the atmosphere has been identified as a major contributing source to the global warming problem. Significant research efforts from the international scientific community are targeted towards stabilization/reduction of CO2 concentrations in the atmosphere while attempting to satisfy our continuously increasing needs for energy. CO2 sequestration (capture, separation, and long term storage) in various media (e.g. geologic such as depleted oil reservoirs, saline aquifers, etc.; oceanic at different depths) has been considered as a possible solution to reduce greenhouse gas emissions. In this study we utilize the PFLOTRAN simulator to investigate geologic sequestration of CO2. PFLOTRAN is a massively parallel 3-D reservoir simulator for modeling supercritical CO2 sequestration in geologic formations based on continuum scale mass and energy conservations. The mass and energy equations are sequentially coupled to reactive transport equations describing multi-component chemical reactions within the formation including aqueous speciation, and precipitation and dissolution of minerals to describe aqueous and mineral CO2 sequestration. The effect of the injected CO2 on pH, CO2 concentration within the aqueous phase, mineral stability, and other factors can be evaluated with this model. Parallelization is carried out using the PETSc parallel library package based on MPI, providing a high parallel efficiency and allowing simulations with several tens of millions of degrees of freedom to be carried out, which is ideal for large-scale field applications involving multi-component chemistry. In this work, our main focus is a parametric examination of the effects of reservoir and fluid properties on the sequestration process, such as permeability and capillary pressure functions (e.g. linear, van Genuchten, etc.), diffusion coefficients in a multiphase system, the sensitivity of component solubility on pressure, temperature and mole fractions etc. Several
1991-12-01
1,k) 2p(i+’/2,j+ /,k) 1+++ At •1(22) B(i +l/’j +/2’k)58 1+ 0.m(i +,j +/2,k) 2p(i+ 1A ,j+1/,k) ,[ ZEn(i+/2J+1’k)-E(i ’ ij k] where 5 is the lattice...1991. 19. Tipler , Paul A. Physics. New York: Worth Publishers Inc., 1976. 20. Work, Paul Rt and Gary B. Lamont. "Efficient Parallelization of Serial
Li, Shengtai; Li, Hui
2012-06-14
the position of the planet, we adopt the corotating frame that allows the planet to move only in the radial direction if only one planet is present. This code has been extensively tested on a number of problems. For the Earth-mass planet with constant aspect ratio h = 0.05, the torque calculated using our code matches quite well with the 3D linear theory results by Tanaka et al. (2002). The code is fully parallelized via message-passing interface (MPI) and has very high parallel efficiency. Several numerical examples for both fixed planet and moving planet are provided to demonstrate the efficacy of the numerical method and code.
Parallel code NSBC: Simulations of relativistic nuclei scattering by a bent crystal
NASA Astrophysics Data System (ADS)
Babaev, A. A.
2014-01-01
The presented program was designed to simulate the passage of relativistic nuclei through a bent crystal. Namely, the input data is related to a nuclei beam. The nuclei move into the crystal under planar channeling and quasichanneling conditions. The program realizes the numerical algorithm to evaluate the trajectory of a nucleus in the bent crystal. The program output is formed by the projectile motion data including the angular distribution of nuclei behind the crystal. The program could be useful for simulating particle tracking at accelerator facilities that use crystal collimation systems. The code is written in C++ and designed for multiprocessor systems (clusters).
NASA Astrophysics Data System (ADS)
Hegde, Ganapathi; Vaya, Pukhraj
2013-10-01
This article presents a parallel architecture for 3-D discrete wavelet transform (3-DDWT). The proposed design is based on the 1-D pipelined lifting scheme. The architecture is fully scalable beyond the present coherent Daubechies filter bank (9, 7). This 3-DDWT architecture has advantages such as no group of pictures restriction and reduced memory referencing. It offers low power consumption, low latency and high throughput. The computing technique is based on the concept that lifting scheme minimises the storage requirement. The application specific integrated circuit implementation of the proposed architecture is done by synthesising it using 65 nm Taiwan Semiconductor Manufacturing Company standard cell library. It offers a speed of 486 MHz with a power consumption of 2.56 mW. This architecture is suitable for real-time video compression even with large frame dimensions.
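The article above describes a hardware architecture; the sketch below is only a software illustration of the 1-D lifting scheme for the Daubechies (9, 7) filter bank that such designs build on. It uses the commonly quoted lifting coefficients, periodic boundary extension and one particular scaling convention, all of which are assumptions for this example rather than details taken from the article. A 3-D transform would apply the same 1-D step along rows, columns and frames in turn.

```python
# Commonly quoted lifting coefficients for the (9, 7) filter bank.
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.05298011854, 0.8829110762, 0.4435068522
K = 1.149604398  # scaling factor; conventions for applying it vary


def _predict(s, d, c):
    """Update odd samples d from their even neighbours s (periodic wrap)."""
    n = len(d)
    return [d[i] + c * (s[i] + s[(i + 1) % n]) for i in range(n)]


def _update(s, d, c):
    """Update even samples s from their odd neighbours d (d[-1] wraps around)."""
    return [s[i] + c * (d[i - 1] + d[i]) for i in range(len(s))]


def forward_97(x):
    """One level of the 1-D transform; len(x) must be even. Returns (approx, detail)."""
    s, d = list(x[0::2]), list(x[1::2])
    d = _predict(s, d, ALPHA)
    s = _update(s, d, BETA)
    d = _predict(s, d, GAMMA)
    s = _update(s, d, DELTA)
    return [v * K for v in s], [v / K for v in d]


def inverse_97(s, d):
    """Undo forward_97 by running the lifting steps backwards with flipped signs."""
    s = [v / K for v in s]
    d = [v * K for v in d]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2], x[1::2] = s, d
    return x


signal = [1.0, 4.0, -2.0, 3.5, 0.0, 7.0, 2.0, 2.0]
approx, detail = forward_97(signal)
restored = inverse_97(approx, detail)
```

Each lifting step overwrites half of the samples using only the other half, so the transform can be computed in place. This is the storage saving that the abstract attributes to the lifting scheme.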
BMI optimization by using parallel UNDX real-coded genetic algorithm with Beowulf cluster
NASA Astrophysics Data System (ADS)
Handa, Masaya; Kawanishi, Michihiro; Kanki, Hiroshi
2007-12-01
This paper deals with the global optimization algorithm of the Bilinear Matrix Inequalities (BMIs) based on the Unimodal Normal Distribution Crossover (UNDX) GA. First, analyzing the structure of the BMIs, the existence of the typical difficult structures is confirmed. Then, in order to improve the performance of algorithm, based on results of the problem structures analysis and consideration of BMIs characteristic properties, we proposed the algorithm using primary search direction with relaxed Linear Matrix Inequality (LMI) convex estimation. Moreover, in these algorithms, we propose two types of evaluation methods for GA individuals based on LMI calculation considering BMI characteristic properties more. In addition, in order to reduce computational time, we proposed parallelization of RCGA algorithm, Master-Worker paradigm with cluster computing technique.
PORTA: A Massively Parallel Code for 3D Non-LTE Polarized Radiative Transfer
NASA Astrophysics Data System (ADS)
Štěpán, J.
2014-10-01
The interpretation of the Stokes profiles of the solar (stellar) spectral line radiation requires solving a non-LTE radiative transfer problem that can be very complex, especially when the main interest lies in modeling the linear polarization signals produced by scattering processes and their modification by the Hanle effect. One of the main difficulties is due to the fact that the plasma of a stellar atmosphere can be highly inhomogeneous and dynamic, which implies the need to solve the non-equilibrium problem of generation and transfer of polarized radiation in realistic three-dimensional stellar atmospheric models. Here we present PORTA, a computer program we have developed for solving, in three-dimensional (3D) models of stellar atmospheres, the problem of the generation and transfer of spectral line polarization taking into account anisotropic radiation pumping and the Hanle and Zeeman effects in multilevel atoms. The numerical method of solution is based on a highly convergent iterative algorithm, whose convergence rate is insensitive to the grid size, and on an accurate short-characteristics formal solver of the Stokes-vector transfer equation which uses monotonic Bezier interpolation. In addition to the iterative method and the 3D formal solver, another important feature of PORTA is a novel parallelization strategy suitable for taking advantage of massively parallel computers. Linear scaling of the solution with the number of processors allows the solution time to be reduced by several orders of magnitude. We present useful benchmarks and a few illustrations of applications using a 3D model of the solar chromosphere resulting from MHD simulations. Finally, we present our conclusions with a view to future research. For more details see Štěpán & Trujillo Bueno (2013).
Robust conjunctive item-place coding by hippocampal neurons parallels learning what happens where
Komorowski, Robert W.; Manns, Joseph R.; Eichenbaum, Howard
2009-01-01
Previous research indicates a critical role of the hippocampus in memory for events in the context in which they occur. However, studies to date have not provided compelling evidence that hippocampal neurons encode event-context conjunctions directly associated with this kind of learning. Here we report that, as animals learn different meanings for items in distinct contexts, individual hippocampal neurons develop responses to specific stimuli in the places where they have differential significance. Furthermore, this conjunctive coding evolves in the form of enhanced item-specific responses within a subset of the pre-existing spatial representation. These findings support the view that conjunctive representations in the hippocampus underlie the acquisition of context specific memories. PMID:19657042
NASA Astrophysics Data System (ADS)
Epstein, Henri
2016-11-01
An algebraic formalism, developed with V. Glaser and R. Stora for the study of the generalized retarded functions of quantum field theory, is used to prove a factorization theorem which provides a complete description of the generalized retarded functions associated with any tree graph. Integrating over the variables associated to internal vertices to obtain the perturbative generalized retarded functions for interacting fields arising from such graphs is shown to be possible for a large category of space-times.
Modeling RF Fields in Hot Plasmas with Parallel Full Wave Code
NASA Astrophysics Data System (ADS)
Spencer, Andrew; Svidzinski, Vladimir; Zhao, Liangji; Galkin, Sergei; Kim, Jin-Soo
2016-10-01
FAR-TECH, Inc. is developing a suite of full wave RF plasma codes. It is based on a meshless formulation in configuration space with adapted cloud of computational points (CCP) capability and using the hot plasma conductivity kernel to model the nonlocal plasma dielectric response. The conductivity kernel is calculated by numerically integrating the linearized Vlasov equation along unperturbed particle trajectories. Work has been done on the following calculations: 1) the conductivity kernel in hot plasmas, 2) a monitor function based on analytic solutions of the cold-plasma dispersion relation, 3) an adaptive CCP based on the monitor function, 4) stencils to approximate the wave equations on the CCP, 5) the solution to the full wave equations in the cold-plasma model in tokamak geometry for ECRH and ICRH range of frequencies, and 6) the solution to the wave equations using the calculated hot plasma conductivity kernel. We will present results on using a meshless formulation on adaptive CCP to solve the wave equations and on implementing the non-local hot plasma dielectric response to the wave equations. The presentation will include numerical results of wave propagation and absorption in the cold and hot tokamak plasma RF models, using DIII-D geometry and plasma parameters. Work is supported by the U.S. DOE SBIR program.
3-D Parallel, Object-Oriented, Hybrid, PIC Code for Ion Ring Studies
NASA Astrophysics Data System (ADS)
Omelchenko, Y. A.
1997-08-01
The 3-D hybrid, Particle-in-Cell (PIC) code, FLAME has been developed to study low-frequency, large orbit plasmas in realistic cylindrical configurations. FLAME assumes plasma quasineutrality and solves the Maxwell equations with displacement current neglected. The electron component is modeled as a massless fluid and all ion components are represented by discrete macro-particles. The poloidal discretization is done by a finite-difference staggered grid method. FFT is applied in the azimuthal direction. A substantial reduction of CPU time is achieved by enabling separate time advances of background and beam particle species in the time-averaged fields. The FLAME structure follows the guidelines of object-oriented programming. Its C++ class hierarchy comprises the Utility, Geometry, Particle, Grid and Distributed base class packages. The latter encapsulates implementation of concurrent grid and particle algorithms. The particle and grid data interprocessor communications are unified and designed to be independent of both the underlying message-passing library and the actual poloidal domain decomposition technique (FFT's are local). Load balancing concerns are addressed by using adaptive domain partitions to account for nonuniform spatial distributions of particle objects. The results of 2-D and 3-D FLAME simulations in support of the FIREX program at Cornell are presented.
Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures
NASA Astrophysics Data System (ADS)
Olson, Richard F.
2013-05-01
Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
Implementation of a tree algorithm in MCNP code for nuclear well logging applications.
Li, Fusheng; Han, Xiaogang
2012-07-01
The goal of this paper is to develop modeling capabilities that are missing in the current MCNP code. These capabilities can greatly help in the design of certain nuclear tools, such as a nuclear lithology/mineralogy spectroscopy tool. The new capabilities developed in this paper include the following: zone tally, neutron interaction tally, gamma rays index tally and enhanced pulse-height tally. The patched MCNP code can also be used to compute neutron slowing-down length and thermal neutron diffusion length.
NASA Astrophysics Data System (ADS)
Shi, Fei; Wang, Beibei; Selesnick, Ivan W.; Wang, Yao
2006-01-01
This paper introduces an anisotropic decomposition structure of a recently introduced 3-D dual-tree discrete wavelet transform (DDWT), and explores the applications for video denoising and coding. The 3-D DDWT is an attractive video representation because it isolates motion along different directions in separate subbands, and thus leads to sparse video decompositions. Our previous investigation shows that the 3-D DDWT, compared to the standard discrete wavelet transform (DWT), complies better with the statistical models based on sparse presumptions, and gives better visual and numerical results when used for statistical denoising algorithms. Our research on video compression also shows that even with 4:1 redundancy, the 3-D DDWT needs fewer coefficients to achieve the same coding quality (in PSNR) by applying the iterative projection-based noise shaping scheme proposed by Kingsbury. The proposed anisotropic DDWT extends the superiority of isotropic DDWT with more directional subbands without adding to the redundancy. Unlike the original 3-D DDWT which applies dyadic decomposition along all three directions and produces isotropic frequency spacing, it has a non-uniform tiling of the frequency space. By applying this structure, we can improve the denoising results, and the number of significant coefficients can be reduced further, which is beneficial for video coding.
NASA Astrophysics Data System (ADS)
Kum, Oyeon; Han, Youngyih; Jeong, Hae Sun
2012-05-01
Minimizing the differences between dose distributions calculated at the treatment planning stage and those delivered to the patient is an essential requirement for successful radiotherapy. Accurate calculation of dose distributions in the treatment planning process is important and can be done only by using a Monte Carlo calculation of particle transport. In this paper, we perform a further validation of our previously developed parallel Monte Carlo electron and photon transport (PMCEPT) code [Kum and Lee, J. Korean Phys. Soc. 47, 716 (2005) and Kim and Kum, J. Korean Phys. Soc. 49, 1640 (2006)] for applications to clinical radiation problems. A linear accelerator, Siemens' Primus 6 MV, was modeled and commissioned. A thorough validation includes both small fields, closely related to the intensity modulated radiation treatment (IMRT), and large fields. Two-dimensional comparisons with film measurements were also performed. The PMCEPT results, in general, agreed well with the measured data within a maximum error of about 2%. However, considering the experimental errors, the PMCEPT results can provide the gold standard of dose distributions for radiotherapy. The computing time was also much faster, compared to that needed for experiments, although it is still a bottleneck for direct applications to the daily routine treatment planning procedure.
Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L; Wang, Xueding; Liu, Xiaojun
2013-08-01
Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction, and display for both photoacoustic (PA) and US imaging all in a truly real-time manner. The back-projection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state of the art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously, at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.
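The back-projection step mentioned in the abstract above can be illustrated with a plain delay-and-sum sketch. This is only a minimal CPU illustration in Python, not the authors' optimized GPU code, and delay-and-sum may differ from the variant they use. The function name `delay_and_sum`, the sound speed, the sampling rate, the sensor layout, and the pixel grid are all assumptions chosen for the example.

```python
import math

def delay_and_sum(signals, sensors, pixels, speed, fs):
    """For each pixel, sum every sensor trace at the sample index
    matching the acoustic time of flight from that pixel."""
    image = []
    for px, py in pixels:
        acc = 0.0
        for (sx, sy), trace in zip(sensors, signals):
            idx = round(math.hypot(px - sx, py - sy) / speed * fs)
            if 0 <= idx < len(trace):
                acc += trace[idx]
        image.append(acc)
    return image

# Hypothetical setup: 11-element linear array, 5 x 5 pixel grid (metres).
speed = 1500.0   # assumed sound speed in m/s
fs = 20e6        # assumed sampling rate in Hz
sensors = [(-0.01 + 0.002 * i, 0.0) for i in range(11)]
pixels = [(-0.004 + 0.002 * i, 0.011 + 0.002 * j)
          for i in range(5) for j in range(5)]
source = pixels[17]  # place a synthetic point absorber on a grid node

# Synthetic data: each sensor records a unit spike at the arrival time.
n_samples = 800
signals = []
for sx, sy in sensors:
    trace = [0.0] * n_samples
    arrival = round(math.hypot(source[0] - sx, source[1] - sy) / speed * fs)
    trace[arrival] = 1.0
    signals.append(trace)

image = delay_and_sum(signals, sensors, pixels, speed, fs)
brightest = pixels[image.index(max(image))]
print(brightest == source)
```

Each pixel is computed independently of the others, which is the property that makes this kind of reconstruction a natural fit for one-thread-per-pixel GPU execution.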
NASA Astrophysics Data System (ADS)
Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L.; Wang, Xueding; Liu, Xiaojun
2014-03-01
Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction and display for both photoacoustic (PA) and US imaging all in a truly real-time manner. The backprojection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state of the art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously, at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.
NASA Astrophysics Data System (ADS)
Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L.; Wang, Xueding; Liu, Xiaojun
2013-08-01
Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction, and display for both photoacoustic (PA) and US imaging all in a truly real-time manner. The back-projection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state of the art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously, at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.
Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton
2014-11-11
We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
Simulation of Ionospheric E-Region Plasma Turbulence with a Massively Parallel Hybrid PIC/Fluid Code
NASA Astrophysics Data System (ADS)
Young, M.; Oppenheim, M. M.; Dimant, Y. S.
2015-12-01
The Farley-Buneman (FB) and gradient drift (GD) instabilities are plasma instabilities that occur at roughly 100 km in the equatorial E-region ionosphere. They develop when ion-neutral collisions dominate ion motion while electron motion is affected by both electron-neutral collisions and the background magnetic field. GD waves grow when the background density gradient and electric field are aligned; FB waves grow when the background electric field causes electrons to E × B drift with a speed slightly larger than the ion acoustic speed. Theory predicts that FB and GD turbulence should develop in the same plasma volume when GD waves create a perturbation electric field that exceeds the threshold value for FB turbulence. However, ionospheric radars, which regularly observe meter-scale irregularities associated with FB turbulence, must infer kilometer-scale GD dynamics rather than observe them directly. Numerical simulations have been unable to simultaneously resolve GD and FB structure. We present results from a parallelized hybrid simulation that uses a particle-in-cell (PIC) method for ions while modeling electrons as an inertialess, quasi-neutral fluid. This approach allows us to reach length scales of hundreds of meters to kilometers with sub-meter resolution, but requires solving a large linear system derived from an elliptic PDE that depends on plasma density, ion flux, and electron parameters. We solve the resultant linear system at each time step via the Portable, Extensible Toolkit for Scientific Computation (PETSc). We compare results of simulated FB turbulence from this model to results from a thoroughly tested PIC code and describe progress toward the first simultaneous simulations of FB and GD instabilities. This model has immediate applications to radar observations of the E-region ionosphere, as well as potential applications to the F-region ionosphere and the chromosphere of the Sun.
A three-dimensional Cartesian tree-code and applications to vortex sheet roll-up
NASA Astrophysics Data System (ADS)
Lindsay, Keith Thomas
An algorithm is presented for the rapid computation of vortex sheet motion in three-dimensional fluid flow. The equations governing vortex sheet motion, considered in Lagrangian form, are desingularized and discretized, resulting in a system of equations for the N discretizing particles. Since the particles interact pairwise, evaluating the velocities by direct summation requires O(N^2) operations, which becomes prohibitively expensive as N increases. Based on measured execution times, the new algorithm computes the particle interactions with O(N log N) operations. The additional memory required by the algorithm is less than 60% of the memory used by a direct summation algorithm. The algorithm extends Draghicescu's algorithm from two to three space dimensions. The main ingredients are the replacement of particle-particle interactions with particle-cluster interactions which are based on Cartesian Taylor series expansions and the use of an adaptive tree-based subdivision of space to create the particle clusters. An important feature of the algorithm is the use of recurrences to compute the expansion coefficients. The recurrences are a generalization of those used by Draghicescu. The new features of the algorithm are its application to a non-harmonic three-dimensional kernel, its adaptive subdivision of space and adaptive error control. The algorithm is used to study the dynamics of vortex rings which are modeled as rolling up vortex sheets. An adaptive point insertion algorithm is used to ensure that the vortex sheets are accurately resolved as they stretch. The problems considered are azimuthal vortex ring instabilities, the evolution of an elliptical vortex ring, and the collision of two vortex rings. In the last problem, the vorticity in the rings appears to connect, due to superposition, even though the vortex sheet model does not explicitly account for viscous effects and the sheets themselves do not connect.
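The replacement of particle-particle interactions by particle-cluster interactions, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis algorithm: it uses a scalar desingularized kernel 1/sqrt(r^2 + delta^2) as a stand-in for the vortex sheet kernel, keeps only the zeroth-order term of the Taylor expansion about the weighted cluster centre, and omits the tree and the recurrences entirely. The names `direct_sum` and `cluster_approx` are hypothetical.

```python
import math

def direct_sum(target, sources, weights, delta):
    """Particle-particle evaluation: one kernel call per source particle."""
    total = 0.0
    for (sx, sy, sz), w in zip(sources, weights):
        r2 = ((target[0] - sx) ** 2 + (target[1] - sy) ** 2
              + (target[2] - sz) ** 2)
        total += w / math.sqrt(r2 + delta ** 2)
    return total

def cluster_approx(target, sources, weights, delta):
    """Particle-cluster evaluation: zeroth-order Taylor term only,
    expanded about the weighted centre of the cluster."""
    wsum = sum(weights)
    centre = tuple(sum(w * s[k] for s, w in zip(sources, weights)) / wsum
                   for k in range(3))
    r2 = sum((target[k] - centre[k]) ** 2 for k in range(3))
    return wsum / math.sqrt(r2 + delta ** 2)

# Hypothetical data: a small 4 x 4 x 4 cluster and a well-separated target.
sources = [(0.1 * i, 0.1 * j, 0.1 * k)
           for i in range(4) for j in range(4) for k in range(4)]
weights = [1.0] * len(sources)
target = (20.0, 0.0, 0.0)
delta = 0.05  # assumed smoothing parameter

exact = direct_sum(target, sources, weights, delta)
approx = cluster_approx(target, sources, weights, delta)
rel_err = abs(exact - approx) / abs(exact)
print(rel_err)
```

For a target far from the cluster, a single kernel evaluation stands in for 64 of them; a tree code applies this substitution recursively, accepting a cluster only when an error criterion is met and otherwise descending to its children.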
NASA Astrophysics Data System (ADS)
Socas-Navarro, H.; de la Cruz Rodríguez, J.; Asensio Ramos, A.; Trujillo Bueno, J.; Ruiz Cobo, B.
2015-05-01
With the advent of a new generation of solar telescopes and instrumentation, interpreting chromospheric observations (in particular, spectropolarimetry) requires new, suitable diagnostic tools. This paper describes a new code, NICOLE, that has been designed for Stokes non-LTE radiative transfer, for synthesis and inversion of spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights from the photosphere to the chromosphere. The code features a number of unique features and capabilities and has been built from scratch with a powerful parallelization scheme that makes it suitable for application on massive datasets using large supercomputers. The source code is written entirely in Fortran 90/2003 and complies strictly with the ANSI standards to ensure maximum compatibility and portability. It is being publicly released, with the idea of facilitating future branching by other groups to augment its capabilities. The source code is currently hosted at the following repository: https://github.com/hsocasnavarro/NICOLE
NASA Astrophysics Data System (ADS)
Pereira, Tiago M. D.; Uitenbroek, Han
2015-02-01
The emergence of three-dimensional magneto-hydrodynamic simulations of stellar atmospheres has sparked a need for efficient radiative transfer codes to calculate detailed synthetic spectra. We present RH 1.5D, a massively parallel code based on the RH code and capable of performing Zeeman polarised multi-level non-local thermodynamical equilibrium calculations with partial frequency redistribution for an arbitrary amount of chemical species. The code calculates spectra from 3D, 2D or 1D atmospheric models on a column-by-column basis (or 1.5D). While the 1.5D approximation breaks down in the cores of very strong lines in an inhomogeneous environment, it is nevertheless suitable for a large range of scenarios and allows for faster convergence with finer control over the iteration of each simulation column. The code scales well to at least tens of thousands of CPU cores, and is publicly available. In the present work we briefly describe its inner workings, strategies for convergence optimisation, its parallelism, and some possible applications.
Kirk, B.L.; Sartori, E.
1997-06-01
Subsequent to the introduction of High Performance Computing in the developed countries, the Organization for Economic Cooperation and Development/Nuclear Energy Agency (OECD/NEA) created the Task Force on Adapting Computer Codes in Nuclear Applications to Parallel Architectures (under the guidance of the Nuclear Science Committee's Working Party on Advanced Computing) to study the growth area in supercomputing and its applicability to the nuclear community's computer codes. The result has been four years of investigation for the Task Force in different subject fields - deterministic and Monte Carlo radiation transport, computational mechanics and fluid dynamics, nuclear safety, atmospheric models and waste management.
Salko, Robert K; Schmidt, Rodney; Avramova, Maria N
2014-01-01
This paper describes major improvements to the computational infrastructure of the CTF sub-channel code so that full-core sub-channel-resolved simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department of Energy (DOE) Consortium for Advanced Simulation of Light Water Reactors (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis. A set of serial code optimizations--including fixing computational inefficiencies, optimizing the numerical approach, and making smarter data storage choices--are first described and shown to reduce both execution time and memory usage by about a factor of ten. Next, a Single Program Multiple Data (SPMD) parallelization strategy targeting distributed memory Multiple Instruction Multiple Data (MIMD) platforms and utilizing domain-decomposition is presented. In this approach, data communication between processors is accomplished by inserting standard MPI calls at strategic points in the code. The domain decomposition approach implemented assigns one MPI process to each fuel assembly, with each domain being represented by its own CTF input file. The creation of CTF input files, both for serial and parallel runs, is also fully automated through use of a pre-processor utility that takes a greatly reduced set of user input over the traditional CTF input file. To run CTF in parallel, two additional libraries are currently needed: MPI, for inter-processor message passing, and the Portable, Extensible Toolkit for Scientific Computation (PETSc), which is leveraged to solve the global pressure matrix in parallel. Results presented include a set of testing and verification calculations and performance tests assessing parallel scaling characteristics up to a full core, sub-channel-resolved model of Watts Bar Unit 1 under hot full-power conditions (193 17x17
NASA Technical Reports Server (NTRS)
Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.
1995-01-01
A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech, a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64^3 particles per processing node (approximately 134 million particles on 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--on other parallel machines once the machine-dependent parameters are known.
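The figures quoted in the abstract above can be checked with a few lines of arithmetic. The particle count follows directly from the stated numbers; the efficiency formula is an assumption introduced here (efficiency taken as computation time divided by computation plus communication time) and is not necessarily the definition used by the authors. The name `parallel_efficiency` is hypothetical.

```python
PARTICLES_PER_NODE = 64 ** 3   # stated in the abstract
NODES = 512                    # stated in the abstract

total_particles = PARTICLES_PER_NODE * NODES
print(total_particles)  # 134217728, i.e. approximately 134 million

def parallel_efficiency(comm_to_comp_ratio):
    """Assumed model: T_comp / (T_comp + T_comm), written in terms of
    the communication-to-computation time ratio."""
    return 1.0 / (1.0 + comm_to_comp_ratio)

# The two ends of the quoted 0.3%-10.0% range.
best_case = parallel_efficiency(0.003)
worst_case = parallel_efficiency(0.10)
print(best_case, worst_case)
```

Under this assumed model the 0.3% end of the range gives an efficiency above 99%, in line with the highest efficiency the abstract reports.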
NASA Astrophysics Data System (ADS)
Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi
2017-07-01
We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method in local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for absorbing boundary condition. A hybrid-style programming using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability, allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC), are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.
Fast Coding Unit Encoding Mechanism for Low Complexity Video Coding.
Gao, Yuan; Liu, Pengyu; Wu, Yueying; Jia, Kebin; Gao, Guandong
2016-01-01
In high efficiency video coding (HEVC), the coding tree contributes to excellent compression performance. However, the coding tree brings extremely high computational complexity. This paper presents innovative work on improving the coding tree to further reduce encoding time. A novel low complexity coding tree mechanism is proposed for HEVC fast coding unit (CU) encoding. Firstly, this paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP) and content change (CC). Secondly, a CU coding tree probability model is proposed for modeling and predicting CU distribution. Finally, a CU coding tree probability update is proposed, aiming to address probabilistic model distortion problems caused by CC. Experimental results show that the proposed low complexity CU coding tree mechanism significantly reduces encoding time by 27% for lossy coding and 42% for visually lossless coding and lossless coding. The proposed low complexity CU coding tree mechanism is aimed at improving coding performance under various application conditions.
Fast Coding Unit Encoding Mechanism for Low Complexity Video Coding
Wu, Yueying; Jia, Kebin; Gao, Guandong
2016-01-01
In high efficiency video coding (HEVC), the coding tree contributes to excellent compression performance. However, the coding tree brings extremely high computational complexity. This paper presents innovative work on improving the coding tree to further reduce encoding time. A novel low complexity coding tree mechanism is proposed for HEVC fast coding unit (CU) encoding. Firstly, this paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP) and content change (CC). Secondly, a CU coding tree probability model is proposed for modeling and predicting CU distribution. Finally, a CU coding tree probability update is proposed, aiming to address probabilistic model distortion problems caused by CC. Experimental results show that the proposed low complexity CU coding tree mechanism significantly reduces encoding time by 27% for lossy coding and 42% for visually lossless coding and lossless coding. The proposed low complexity CU coding tree mechanism is aimed at improving coding performance under various application conditions. PMID:26999741
JuNoLo - Jülich nonlocal code for parallel post-processing evaluation of vdW-DF correlation energy
NASA Astrophysics Data System (ADS)
Lazić, Predrag; Atodiresei, Nicolae; Alaei, Mojtaba; Caciuc, Vasile; Blügel, Stefan; Brako, Radovan
2010-02-01
Nowadays the state-of-the-art Density Functional Theory (DFT) codes are based on local (LDA) or semilocal (GGA) energy functionals. Recently the theory of a truly nonlocal energy functional has been developed. It has been used mostly as a post-DFT calculation approach, i.e. by applying the functional to the charge density calculated using any standard DFT code, thus obtaining a new improved value for the total energy of the system. Nonlocal calculation is computationally quite expensive and scales as N^2, where N is the number of points in which the density is defined, and a massively parallel calculation is welcome for a wider applicability of the new approach. In this article we present a code which accomplishes this goal. Program summary: Program title: JuNoLo; Catalogue identifier: AEFM_v1_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFM_v1_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 176 980; No. of bytes in distributed program, including test data, etc.: 2 126 072; Distribution format: tar.gz; Programming language: Fortran 90; Computer: any architecture with a Fortran 90 compiler; Operating system: Linux, AIX; Has the code been vectorised or parallelized?: Yes, from 1 to 65536 processors may be used; RAM: depends strongly on the problem's size; Classification: 7.3; External routines: FFTW (http://www.fftw.org/), MPI (http://www.mcs.anl.gov/research/projects/mpich2/ or http://www.lam-mpi.org/); Nature of problem: Obtaining the value of the nonlocal vdW-DF energy based on the charge density distribution obtained from some Density Functional Theory code; Solution method: Numerical calculation of the double sum is implemented in a parallel F90 code. Calculation of this sum yields the required nonlocal vdW-DF energy; Unusual features: Binds to virtually any DFT
Parallel algorithm development
Adams, T.F.
1996-06-01
Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 or with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.
Hoover, C G; DeGroot, A J; Sherwood, R J
2000-06-01
ParaDyn is a parallel version of the DYNA3D computer program, a three-dimensional explicit finite-element program for analyzing the dynamic response of solids and structures. The ParaDyn program has been used as a production tool for over three years for analyzing problems which range in size from a few tens of thousands of elements to between one-million and ten-million elements. ParaDyn runs on parallel computers provided by the Department of Energy Accelerated Strategic Computing Initiative (ASCI) and the Department of Defense High Performance Computing and Modernization Program. Preprocessing and post-processing software utilities and tools are designed to facilitate the generation of partitioned domains for processors on a massively parallel computer and the visualization of both resultant data and boundary data generated in a parallel simulation. This manual provides a brief overview of the parallel implementation; describes techniques for running the ParaDyn program, tools and utilities; and provides examples of parallel simulations.
Greiver, Michelle; Wintemute, Kimberly; Aliarzadeh, Babak; Martin, Ken; Khan, Shahriar; Jackson, Dave; Leggett, Jannet; Lambert-Lanning, Anita; Siu, Maggie
2016-10-12
Consistent and standardized coding for chronic conditions is associated with better care; however, coding may currently be limited in electronic medical records (EMRs) used in Canadian primary care. Objectives: To implement data management activities in a community-based primary care organisation and to evaluate the effects on coding for chronic conditions. Fifty-nine family physicians in Toronto, Ontario, belonging to a single primary care organisation, participated in the study. The organisation implemented a central analytical data repository containing their EMR data extracted, cleaned, standardized and returned by the Canadian Primary Care Sentinel Surveillance Network (CPCSSN), a large validated primary care EMR-based database. They used reporting software provided by CPCSSN to identify selected chronic conditions and standardized codes were then added back to the EMR. We studied four chronic conditions (diabetes, hypertension, chronic obstructive pulmonary disease and dementia). We compared changes in coding over six months for physicians in the organisation with changes for 315 primary care physicians participating in CPCSSN across Canada. Chronic disease coding within the organisation increased significantly more than in other primary care sites. The adjusted difference in the increase of coding was 7.7% (95% confidence interval 7.1%-8.2%, p < 0.01). The use of standard codes, consisting of the most common diagnostic codes for each condition in the CPCSSN database, increased by 8.9% more (95% CI 8.3%-9.5%, p < 0.01). Data management activities were associated with an increase in standardized coding for chronic conditions. Exploring requirements to scale and spread this approach in Canadian primary care organisations may be worthwhile.
NASA Astrophysics Data System (ADS)
Gassmöller, Rene; Bangerth, Wolfgang
2016-04-01
Particle-in-cell methods have a long history and many applications in geodynamic modelling of mantle convection, lithospheric deformation and crustal dynamics. They are primarily used to track material information, the strain a material has undergone, the pressure-temperature history a certain material region has experienced, or the amount of volatiles or partial melt present in a region. However, their efficient parallel implementation - in particular combined with adaptive finite-element meshes - is complicated due to the complex communication patterns and frequent reassignment of particles to cells. Consequently, many current scientific software packages accomplish this efficient implementation by specifically designing particle methods for a single purpose, like the advection of scalar material properties that do not evolve over time (e.g., for chemical heterogeneities). Design choices for particle integration, data storage, and parallel communication are then optimized for this single purpose, making the code relatively rigid to changing requirements. Here, we present the implementation of a flexible, scalable and efficient particle-in-cell method for massively parallel finite-element codes with adaptively changing meshes. Using a modular plugin structure, we allow maximum flexibility of the generation of particles, the carried tracer properties, the advection and output algorithms, and the projection of properties to the finite-element mesh. We present scaling tests ranging up to tens of thousands of cores and tens of billions of particles. Additionally, we discuss efficient load-balancing strategies for particles in adaptive meshes with their strengths and weaknesses, local particle-transfer between parallel subdomains utilizing existing communication patterns from the finite element mesh, and the use of established parallel output algorithms like the HDF5 library. Finally, we show some relevant particle application cases, compare our implementation to a
Li, Yong Gang; Yang, Yang; Short, Michael P.; Ding, Ze Jun; Zeng, Zhi; Li, Ju
2015-01-01
SRIM-like codes have limitations in describing general 3D geometries, for modeling radiation displacements and damage in nanostructured materials. A universal, computationally efficient and massively parallel 3D Monte Carlo code, IM3D, has been developed with excellent parallel scaling performance. IM3D is based on fast indexing of scattering integrals and the SRIM stopping power database, and allows the user a choice of Constructive Solid Geometry (CSG) or Finite Element Triangle Mesh (FETM) method for constructing 3D shapes and microstructures. For 2D films and multilayers, IM3D perfectly reproduces SRIM results, and can be ∼10^2 times faster in serial execution and > 10^4 times faster using parallel computation. For 3D problems, it provides a fast approach for analyzing the spatial distributions of primary displacements and defect generation under ion irradiation. Herein we also provide a detailed discussion of our open-source collision cascade physics engine, revealing the true meaning and limitations of the “Quick Kinchin-Pease” and “Full Cascades” options. The issues of femtosecond to picosecond timescales in defining displacement versus damage, the limitation of the displacements per atom (DPA) unit in quantifying radiation damage (such as inadequacy in quantifying degree of chemical mixing), are discussed. PMID:26658477
Qiang, J.; Leitner, D.; Todd, D.S.; Ryne, R.D.
2005-03-15
The superconducting ECR ion source VENUS serves as the prototype injector ion source for the Rare Isotope Accelerator (RIA) driver linac. The RIA driver linac requires a great variety of high charge state ion beams with up to an order of magnitude higher intensity than currently achievable with conventional ECR ion sources. In order to design the beam line optics of the low energy beam line for the RIA front end for the wide parameter range required for the RIA driver accelerator, reliable simulations of the ion beam extraction from the ECR ion source through the ion mass analyzing system are essential. The RIA low energy beam transport line must be able to transport intense beams (up to 10 mA) of light and heavy ions at 30 keV. For this purpose, LBNL is developing the parallel 3D particle-in-cell code IMPACT to simulate the ion beam transport from the ECR extraction aperture through the analyzing section of the low energy transport system. IMPACT, a parallel, particle-in-cell code, is currently used to model the superconducting RF linac section of RIA and is being modified in order to simulate DC beams from the ECR ion source extraction. By using the high performance of parallel supercomputing we will be able to account consistently for the changing space charge in the extraction region and the analyzing section. A progress report and early results in the modeling of the VENUS source will be presented.
NASA Astrophysics Data System (ADS)
Qiang, J.; Leitner, D.; Todd, D. S.; Ryne, R. D.
2005-03-01
The superconducting ECR ion source VENUS serves as the prototype injector ion source for the Rare Isotope Accelerator (RIA) driver linac. The RIA driver linac requires a great variety of high charge state ion beams with up to an order of magnitude higher intensity than currently achievable with conventional ECR ion sources. In order to design the beam line optics of the low energy beam line for the RIA front end for the wide parameter range required for the RIA driver accelerator, reliable simulations of the ion beam extraction from the ECR ion source through the ion mass analyzing system are essential. The RIA low energy beam transport line must be able to transport intense beams (up to 10 mA) of light and heavy ions at 30 keV. For this purpose, LBNL is developing the parallel 3D particle-in-cell code IMPACT to simulate the ion beam transport from the ECR extraction aperture through the analyzing section of the low energy transport system. IMPACT, a parallel, particle-in-cell code, is currently used to model the superconducting RF linac section of RIA and is being modified in order to simulate DC beams from the ECR ion source extraction. By using the high performance of parallel supercomputing we will be able to account consistently for the changing space charge in the extraction region and the analyzing section. A progress report and early results in the modeling of the VENUS source will be presented.
A New Parallel N-Body Gravity Solver: TPM
NASA Astrophysics Data System (ADS)
Xu, Guohong
1995-05-01
We have developed a gravity solver based on combining the particle-mesh (PM) method and TREE methods. It is designed for and has been implemented on parallel computer architectures. The new code can deal with tens of millions of particles on current computers, with the calculation done on a parallel supercomputer or a group of workstations. Typically, the spatial resolution is enhanced by more than a factor of 20 over the pure PM code with mass resolution retained at nearly the PM level. This code runs much faster than a pure TREE code with the same number of particles and maintains almost the same resolution in high-density regions. Multiple time step integration has also been implemented with the code, with second-order time accuracy. The performance of the code has been checked in several kinds of parallel computer configurations, including IBM SP1, SGI Challenge, and a group of workstations, with the speedup of the parallel code on a 32 processor IBM SP2 supercomputer nearly linear (efficiency ≈ 80%) in the number of processors. The computation/communication ratio is also very high (~50), which means the code spends 95% of its CPU time in computation.
NASA Astrophysics Data System (ADS)
Feng, Sheng; Fang, Ye; Tam, Ka-Ming; Thakur, Bhupender; Yun, Zhifeng; Tomko, Karen; Moreno, Juana; Ramanujam, Jagannathan; Jarrell, Mark
2013-03-01
The Edwards Anderson model is a typical example of random frustrated system. It has been a long standing problem in computational physics due to its long relaxation time. Some important properties of the low temperature spin glass phase are still poorly understood after decades of study. The recent advances of GPU computing provide a new opportunity to substantially improve the simulations. We developed an MPI-CUDA hybrid code with multi-spin coding for parallel tempering Monte Carlo simulation of Edwards Anderson model. Since the system size is relatively small, and a large number of parallel replicas and Monte Carlo moves are required, the problem suits well for modern GPUs with CUDA architecture. We use the code to perform an extensive simulation on the three-dimensional Edwards Anderson model with an external field. This work is funded by the NSF EPSCoR LA-SiGMA project under award number EPS-1003897. This work is partly done on the machines of Ohio Supercomputer Center.
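The replica-exchange step that parallel tempering relies on can be sketched in a few lines. This is a minimal Python illustration, not the authors' MPI-CUDA code: it uses the standard Metropolis swap criterion, min(1, exp[(β_a − β_b)(E_a − E_b)]), and the names `swap_probability` and `attempt_swaps`, as well as the toy replica data, are hypothetical.

```python
import math
import random


def swap_probability(beta_a, beta_b, energy_a, energy_b):
    """Metropolis probability of exchanging two replicas.

    beta_* are inverse temperatures, energy_* the current replica energies.
    """
    delta = (beta_a - beta_b) * (energy_a - energy_b)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(betas, energies, states, rng=random):
    """Try to exchange each pair of neighbouring replicas once, in place."""
    accepted = 0
    for k in range(len(betas) - 1):
        p = swap_probability(betas[k], betas[k + 1],
                             energies[k], energies[k + 1])
        if rng.random() < p:
            # Swap configurations (and their energies); temperatures stay put.
            states[k], states[k + 1] = states[k + 1], states[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
            accepted += 1
    return accepted


# Toy data (hypothetical): three replicas, coldest first.
betas = [2.0, 1.0, 0.5]
energies = [-1.0, -3.0, -2.0]
states = ["A", "B", "C"]
accepted = attempt_swaps(betas, energies, states)
```

In this toy case both neighbouring swaps have probability 1, because each colder replica starts with the higher energy, so the lowest-energy configuration ends up at the lowest temperature. In a production code each replica would carry a full spin configuration and the Monte Carlo sweeps between swap attempts would dominate the cost.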
Huang, Xiao-Yan; Li, Ming-Li; Xu, Juan; Gao, Yue-Dong; Wang, Wen-Guang; Yin, An-Guo; Li, Xiao-Fei; Sun, Xiao-Mei; Xia, Xue-Shan; Dai, Jie-Jie
2013-04-01
Although the tree shrew (Tupaia belangeri chinensis) is an excellent animal model for studying the mechanisms of human diseases, few studies have examined interleukin-2 (IL-2), an important immune factor in disease model evaluation. In this study, a 465 bp full-length IL-2 cDNA coding sequence was cloned from the RNA of tree shrew spleen lymphocytes that had been cultivated and stimulated with ConA (concanavalin A). Clustal W 2.0 was used to compare and analyze the sequence and molecular characteristics, and to establish the similarity of the overall structure of IL-2 between tree shrews and other mammals. The homology of the IL-2 nucleotide sequence between tree shrews and humans was 93%, and the amino acid homology was 80%. The phylogenetic tree, derived through the Neighbour-Joining method using MEGA5.0, indicated a close genetic relationship between tree shrews, Homo sapiens, and Macaca mulatta. The three-dimensional structure analysis showed that the surface charges in most regions of tree shrew IL-2 were similar to those of human IL-2; however, the N-glycosylation sites and local structures were different, which may affect antibody binding. These results provide a fundamental basis for the future study of IL-2 monoclonal antibody in tree shrews, thereby improving their utility as a model.
Hayes, J C; Norman, M
1999-10-28
This report details an investigation into the efficacy of two approaches to solving the radiation diffusion equation within a radiation hydrodynamic simulation. Because leading-edge scientific computing platforms have evolved from large single-node vector processors to parallel aggregates containing tens to thousands of individual CPUs, the ability of an algorithm to maintain high compute efficiency when distributed over a large array of nodes is critically important. The viability of an algorithm thus hinges upon the tripartite question of numerical accuracy, total time to solution, and parallel efficiency.
NASA Astrophysics Data System (ADS)
Gedney, Stephen D.
1987-09-01
The Electromagnetic Pulse (EMP) produced by a high-altitude nuclear blast presents a severe threat to electronic systems due to its extreme characteristics. To test the vulnerability of large systems, such as airplanes, missiles, or satellites, they must be subjected to a simulated EMP environment. One type of simulator that has been used to approximate the EMP environment is the Large Parallel-Plate Bounded-Wave Simulator. It is a guided wave simulator which has the properties of a transmission line and supports a single TEM mode at sufficiently low frequencies. This type of simulator consists of finite-width parallel-plate waveguides, which are excited by a wave launcher and terminated by a wave receptor. This study addresses the field distribution within a finite-width parallel-plate waveguide that is matched to a conical tapered waveguide at either end. Characteristics of a parallel-plate bounded-wave EMP simulator were developed using scattering theory, thin-wire mesh approximation of the conducting surfaces, and the Numerical Electromagnetics Code (NEC). Background is provided for readers to use the NEC as a tool in solving thin wire scattering problems.
NASA Astrophysics Data System (ADS)
Misawa, Takeharu; Yoshida, Hiroyuki; Akimoto, Hajime
At the Japan Atomic Energy Agency (JAEA), the Innovative Water Reactor for Flexible Fuel Cycle (FLWR) is being developed. The thermal design of the FLWR requires an analytical method to predict boiling transition. JAEA has been developing the three-dimensional two-fluid model analysis code ACE-3D, which adopts a boundary-fitted coordinate system to simulate flow in channels of complex shape. This paper describes, as part of the development of ACE-3D for rod bundle analysis, the introduction of parallelization into ACE-3D and assessments of the code. In the analysis of a large-scale domain such as a rod bundle, even a two-fluid model has a large computational cost, and the memory required exceeds the upper limit of a single CPU. Parallelization was therefore introduced to distribute the data for a large-scale domain among a large number of CPUs, and it was confirmed that such an analysis can be performed by parallel computation while maintaining parallel performance even with a large number of CPUs. ACE-3D adopts two-phase flow models, some of which depend on channel geometry. Analyses were therefore performed in domains that simulate an individual subchannel and a 37-rod bundle, and the results were compared with experiments. The results of both analyses agree qualitatively with past experimental results.
NASA Astrophysics Data System (ADS)
Sijoy, C. D.; Chaturvedi, S.
2016-06-01
Higher-order cell-centered multi-material hydrodynamics (HD) and parallel node-centered radiation transport (RT) schemes are combined self-consistently in the three-temperature (3T) radiation hydrodynamics (RHD) code TRHD (Sijoy and Chaturvedi, 2015), developed for the simulation of RHD driven by intense thermal radiation or high-power lasers. For RT, a node-centered gray model implemented in the popular RHD code MULTI2D (Ramis et al., 2009) is used. This scheme, in principle, can handle RT in both optically thick and thin materials. The RT module has been parallelized using the message passing interface (MPI). Presently, for multi-material HD, we have used a simple and robust closure model in which a common strain rate is assumed for all materials in a mixed cell. The closure model has been further generalized to allow different temperatures for the electrons and ions. In addition, the electron and radiation temperatures are assumed to be in non-equilibrium. Therefore, the thermal relaxation between the electrons and ions and the coupling between the radiation and matter energies must be computed self-consistently. This has been achieved by using a node-centered symmetric-semi-implicit (SSI) integration scheme. The electron thermal conduction is calculated using a cell-centered, monotonic, non-linear finite volume scheme (NLFV) suitable for unstructured meshes. In this paper, we describe the details of the 2D, 3T, non-equilibrium, multi-material RHD code, with special attention to the coupling of the various cell-centered and node-centered formulations, along with a suite of validation test problems that demonstrate the accuracy and performance of the algorithms. We also report the parallel performance of the RT module. Finally, in order to demonstrate the full capability of the code implementation, we present the simulation of laser driven shock propagation in a layered thin foil. The simulation results are found to be in good
Salko, Robert K.; Schmidt, Rodney C.; Avramova, Maria N.
2014-11-23
This study describes major improvements to the computational infrastructure of the CTF subchannel code so that full-core, pincell-resolved (i.e., one computational subchannel per real bundle flow channel) simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department Of Energy Consortium for Advanced Simulation of Light Water Reactors (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis.
NASA Astrophysics Data System (ADS)
Meléndez, A.; Korenaga, J.; Sallarès, V.; Miniussi, A.; Ranero, C. R.
2015-10-01
We present a new 3-D traveltime tomography code (TOMO3D) for the modelling of active-source seismic data that uses the arrival times of both refracted and reflected seismic phases to derive the velocity distribution and the geometry of reflecting boundaries in the subsurface. This code is based on its popular 2-D version TOMO2D from which it inherited the methods to solve the forward and inverse problems. The traveltime calculations are done using a hybrid ray-tracing technique combining the graph and bending methods. The LSQR algorithm is used to perform the iterative regularized inversion to improve the initial velocity and depth models. In order to cope with an increased computational demand due to the incorporation of the third dimension, the forward problem solver, which takes most of the run time (~90 per cent in the test presented here), has been parallelized with a combination of multi-processing and message passing interface standards. This parallelization distributes the ray-tracing and traveltime calculations among available computational resources. The code's performance is illustrated with a realistic synthetic example, including a checkerboard anomaly and two reflectors, which simulates the geometry of a subduction zone. The code is designed to invert for a single reflector at a time. A data-driven layer-stripping strategy is proposed for cases involving multiple reflectors, and it is tested for the successive inversion of the two reflectors. Layers are bound by consecutive reflectors, and an initial velocity model for each inversion step incorporates the results from previous steps. This strategy poses simpler inversion problems at each step, allowing the recovery of strong velocity discontinuities that would otherwise be smoothened.
Tian, Wei-Wei; Gao, Yue-Dong; Guo, Yan; Huang, Jing-Fei; Xiao, Chang; Li, Zuo-Sheng; Zhang, Hua-Tang
2012-02-01
The tree shrew, which has received extensive attention as an ideal animal model for human disease research, requires essential research tools, in particular cellular markers and monoclonal antibodies for immunological studies. In this paper, a 1 365 bp full-length CD4 cDNA coding sequence was cloned from total RNA in the peripheral blood of tree shrews. The sequence fills two unknown fragment gaps in the predicted tree shrew CD4 cDNA in the GenBank database, and its molecular characteristics were analyzed and compared with those of other mammals using biology software such as Clustal W2.0. The results showed that the extracellular and intracellular domains of the tree shrew CD4 amino acid sequence are conserved. The tree shrew CD4 amino acid sequence showed a close genetic relationship with Homo sapiens and Macaca mulatta. Most regions of the tree shrew CD4 molecule surface showed positive charges, as in humans. However, compared with the human CD4 extracellular domain D1, the CD4 D1 surface of tree shrews showed more negative charges and two more N-glycosylation sites, which may affect antibody binding. This study provides a theoretical basis for the preparation and functional studies of CD4 monoclonal antibody.
NASA Astrophysics Data System (ADS)
Zaghi, S.
2014-07-01
OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc.) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows. Programming Language: OFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency. Parallel Frameworks Supported: the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms. Usability, Maintenance and Enhancement: in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git
Icarus: A 2D direct simulation Monte Carlo (DSMC) code for parallel computers. User's manual - V.3.0
Bartel, T.; Plimpton, S.; Johannes, J.; Payne, J.
1996-10-01
Icarus is a 2D Direct Simulation Monte Carlo (DSMC) code which has been optimized for the parallel computing environment. The code is based on the DSMC method of Bird and models from free-molecular to continuum flowfields in either cartesian (x, y) or axisymmetric (z, r) coordinates. Computational particles, representing a given number of molecules or atoms, are tracked as they have collisions with other particles or surfaces. Multiple species, internal energy modes (rotation and vibration), chemistry, and ion transport are modelled. A new trace species methodology for collisions and chemistry is used to obtain statistics for small species concentrations. Gas phase chemistry is modelled using steric factors derived from Arrhenius reaction rates. Surface chemistry is modelled with surface reaction probabilities. The electron number density is either a fixed external generated field or determined using a local charge neutrality assumption. Ion chemistry is modelled with electron impact chemistry rates and charge exchange reactions. Coulomb collision cross-sections are used instead of Variable Hard Sphere values for ion-ion interactions. The electrostatic fields can either be externally input or internally generated using a Langmuir-Tonks model. The Icarus software package includes the grid generation, parallel processor decomposition, postprocessing, and restart software. The commercial graphics package, Tecplot, is used for graphics display. The majority of the software packages are written in standard Fortran.
Large Scale Earth's Bow Shock with Northern IMF as Simulated by PIC Code in Parallel with MHD Model
NASA Astrophysics Data System (ADS)
Baraka, Suleiman
2016-06-01
In this paper, we propose a 3D kinetic model (particle-in-cell, PIC) for the description of the large scale Earth's bow shock. The proposed version is stable and does not require huge or extensive computer resources. Because PIC simulations work with scaled plasma and field parameters, we also propose to validate our code by comparing its results with the available MHD simulations under the same scaled solar wind (SW) and interplanetary magnetic field (IMF) conditions. We report new results from the two models. In both codes the Earth's bow shock position is found to be ≈14.8 R_E along the Sun-Earth line, and ≈29 R_E on the dusk side. Those findings are consistent with past in situ observations. Both simulations reproduce the theoretical jump conditions at the shock. However, the PIC code density and temperature distributions are inflated and slightly shifted sunward when compared to the MHD results. Kinetic electron motions and reflected ions upstream may cause this sunward shift. Species distributions in the foreshock region are depicted within the transition of the shock (measured ≈2 c/ω_pi for Θ_Bn = 90° and M_MS = 4.7) and in the downstream. The size of the foot jump in the magnetic field at the shock is measured to be 1.7 c/ω_pi. In the foreshock region, the thermal velocity is found to be 213 km s^-1 at 15 R_E and 63 km s^-1 at 12 R_E (magnetosheath region). Despite the large cell size of the current version of the PIC code, it is able to retain the macrostructure of planetary magnetospheres in a very short time, so it can be used for pedagogical test purposes. It is also likely complementary to MHD in deepening our understanding of the large scale magnetosphere.
NASA Astrophysics Data System (ADS)
Kaus, B.; Popov, A.
2014-12-01
The complexity of lithospheric rheology and the necessity to resolve the deformation patterns near the free surface (faults and folds) sufficiently well places a great demand on a stable and scalable modeling tool that is capable of efficiently handling nonlinearities. Our code LaMEM (Lithosphere and Mantle Evolution Model) is an attempt to satisfy this demand. The code utilizes a stable and numerically inexpensive finite difference discretization with the spatial staggering of velocity, pressure, and temperature unknowns (a so-called staggered grid). As a time discretization method, either forward Euler or a combination of the predictor-corrector and the fourth-order Runge-Kutta can be chosen. Elastic stresses are rotated on the markers, which are also used to track all relevant material properties and solution history fields. The Newtonian nonlinear iteration, however, is handled at the level of the grid points to avoid spurious averaging between markers and grid. Such an arrangement required us to develop a non-standard discretization of the effective strain-rate second invariant. An important feature of the code is its ability to handle stress-free and open-box boundary conditions, in which empty cells are simply eliminated from the discretization, which also solves the biggest problem of the sticky-air approach, namely large viscosity jumps near the free surface. We currently support an arbitrary combination of linear elastic, nonlinear viscous with multiple creep mechanisms, and plastic rheologies based on either a depth-dependent von Mises or pressure-dependent Drucker-Prager yield criteria. LaMEM is being developed as an inherently parallel code. Structurally all its parts are based on the building blocks provided by PETSc library. These include Jacobian-Free Newton-Krylov nonlinear solvers with convergence globalization techniques (line search), equipped with different linear preconditioners. We have also implemented the coupled velocity-pressure multigrid
NASA Astrophysics Data System (ADS)
Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor
2014-06-01
TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model, describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottlenecks were identified and reduced in the most feasible way, leading for instance to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum-, energy- and mass-coupling delivered good performance together with a high scalability on shared memory architectures. In contrast to that, the approach for distributed memory systems was to solve sub-problems resulting from the decomposition of the initial Cartesian geometry. Communication for the sub-problem boundary updates was accomplished by the Message Passing Interface (MPI) standard.
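The sub-problem boundary update described above can be illustrated with a minimal sketch. It is not TWOPORFLOW code: the 1D grid, the averaging stencil, and the names `step` and `step_decomposed` are assumptions made for illustration, and plain list copies stand in for the MPI messages that would carry the ghost cells between processes.

```python
def step(u):
    """One explicit averaging step on interior cells; the two end cells are held fixed."""
    interior = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def step_decomposed(u, split):
    """Same step computed on two subdomains (hypothetical split index, 1 <= split < len(u)).

    Each subdomain first receives one ghost cell from its neighbour; in a
    distributed-memory code this copy would be an MPI send/receive pair.
    """
    left = u[:split] + [u[split]]        # owned cells + ghost from the right neighbour
    right = [u[split - 1]] + u[split:]   # ghost from the left neighbour + owned cells
    left, right = step(left), step(right)
    # Drop the ghost cells and stitch the owned cells back together.
    return left[:-1] + right[1:]


u0 = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
serial = step(u0)
parallel = step_decomposed(u0, 3)
```

Because each subdomain sees exactly the neighbour values it needs, the decomposed step reproduces the undivided one; only the ghost cells have to be exchanged per update.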
A parallel PCG solver for MODFLOW.
Dong, Yanhui; Li, Guomin
2009-01-01
In order to simulate large-scale ground water flow problems more efficiently with MODFLOW, the OpenMP programming paradigm was used to parallelize the preconditioned conjugate-gradient (PCG) solver in this study. Incremental parallelization, the significant advantage supported by OpenMP on a shared-memory computer, allowed the solver to be converted to a parallel program smoothly, one block of code at a time. The parallel PCG solver, suitable for both MODFLOW-2000 and MODFLOW-2005, is verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. Based on the timing results, the parallel PCG solver is typically about 1.40 to 5.31 times faster than the serial one. In addition, the simulation results are exactly the same as those of the original PCG solver, because the majority of the serial code was not changed. It is worth noting that this parallelizing approach reduces cost in terms of software maintenance because only a single source PCG solver code needs to be maintained in the MODFLOW source tree.
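For orientation, a minimal sketch of a preconditioned conjugate-gradient iteration follows. It is not the MODFLOW solver: the function name `pcg`, the dense list-of-lists matrix, and the Jacobi (diagonal) preconditioner are assumptions chosen for brevity. The comments mark the loops whose iterations are independent, which is the kind of block an incremental, loop-by-loop shared-memory parallelization would target.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG for a symmetric positive-definite matrix A (list of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual b - A x for the initial guess x = 0
    if max(abs(v) for v in r) < tol:
        return x
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    z = [inv_diag[i] * r[i] for i in range(n)]  # apply preconditioner
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        # Matrix-vector product: each row is independent (parallelizable loop).
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Dot product: a reduction over independent terms.
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if max(abs(v) for v in r) < tol:
            break
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


solution = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Parallelizing only the marked loops leaves the sequence of arithmetic in the algorithm unchanged, which is consistent with the abstract's point that most of the serial code stays as it was.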
Mischerikow, Nikolai; van Nierop, Pim; Li, Ka Wan; Bernstein, Hans-Gert; Smit, August B; Heck, Albert J R; Altelaar, A F Maarten
2010-10-01
Isobaric stable isotope labeling of peptides using iTRAQ is an important method for MS based quantitative proteomics. Traditionally, quantitative analysis of iTRAQ labeled peptides has been confined to beam-type instruments because of the weak detection capabilities of ion traps for low mass ions. Recent technical advances in fragmentation techniques on linear ion traps and the hybrid linear ion trap-orbitrap allow circumventing this limitation. Namely, PQD and HCD facilitate iTRAQ analysis on these instrument types. Here we report a method for iTRAQ-based relative quantification on the ETD enabled LTQ Orbitrap XL, which is based on parallel peptide quantification and peptide identification. iTRAQ reporter ion generation is performed by HCD, while CID and ETD provide peptide identification data in parallel in the LTQ ion trap. This approach circumvents problems accompanying iTRAQ reporter ion generation with ETD and allows quantitative, decision tree-based CID/ETD experiments. Furthermore, the use of HCD solely for iTRAQ reporter ion read out significantly reduces the number of ions needed to obtain informative spectra, which significantly reduces the analysis time. Finally, we show that integration of this method, both with existing CID and ETD methods as well as with existing iTRAQ data analysis workflows, is simple to realize. By applying our approach to the analysis of the synapse proteome from human brain biopsies, we demonstrate that it outperforms a latest generation MALDI TOF/TOF instrument, with improvements in both peptide and protein identification and quantification. Conclusively, our work shows how HCD, CID and ETD can be beneficially combined to enable iTRAQ-based quantification on an ETD-enabled LTQ Orbitrap XL.
GOTPM: a parallel hybrid particle-mesh treecode
NASA Astrophysics Data System (ADS)
Dubinski, John; Kim, Juhan; Park, Changbom; Humble, Robin
2004-02-01
We describe a parallel, cosmological N-body code based on a hybrid scheme using the particle-mesh (PM) and Barnes-Hut (BH) oct-tree algorithm. We call the algorithm GOTPM for Grid-of-Oct-Trees-Particle-Mesh. The code is parallelized using the Message Passing Interface (MPI) library and is optimized to run on Beowulf clusters as well as symmetric multi-processors. The gravitational potential is determined on a mesh using a standard PM method with particle forces determined through interpolation. The softened PM force is corrected for short range interactions using a grid of localized BH trees throughout the entire simulation volume in a completely analogous way to P3M methods. This method makes no assumptions about the local density for short range force corrections and so is consistent with the results of the P3M method in the limit that the treecode opening angle parameter, θ→0. The PM method is parallelized using one-dimensional slice domain decomposition. Particles are distributed in slices of equal width to allow mass assignment onto mesh points. The Fourier transforms in the PM method are done in parallel using the MPI implementation of the FFTW package. Parallelization for the tree force corrections is achieved again using one-dimensional slices but the width of each slice is allowed to vary according to the amount of computational work required by the particles within each slice to achieve load balance. The tree force corrections dominate the computational load and so imbalances in the PM density assignment step do not impact the overall load balance and performance significantly. The code performance scales well to 128 processors and is significantly better than competing methods. We present preliminary results from simulations run on different platforms containing up to N=1 G particles to verify the code.
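The variable-width slice decomposition used for the tree force corrections can be sketched as follows. This is an illustrative reading of the abstract, not GOTPM code: the function names, the per-particle `work` estimates, and the greedy placement of cuts are assumptions.

```python
from bisect import bisect_left


def balanced_slices(xs, work, n_slices):
    """Return n_slices - 1 cut positions along x giving roughly equal work per slice.

    xs:   particle coordinates along the decomposition axis
    work: hypothetical per-particle cost estimates (e.g. interactions in the last step)
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    target = sum(work) / n_slices
    cuts, acc = [], 0.0
    for i in order:
        acc += work[i]
        # Place a cut at this particle once the running work reaches the next target.
        while len(cuts) < n_slices - 1 and acc >= target * (len(cuts) + 1):
            cuts.append(xs[i])
    return cuts


def slice_of(x, cuts):
    """Index of the slice owning position x; a particle on a cut belongs to the left slice."""
    return bisect_left(cuts, x)


xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
uniform_cuts = balanced_slices(xs, [1] * 8, 4)
clustered_cuts = balanced_slices(xs, [1, 1, 2, 4, 4, 2, 1, 1], 2)
```

With uniform work the cuts are equally spaced, matching the equal-width slices of the PM step; with clustered work the slices narrow where the particles are expensive.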
Shumaker, Dana E.; Steefel, Carl I.
2016-06-21
The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusaki (Steefel and Yabusaki 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.
High-Fidelity RF Gun Simulations with the Parallel 3D Finite Element Particle-In-Cell Code Pic3P
Candel, A; Kabel, A.; Lee, L.; Li, Z.; Limborg, C.; Ng, C.; Schussman, G.; Ko, K.; /SLAC
2009-06-19
SLAC's Advanced Computations Department (ACD) has developed the first parallel Finite Element 3D Particle-In-Cell (PIC) code, Pic3P, for simulations of RF guns and other space-charge dominated beam-cavity interactions. Pic3P solves the complete set of Maxwell-Lorentz equations and thus includes space charge, retardation and wakefield effects from first principles. Pic3P uses higher-order Finite Element methods on unstructured conformal meshes. A novel scheme for causal adaptive refinement and dynamic load balancing enables unprecedented simulation accuracy, aiding the design and operation of the next generation of accelerator facilities. Application to the Linac Coherent Light Source (LCLS) RF gun is presented.
Roth, Michael J.; Forbes, Andrew J.; Boyne, Michael T.; Kim, Yong-Bin; Robinson, Dana E.; Kelleher*, Neil L.
2005-01-01
The human proteome is a highly complex extension of the genome wherein a single gene often produces distinct protein forms due to alternative splicing, RNA-editing, polymorphisms, and posttranslational modifications (PTMs). Such biological variation compounded by the high sequence identity within gene families currently overwhelms the complete and routine characterization of mammalian proteins by mass spectrometry (MS). A new database of human proteins (and their possible variants) was created and searched using tandem mass spectrometric data from intact proteins. This first application of Top Down MS/MS to wild-type human proteins demonstrates both gene-specific identification and the unambiguous characterization of multi-faceted mass shifts (Δm’s). Such Δm values found from the precise identification of 45 protein forms from HeLa cells reveal 34 coding SNPs, two protein forms from alternative splicing, and 12 diverse modifications (not including simple N-terminal processing), including a previously unknown phosphorylation at 10% occupancy. Automated protein identification was achieved with a median probability score of 10^-13 and often occurred simultaneously with dissection of diverse sources of protein variability as they occur in combination. Top Down MS therefore has a bright future for enabling precise annotation of gene products expressed from the human genome by non-mass spectrometrists. PMID:15863400
Bass, Graham; Thomas, Russell; Pearce, Julia
2009-04-21
The most recent electron dosimetry code of practice for radiotherapy written by the Institute of Physics and Engineering in Medicine was published in 2003 and is based on the NPL electron absorbed dose to water calibration service. NPL has calibrated many Scanditronix type NACP-02 and PTW Roos type 34001 parallel plate ionization chambers in terms of absorbed dose to water, for use with the code of practice. The results of the calibrations of these chamber types summarized here include the absorbed dose to water sensitivity, where the mean calibration factor standard deviations are 5.8% for NACP-02 chambers and 1.1% for PTW Roos chambers. The correction for the polarity effect is shown to be small (less than 0.2% for all beam qualities) but with a discernible beam quality dependence. The correction for recombination is shown to be consistent and reproducible, and an analysis of these results suggests that the plate separation of the NACP-02 chambers is more variable from chamber to chamber than with the PTW Roos chambers. The calibration of these chambers is shown to be repeatable within +/-0.2% over 2-3 years. It is also shown that check source measurements can be repeated within +/-0.3% over several years. The results justify the use of NACP-02 and PTW 34001 chambers as secondary standards, but also indicate that the PTW 34001 chambers show less variation from chamber to chamber.
NASA Astrophysics Data System (ADS)
Bass, Graham; Thomas, Russell; Pearce, Julia
2009-04-01
The most recent electron dosimetry code of practice for radiotherapy written by the Institute of Physics and Engineering in Medicine was published in 2003 and is based on the NPL electron absorbed dose to water calibration service. NPL has calibrated many Scanditronix type NACP-02 and PTW Roos type 34001 parallel plate ionization chambers in terms of absorbed dose to water, for use with the code of practice. The results of the calibrations of these chamber types summarized here include the absorbed dose to water sensitivity, where the mean calibration factor standard deviations are 5.8% for NACP-02 chambers and 1.1% for PTW Roos chambers. The correction for the polarity effect is shown to be small (less than 0.2% for all beam qualities) but with a discernible beam quality dependence. The correction for recombination is shown to be consistent and reproducible, and an analysis of these results suggests that the plate separation of the NACP-02 chambers is more variable from chamber to chamber than with the PTW Roos chambers. The calibration of these chambers is shown to be repeatable within ±0.2% over 2-3 years. It is also shown that check source measurements can be repeated within ±0.3% over several years. The results justify the use of NACP-02 and PTW 34001 chambers as secondary standards, but also indicate that the PTW 34001 chambers show less variation from chamber to chamber.
A parallelized binary search tree
USDA-ARS?s Scientific Manuscript database
PTTRNFNDR is an unsupervised statistical learning algorithm that detects patterns in DNA sequences, protein sequences, or any natural language texts that can be decomposed into letters of a finite alphabet. PTTRNFNDR performs complex mathematical computations and its processing time increases when i...
Rubio, L; Ortiz, M C; Sarabia, L A
2014-04-11
A non-separative, fast and inexpensive spectrofluorimetric method based on the second order calibration of excitation-emission fluorescence matrices (EEMs) was proposed for the determination of carbaryl, carbendazim and 1-naphthol in dried lime tree flowers. The trilinearity property of three-way data was used to handle the intrinsic fluorescence of lime flowers and the difference in the fluorescence intensity of each analyte. It also made it possible to identify each analyte unequivocally. Trilinearity of the data tensor guarantees the uniqueness of the solution obtained through parallel factor analysis (PARAFAC), so the factors of the decomposition match up with the analytes. In addition, an experimental procedure was proposed to identify, with three-way data, the quenching effect produced by the fluorophores of the lime flowers. This procedure also enabled the selection of the adequate dilution of the lime flowers extract to minimize the quenching effect so the three analytes can be quantified. Finally, the analytes were determined using the standard addition method for a calibration whose standards were chosen with a D-optimal design. The three analytes were unequivocally identified by the correlation between the pure spectra and the PARAFAC excitation and emission spectral loadings. The trueness was established by the accuracy line "calculated concentration versus added concentration" in all cases. Better decision limit values (CCα), in x0=0 with the probability of false positive fixed at 0.05, were obtained for the calibration performed in pure solvent: 2.97 μg L(-1) for 1-naphthol, 3.74 μg L(-1) for carbaryl and 23.25 μg L(-1) for carbendazim. The CCα values for the second calibration carried out in matrix were 1.61, 4.34 and 51.75 μg L(-1) respectively; while the values obtained considering only the pure samples as calibration set were: 2.65, 8.61 and 28.7 μg L(-1), respectively.
ERIC Educational Resources Information Center
Walter, Pierre
2012-01-01
This study examines how cultural codes in environmental adult education can be used to "frame" collective identity, develop counterhegemonic ideologies, and catalyse "educative-activism" within social movements. Three diverse examples are discussed, spanning environmental movements in urban Victoria, British Columbia, Canada,…
Morozov, Dmitriy; Weber, Gunther H.
2014-03-31
Topological techniques provide robust tools for data analysis. They are used, for example, for feature extraction, for data de-noising, and for comparison of data sets. This chapter concerns contour trees, a topological descriptor that records the connectivity of the isosurfaces of scalar functions. These trees are fundamental to analysis and visualization of physical phenomena modeled by real-valued measurements. We study the parallel analysis of contour trees. After describing a particular representation of a contour tree, called local-global representation, we illustrate how different problems that rely on contour trees can be solved in parallel with minimal communication.
NASA Astrophysics Data System (ADS)
Štěpán, Jiří; Trujillo Bueno, Javier
2013-09-01
The interpretation of the intensity and polarization of the spectral line radiation produced in the atmosphere of the Sun and of other stars requires solving a radiative transfer problem that can be very complex, especially when the main interest lies in modeling the spectral line polarization produced by scattering processes and the Hanle and Zeeman effects. One of the difficulties is that the plasma of a stellar atmosphere can be highly inhomogeneous and dynamic, which implies the need to solve the non-equilibrium problem of the generation and transfer of polarized radiation in realistic three-dimensional (3D) stellar atmospheric models. Here we present PORTA, an efficient multilevel radiative transfer code we have developed for the simulation of the spectral line polarization caused by scattering processes and the Hanle and Zeeman effects in 3D models of stellar atmospheres. The numerical method of solution is based on the non-linear multigrid iterative method and on a novel short-characteristics formal solver of the Stokes-vector transfer equation which uses monotonic Bézier interpolation. Therefore, with PORTA the computing time needed to obtain at each spatial grid point the self-consistent values of the atomic density matrix (which quantifies the excitation state of the atomic system) scales linearly with the total number of grid points. Another crucial feature of PORTA is its parallelization strategy, which allows us to speed up the numerical solution of complicated 3D problems by several orders of magnitude with respect to sequential radiative transfer approaches, given its excellent linear scaling with the number of available processors. The PORTA code can also be conveniently applied to solve the simpler 3D radiative transfer problem of unpolarized radiation in multilevel systems.
van Beugen, Boeke J.; Gao, Zhenyu; Boele, Henk-Jan; Hoebeek, Freek; De Zeeuw, Chris I.
2013-01-01
Cerebellar granule cells (GrCs) convey information from mossy fibers (MFs) to Purkinje cells (PCs) via their parallel fibers (PFs). MF to GrC signaling allows transmission of frequencies up to 1 kHz and GrCs themselves can also fire bursts of action potentials with instantaneous frequencies up to 1 kHz. So far, in the scientific literature no evidence has been shown that these high-frequency bursts also exist in awake, behaving animals. More so, it remains to be shown whether such high-frequency bursts can transmit temporally coded information from MFs to PCs and/or whether these patterns of activity contribute to the spatiotemporal filtering properties of the GrC layer. Here, we show that, upon sensory stimulation in both un-anesthetized rabbits and mice, GrCs can show bursts that consist of tens of spikes at instantaneous frequencies over 800 Hz. In vitro recordings from individual GrC-PC pairs following high-frequency stimulation revealed an overall low initial release probability of ~0.17. Nevertheless, high-frequency burst activity induced a short-lived facilitation to ensure signaling within the first few spikes, which was rapidly followed by a reduction in transmitter release. The facilitation rate among individual GrC-PC pairs was heterogeneously distributed and could be classified as either “reluctant” or “responsive” according to their release characteristics. Despite the variety of efficacy at individual connections, grouped activity in GrCs resulted in a linear relationship between PC response and PF burst duration at frequencies up to 300 Hz allowing rate coding to persist at the network level. Together, these findings support the hypothesis that the cerebellar granular layer acts as a spatiotemporal filter between MF input and PC output (D’Angelo and De Zeeuw, 2009). PMID:23734102
NASA Astrophysics Data System (ADS)
Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul
2013-10-01
The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers, one at NERSC in the US built by Cray named Edison and the other at the Barcelona Supercomputer Center (BSC) in Spain built by IBM named MareNostrum III (MNIII), just happen to share the same Intel "Sandy Bridge" processors. The successful port of UCAN2 to MNIII, which came online first, has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.
NASA Astrophysics Data System (ADS)
Kumar, J.; Mills, R. T.; Lichtner, P. C.; Hammond, G. E.
2010-12-01
Fracture dominated flows occur in numerous subsurface geochemical processes and at many different scales in rock pore structures, micro-fractures, fracture networks and faults. Fractured porous media can be modeled as multiple interacting continua which are connected to each other through transfer terms that capture the flow of mass and energy in response to pressure, temperature and concentration gradients. However, the analysis of large-scale transient problems using the multiple interacting continuum approach presents an algorithmic and computational challenge for problems with very large numbers of degrees of freedom. A generalized dual porosity model based on the Dual Continuum Disconnected Matrix approach has been implemented within a massively parallel multiphysics-multicomponent-multiphase subsurface reactive flow and transport code PFLOTRAN. Developed as part of the Department of Energy's SciDAC-2 program, PFLOTRAN provides subsurface simulation capabilities that can scale from laptops to ultrascale supercomputers, and utilizes the PETSc framework to solve the large, sparse algebraic systems that arise in complex subsurface reactive flow and transport problems. It has been successfully applied to the solution of problems composed of more than two billion degrees of freedom, utilizing up to 131,072 processor cores on Jaguar, the Cray XT5 system at Oak Ridge National Laboratory that is the world’s fastest supercomputer. Building upon the capabilities and computational efficiency of PFLOTRAN, we will present an implementation of the multiple interacting continua formulation for fractured porous media along with an application case study.
NASA Astrophysics Data System (ADS)
Engel, D.; Klews, M.; Wunner, G.
2009-02-01
We have developed a new method for the fast computation of wavelengths and oscillator strengths for medium-Z atoms and ions, up to iron, at neutron star magnetic field strengths. The method is a parallelized Hartree-Fock approach in adiabatic approximation based on finite-element and B-spline techniques. It turns out that typically 15-20 finite elements are sufficient to calculate energies to within a relative accuracy of 10^-5 in 4 or 5 iteration steps using B-splines of 6th order, with parallelization speed-ups of 20 on a 26-processor machine. Results have been obtained for the energies of the ground states and excited levels and for the transition strengths of astrophysically relevant atoms and ions in the range Z=2…26 in different ionization stages. Catalogue identifier: AECC_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECC_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3845 No. of bytes in distributed program, including test data, etc.: 27 989 Distribution format: tar.gz Programming language: MPI/Fortran 95 and Python Computer: Cluster of 1-26 HP Compaq dc5750 Operating system: Fedora 7 Has the code been vectorised or parallelized?: Yes RAM: 1 GByte Classification: 2.1 External routines: MPI/GFortran, LAPACK, PyLab/Matplotlib Nature of problem: Calculations of synthetic spectra [1] of strongly magnetized neutron stars are bedevilled by the lack of data for atoms in intense magnetic fields. While the behaviour of hydrogen and helium has been investigated in detail (see, e.g., [2]), complete and reliable data for heavier elements, in particular iron, are still missing. Since neutron stars are formed by the collapse of the iron cores of massive stars, it may be assumed that their atmospheres contain an iron plasma. Our objective is to fill the gap
Poirot, Jordan; De Luna, Paolo; Rainer, Gregor
2016-04-01
We comprehensively characterize spiking and visual evoked potential (VEP) activity in tree shrew V1 and V2 using Cartesian, hyperbolic, and polar gratings. Neural selectivity to structure of Cartesian gratings was higher than other grating classes in both visual areas. From V1 to V2, structure selectivity of spiking activity increased, whereas corresponding VEP values tended to decrease, suggesting that single-neuron coding of Cartesian grating attributes improved while the cortical columnar organization of these neurons became less precise from V1 to V2. We observed that neurons in V2 generally exhibited similar selectivity for polar and Cartesian gratings, suggesting that structure of polar-like stimuli might be encoded as early as in V2. This hypothesis is supported by the preference shift from V1 to V2 toward polar gratings of higher spatial frequency, consistent with the notion that V2 neurons encode visual scene borders and contours. Neural sensitivity to modulations of polarity of hyperbolic gratings was highest among all grating classes and closely related to the visual receptive field (RF) organization of ON- and OFF-dominated subregions. We show that spatial RF reconstructions depend strongly on grating class, suggesting that intracortical contributions to RF structure are strongest for Cartesian and polar gratings. Hyperbolic gratings tend to recruit least cortical elaboration such that the RF maps are similar to those generated by sparse noise, which most closely approximate feedforward inputs. Our findings complement previous literature in primates, rodents, and carnivores and highlight novel aspects of shape representation and coding occurring in mammalian early visual cortex.
DIANE multiparticle transport code
NASA Astrophysics Data System (ADS)
Caillaud, M.; Lemaire, S.; Ménard, S.; Rathouit, P.; Ribes, J. C.; Riz, D.
2014-06-01
DIANE is the general Monte Carlo code developed at CEA-DAM. DIANE is a 3D multiparticle multigroup code. DIANE includes automated biasing techniques and is optimized for massive parallel calculations.
Scioto: A Framework for Global-View Task Parallelism
Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy
2008-09-09
We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.
Categorizing ideas about trees: a tree of trees.
Fisler, Marie; Lecointre, Guillaume
2013-01-01
The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in the history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired by the coding of homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-character matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a "tree of trees." Then, we categorize schools of tree-representations. Classical schools like "cladists" and "pheneticists" are recovered but others are not: "gradists" are separated into two blocks, one of them being called here "grade theoreticians." We propose new interesting categories like the "buffonian school," the "metaphoricians," and those using "strictly genealogical classifications." We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization.
Wang, Lin-Wang
2004-10-21
This is a total energy electronic structure code using the Local Density Approximation (LDA) of density functional theory. It uses plane waves as the wave function basis set. It can use both norm-conserving pseudopotentials and ultrasoft pseudopotentials. It can relax the atomic positions according to the total energy. It is a parallel code using MPI.
Eriksson, Maria E; Hoffman, Daniel; Kaduk, Mateusz; Mauriat, Mélanie; Moritz, Thomas
2015-02-01
Bioactive gibberellins (GAs) have been implicated in short day (SD)-induced growth cessation in Populus, because exogenous applications of bioactive GAs to hybrid aspens (Populus tremula × tremuloides) under SD conditions delay growth cessation. However, this effect diminishes with time, suggesting that plants may cease growth following exposure to SDs due to a reduction in sensitivity to GAs. In order to validate and further explore the role of GAs in growth cessation, we perturbed GA biosynthesis or signalling in hybrid aspen plants by overexpressing AtGA20ox1, AtGA2ox2 and PttGID1.3 (encoding GA biosynthesis enzymes and a GA receptor). We found trees with elevated concentrations of bioactive GA, due to overexpression of AtGA20ox1, continued to grow in SD conditions and were insensitive to the level of FLOWERING LOCUS T2 (FT2) expression. As transgenic plants overexpressing the PttGID1.3 GA receptor responded in a wild-type (WT) manner to SD conditions, this insensitivity did not result from limited receptor availability. As high concentrations of bioactive GA during SD conditions were sufficient to sustain shoot elongation growth in hybrid aspen trees, independent of FT2 expression levels, we conclude elongation growth in trees is regulated by both GA- and long day-responsive pathways, similar to the regulation of flowering in Arabidopsis thaliana.
Parallelized direct execution simulation of message-passing parallel programs
NASA Technical Reports Server (NTRS)
Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.
1994-01-01
As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
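The idea of feeding measured compute times into a discrete-event model of the network can be illustrated with a toy sketch. This is not LAPSE: the ring communication pattern, the fixed `latency`, and the function name `simulate_ring` are all assumptions made for illustration. Each process computes for a given time, sends one message to its ring neighbor, and finishes once both its own computation and the incoming message are complete.

```python
import heapq

def simulate_ring(compute_times, latency):
    """Toy discrete-event simulation of n processes arranged in a ring.

    compute_times[p] stands in for the time obtained by directly executing
    process p's code; `latency` is a hypothetical fixed network delay.
    Returns the simulated finish time of every process.
    """
    n = len(compute_times)
    events = []  # priority queue of (timestamp, kind, process)
    for p, c in enumerate(compute_times):
        heapq.heappush(events, (c, "compute_done", p))

    local_done = [None] * n   # when each process finished computing
    msg_arrival = [None] * n  # when the message from its predecessor arrived
    finish = [None] * n

    while events:
        t, kind, p = heapq.heappop(events)  # always handle the earliest event
        if kind == "compute_done":
            local_done[p] = t
            # The send is modeled as a future arrival event at the neighbor.
            heapq.heappush(events, (t + latency, "msg_arrives", (p + 1) % n))
        else:
            msg_arrival[p] = t
        if finish[p] is None and local_done[p] is not None and msg_arrival[p] is not None:
            finish[p] = max(local_done[p], msg_arrival[p])
    return finish
```

For example, `simulate_ring([1.0, 3.0, 2.0], 0.5)` returns `[2.5, 3.0, 3.5]`: process 0 finishes computing early but must wait for the message from process 2.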
NASA Astrophysics Data System (ADS)
KIM, Jong Woon; LEE, Young-Ouk
2017-09-01
As computing power gets better and better, computer codes that use a deterministic method seem to be less useful than those using the Monte Carlo method. In addition, users do not like to think about space, angles, and energy discretization for deterministic codes. However, a deterministic method is still powerful in that we can obtain a solution of the flux throughout the problem, particularly when particles can barely penetrate, such as in a deep penetration problem with small detection volumes. Recently, a new state-of-the-art discrete-ordinates code, ATTILA, was developed and has been widely used in several applications. ATTILA provides the capabilities to solve geometrically complex 3-D transport problems by using an unstructured tetrahedral mesh. Since 2009, we have been developing our own code by benchmarking ATTILA. AETIUS is a discrete ordinates code that, like ATTILA, uses an unstructured tetrahedral mesh. For pre- and post-processing, Gmsh is used to generate an unstructured tetrahedral mesh by importing a CAD file (*.step) and to visualize the calculation results of AETIUS. Using a CAD tool, the geometry can be modeled very easily. In this paper, we give a brief overview of AETIUS and provide numerical results from both AETIUS and a Monte Carlo code, MCNP5, in a deep penetration problem with small detection volumes. The results demonstrate the effectiveness and efficiency of AETIUS for such calculations.
A systolic array parallelizing compiler
Tseng, P.S.
1990-01-01
This book presents a completely new approach to the problem of building a systolic array parallelizing compiler. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler that can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strengths of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and the compiler-generated parallel code is given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.
King, J. R.; Pankin, A. Y.; Kruger, S. E.; ...
2016-06-24
The extended-MHD NIMROD code [C. R. Sovinec and J. R. King, J. Comput. Phys. 229, 5803 (2010)] is verified against the ideal-MHD ELITE code [H. R. Wilson et al., Phys. Plasmas 9, 1277 (2002)] on a diverted tokamak discharge. When the NIMROD model complexity is increased incrementally, resistive and first-order finite-Larmor-radius effects are destabilizing and stabilizing, respectively. Lastly, the full result is compared to local analytic calculations, which are found to overpredict both the resistive destabilization and drift stabilization in comparison to the NIMROD computations.
King, J. R.; Pankin, A. Y.; Kruger, S. E.; Snyder, P. B.
2016-06-15
The extended-MHD NIMROD code [C. R. Sovinec and J. R. King, J. Comput. Phys. 229, 5803 (2010)] is verified against the ideal-MHD ELITE code [H. R. Wilson et al., Phys. Plasmas 9, 1277 (2002)] on a diverted tokamak discharge. When the NIMROD model complexity is increased incrementally, resistive and first-order finite-Larmor-radius effects are destabilizing and stabilizing, respectively. The full result is compared to local analytic calculations, which are found to overpredict both the resistive destabilization and drift stabilization in comparison to the NIMROD computations.
NASA Astrophysics Data System (ADS)
King, J. R.; Pankin, A. Y.; Kruger, S. E.; Snyder, P. B.
2016-06-01
The extended-MHD NIMROD code [C. R. Sovinec and J. R. King, J. Comput. Phys. 229, 5803 (2010)] is verified against the ideal-MHD ELITE code [H. R. Wilson et al., Phys. Plasmas 9, 1277 (2002)] on a diverted tokamak discharge. When the NIMROD model complexity is increased incrementally, resistive and first-order finite-Larmor-radius effects are destabilizing and stabilizing, respectively. The full result is compared to local analytic calculations, which are found to overpredict both the resistive destabilization and drift stabilization in comparison to the NIMROD computations.
Treleaven, P.
1989-01-01
This book presents an introduction to object-oriented, functional, and logic parallel computing on which the fifth generation of computer systems will be based. Coverage includes concepts for parallel computing languages, a parallel object-oriented system (DOOM) and its language (POOL), an object-oriented multilevel VLSI simulator using POOL, and implementation of lazy functional languages on parallel architectures.
NASA Astrophysics Data System (ADS)
Konishi, Tsuyoshi; Tanida, Jun; Ichioka, Yoshiki
1995-06-01
A novel technique, the visual-area coding technique (VACT), for the optical implementation of fuzzy logic with the capability of visualization of the results is presented. This technique is based on the microfont method and is considered to be an instance of digitized analog optical computing. Huge amounts of data can be processed in fuzzy logic with the VACT. In addition, real-time visualization of the processed result can be accomplished.
Categorizing Ideas about Trees: A Tree of Trees
Fisler, Marie; Lecointre, Guillaume
2013-01-01
The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a “tree of trees.” Then, we categorize schools of tree-representations. Classical schools like “cladists” and “pheneticists” are recovered but others are not: “gradists” are separated into two blocks, one of them being called here “grade theoreticians.” We propose new interesting categories like the “buffonian school,” the “metaphoricians,” and those using “strictly genealogical classifications.” We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization. PMID:23950877
Bilingual parallel programming
Foster, I.; Overbeek, R.
1990-01-01
Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seems to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach provides an effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.
Trees of trees: an approach to comparing multiple alternative phylogenies.
Nye, Tom M W
2008-10-01
Phylogenetic analysis very commonly produces several alternative trees for a given fixed set of taxa. For example, different sets of orthologous genes may be analyzed, or the analysis may sample from a distribution of probable trees. This article describes an approach to comparing and visualizing multiple alternative phylogenies via the idea of a "tree of trees" or "meta-tree." A meta-tree clusters phylogenies with similar topologies together in the same way that a phylogeny clusters species with similar DNA sequences. Leaf nodes on a meta-tree correspond to the original set of phylogenies given by some analysis, whereas interior nodes correspond to certain consensus topologies. The construction of meta-trees is motivated by analogy with construction of a most parsimonious tree for DNA data, but instead of using DNA letters, in a meta-tree the characters are partitions or splits of the set of taxa. An efficient algorithm for meta-tree construction is described that makes use of a known relationship between the majority consensus and parsimony in terms of gain and loss of splits. To illustrate these ideas meta-trees are constructed for two datasets: a set of gene trees for species of yeast and trees from a bootstrap analysis of a set of gene trees in ray-finned fish. A software tool for constructing meta-trees and comparing alternative phylogenies is available online, and the source code can be obtained from the author.
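Treating splits of the taxon set as characters can be sketched in a few lines. This is not the author's meta-tree algorithm: it only shows, under simplifying assumptions, how trees encoded as sets of splits can be compared and how majority-consensus splits are counted. The helper names `canon`, `rf_distance` and `majority_splits` are hypothetical, and the split-difference count is the Robinson-Foulds distance, which the abstract does not name.

```python
from collections import Counter

def canon(side, taxa):
    """Canonical form of a split: the side that excludes the smallest taxon."""
    side = set(side)
    if min(taxa) in side:
        side = set(taxa) - side
    return frozenset(side)

def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(tree_a ^ tree_b)

def majority_splits(trees):
    """Splits that occur in more than half of the input trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {s for s, c in counts.items() if c > len(trees) / 2}

if __name__ == "__main__":
    taxa = "ABCDE"
    # Each tree is the set of its non-trivial splits (illustrative data only).
    t1 = {canon("AB", taxa), canon("DE", taxa)}
    t2 = {canon("AB", taxa), canon("CD", taxa)}
    t3 = {canon("AC", taxa), canon("DE", taxa)}
    print(rf_distance(t1, t2))
    print(sorted("".join(sorted(s)) for s in majority_splits([t1, t2, t3])))
```

Storing each split in a canonical orientation lets ordinary set operations stand in for the gain and loss of splits between trees.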
Lee, Eun Young; Lee, Hwan Young; Oh, Se Yoon; Jung, Sang-Eun; Yang, In Seok; Lee, Yang-Han; Yang, Woo Ick; Shin, Kyoung-Jin
2016-05-01
The application of next-generation sequencing (NGS) to forensic genetics is being explored by an increasing number of laboratories because of the potential of high-throughput sequencing for recovering genetic information from multiple markers and multiple individuals in a single run. A cumbersome and technically challenging library construction process is required for NGS. In this study, we propose a simplified library preparation method for mitochondrial DNA (mtDNA) analysis that involves two rounds of PCR amplification. In the first-round of multiplex PCR, six fragments covering the entire mtDNA control region and 22 fragments covering interspersed single nucleotide polymorphisms (SNPs) in the coding region that can be used to determine global haplogroups and East Asian haplogroups were amplified using template-specific primers with read sequences. In the following step, indices and platform-specific sequences for the MiSeq(®) system (Illumina) were added by PCR. The barcoded library produced using this simplified workflow was successfully sequenced on the MiSeq system using the MiSeq Reagent Nano Kit v2. A total of 0.4 GB of sequences, 80.6% with base quality of >Q30, were obtained from 12 degraded DNA samples and mapped to the revised Cambridge Reference Sequence (rCRS). A relatively even read count was obtained for all amplicons, with an average coverage of 5200 × and a less than three-fold read count difference between amplicons per sample. Control region sequences were successfully determined, and all samples were assigned to the relevant haplogroups. In addition, enhanced discrimination was observed by adding coding region SNPs to the control region in in silico analysis. Because the developed multiplex PCR system amplifies small-sized amplicons (<250 bp), NGS analysis using the library preparation method described here allows mtDNA analysis using highly degraded DNA samples. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) has been developed and implemented in Ukraine. A parallel version of the code has recently been developed for multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on binary-tree decomposition of the watershed to balance the amount of computation across all processors/cores. The Message Passing Interface (MPI) protocol is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood prediction for mountain watersheds of the Ukrainian Carpathian region. The modeling results are compared with predictions based on lumped-parameter models.
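The load-balancing idea behind a binary-tree decomposition can be sketched as recursive bisection. The abstract does not describe the actual TOPKAPI-IMMS algorithm, so this is only an assumption-laden illustration: the watershed is reduced to an ordered list of per-cell workloads, and the function name `bisect_balanced` is hypothetical. Each level of the binary tree splits a contiguous run of cells where the two halves carry the most similar total cost.

```python
def bisect_balanced(costs, depth):
    """Split an ordered list of workloads into up to 2**depth contiguous chunks.

    At every level the cut is placed where the two sides have the closest
    total cost. A chunk with a single element is not split further, so fewer
    than 2**depth chunks may be returned.
    """
    if depth == 0 or len(costs) <= 1:
        return [costs]
    total = sum(costs)
    running, best_i, best_gap = 0, 1, float("inf")
    for i in range(1, len(costs)):
        running += costs[i - 1]          # cost of costs[:i]
        gap = abs(total - 2 * running)   # |right side - left side|
        if gap < best_gap:
            best_i, best_gap = i, gap
    return (bisect_balanced(costs[:best_i], depth - 1)
            + bisect_balanced(costs[best_i:], depth - 1))
```

With `costs = [4, 1, 1, 2, 3, 1, 2, 2]` and `depth = 2`, the four chunks are `[4]`, `[1, 1, 2]`, `[3, 1]` and `[2, 2]`, each with total cost 4, so four processes would receive equal work.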
NASA Technical Reports Server (NTRS)
Hribar, Michelle R.; Frumkin, Michael; Jin, Haoqiang; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
Over the past decade, high performance computing has evolved rapidly; systems based on commodity microprocessors have been introduced in quick succession from at least seven vendors/families. Porting codes to every new architecture is a difficult problem; in particular, here at NASA, there are many large CFD applications that are very costly to port to new machines by hand. The LCM ("Legacy Code Modernization") Project is the development of an integrated parallelization environment (IPE) which performs the automated mapping of legacy CFD (Fortran) applications to state-of-the-art high performance computers. While most projects to port codes focus on the parallelization of the code, we consider porting to be an iterative process consisting of several steps: 1) code cleanup, 2) serial optimization, 3) parallelization, 4) performance monitoring and visualization, 5) intelligent tools for automated tuning using performance prediction, and 6) machine-specific optimization. The approach for building this parallelization environment is to build the components for each of the steps simultaneously and then integrate them together. The demonstration will exhibit our latest research in building this environment: 1. Parallelizing tools and compiler evaluation. 2. Code cleanup and serial optimization using automated scripts. 3. Development of a code generator for performance prediction. 4. Automated partitioning. 5. Automated insertion of directives. These demonstrations will exhibit the effectiveness of an automated approach for all the steps involved with porting and tuning a legacy code application for a new architecture.
Force user's manual: A portable, parallel FORTRAN
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.
1990-01-01
The use of Force, a parallel, portable FORTRAN for shared memory parallel computers, is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, and Alliant computers on which it is installed.
GSHR-Tree: a spatial index tree based on dynamic spatial slot and hash table in grid environments
NASA Astrophysics Data System (ADS)
Chen, Zhanlong; Wu, Xin-cai; Wu, Liang
2008-12-01
distributed operation, reduplication operation transfer operation of spatial index in the grid environment. The design of GSHR-Tree ensures load balance in parallel computation. This tree structure is suited to parallel processing of spatial information in distributed network environments. Instead of the recursive comparison of spatial objects used by the original R-tree, the algorithm builds the spatial index with binary code operations, which computers execute more efficiently, and an extended dynamic hash code for bit comparison. In GSHR-Tree, a new server is assigned to the network whenever a split of a full node is required. We describe a more flexible allocation protocol which copes with a temporary shortage of storage resources. It uses a distributed balanced binary spatial tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. The application manipulates the GSHR-Tree structure from a node in the grid environment. The node addresses the tree through its image, which the splits can make outdated. This may generate addressing errors, which are resolved by forwarding among the servers. In this paper, a spatial index data distribution algorithm that limits the number of servers is proposed. We improve the storage utilization at the cost of additional messages. We believe that this grid spatial index scheme should fit the needs of new applications using ever larger sets of spatial data. Our proposal constitutes a flexible storage allocation method for a distributed spatial index. The insertion policy can be tuned dynamically to cope with periods of storage shortage. In such cases storage balancing should be favored for better space utilization, at the price of extra message exchanges between servers. This structure strikes a compromise between updating the duplicated index and transforming the spatial index data.
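The abstract does not say which binary encoding GSHR-Tree uses, so the following is only a sketch of one common way to replace recursive rectangle comparisons with bit operations: a bit-interleaved (Z-order, or Morton) key for grid cells. The function names `morton2d` and `same_cell`, and the choice of Morton codes itself, are assumptions made for illustration.

```python
def morton2d(ix, iy, bits=16):
    """Interleave the low `bits` bits of two grid indices into one key.

    Bit b of ix goes to position 2*b, bit b of iy to position 2*b + 1.
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code

def same_cell(code_a, code_b, level, bits=16):
    """True if two keys fall in the same quadtree cell at `level`.

    Level 0 is the whole domain; each level adds two key bits, so the test
    is a comparison of key prefixes rather than a geometric comparison.
    """
    shift = 2 * (bits - level)
    return (code_a >> shift) == (code_b >> shift)
```

With 4-bit indices, `morton2d(3, 5, bits=4)` is 39 and `morton2d(2, 4, bits=4)` is 36; the two keys share a prefix at level 3, so `same_cell(39, 36, 3, bits=4)` is true. A key prefix could also serve as a hash bucket for assigning cells to servers, though that mapping is not described in the abstract.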
Meeting the
Chen, J.; Alpan, F. A.; Fischer, G.A.; Fero, A.H.
2011-07-01
Traditional two-dimensional (2D)/one-dimensional (1D) SYNTHESIS methodology has been widely used to calculate fast neutron (>1.0 MeV) fluence exposure to the reactor pressure vessel in the belt-line region. However, it is expected that this methodology cannot provide accurate fast neutron fluence calculations at elevations far above or below the active core region. A three-dimensional (3D) parallel discrete ordinates calculation for ex-vessel neutron dosimetry on a Westinghouse 4-Loop XL Pressurized Water Reactor has been done. It shows good agreement between the calculated results and measured results. Furthermore, the results show very different fast neutron flux values at some of the former plate locations and elevations above and below the active core than those calculated by a 2D/1D SYNTHESIS method. This indicates that for certain irregular reactor internal structures, where the fast neutron flux has a very strong local effect, a 3D transport method is required to calculate accurate fast neutron exposure. (authors)
Status of TRANSP Parallel Services
NASA Astrophysics Data System (ADS)
Indireshkumar, K.; Andre, Robert; McCune, Douglas; Randerson, Lewis
2006-10-01
The PPPL TRANSP code suite has been used successfully over many years to carry out time dependent simulations of tokamak plasmas. However, accurately modeling certain phenomena such as RF heating and fast ion behavior using TRANSP requires extensive computational power and will benefit from parallelization. Parallelizing all of TRANSP is not required and parts will run sequentially while other parts run parallelized. To efficiently use a site's parallel services, the parallelized TRANSP modules are deployed to a shared "parallel service" on a separate cluster. The PPPL Monte Carlo fast ion module NUBEAM and the MIT RF module TORIC are the first TRANSP modules to be so deployed. This poster will show the performance scaling of these modules within the parallel server. Communications between the serial client and the parallel server will be described in detail, and measurements of startup and communications overhead will be shown. Physics modeling benefits for TRANSP users will be assessed.
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the code necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor of 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE.
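How a binary tree reduces the cost of a gravitational force sum can be shown with a deliberately small sketch. It is not VINE's tree construction or data layout: particles sit on a line, the tree is built by halving a position-sorted list, and the opening parameter `theta`, the softening `eps`, the unit gravitational constant and all names are assumptions made for illustration. A node is used as a single point mass when its extent is small compared with its distance from the evaluation point; otherwise its two children are visited.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    lo: float                      # smallest particle position in the node
    hi: float                      # largest particle position in the node
    mass: float                    # total mass of the node
    com: float                     # center of mass of the node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(particles):
    """Build a binary tree from a non-empty, position-sorted list of (x, m)."""
    mass = sum(m for _, m in particles)
    com = sum(x * m for x, m in particles) / mass
    node = Node(particles[0][0], particles[-1][0], mass, com)
    if len(particles) > 1:
        mid = len(particles) // 2
        node.left = build(particles[:mid])
        node.right = build(particles[mid:])
    return node

def accel(node, x, theta=0.5, eps=1e-3):
    """Softened acceleration at position x (G = 1, eps must be positive)."""
    d = node.com - x
    size = node.hi - node.lo
    # Leaves are always evaluated directly; internal nodes only when they
    # look small from x (size / distance below the opening parameter).
    if node.left is None or size < theta * abs(d):
        return node.mass * d / (d * d + eps * eps) ** 1.5
    return accel(node.left, x, theta, eps) + accel(node.right, x, theta, eps)
```

Setting `theta=0.0` forces every node to be opened, which reproduces the direct particle-by-particle sum and gives a simple correctness check; larger values trade accuracy for fewer node visits.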
NASA Astrophysics Data System (ADS)
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the code necessary to obtain forces efficiently from special purpose "GRAPE" hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor of 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE.
Foster, I.; Tuecke, S.
1991-12-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A).
Foster, I.; Tuecke, S.
1991-09-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, a set of tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory at info.mcs.anl.gov.
Coset Codes Viewed as Terminated Convolutional Codes
NASA Technical Reports Server (NTRS)
Fossorier, Marc P. C.; Lin, Shu
1996-01-01
In this paper, coset codes are considered as terminated convolutional codes. Based on this approach, three new general results are presented. First, it is shown that the iterative squaring construction can equivalently be defined from a convolutional code whose trellis terminates. This convolutional code determines a simple encoder for the coset code considered, and the state and branch labelings of the associated trellis diagram become straightforward. Also, from the generator matrix of the code in its convolutional code form, much information about the trade-off between the state connectivity and complexity at each section, and the parallel structure of the trellis, is directly available. Based on this generator matrix, it is shown that the parallel branches in the trellis diagram of the convolutional code represent the same coset code C(sub 1), of smaller dimension and shorter length. Utilizing this fact, a two-stage optimum trellis decoding method is devised. The first stage decodes C(sub 1), while the second stage decodes the associated convolutional code, using the branch metrics delivered by stage 1. Finally, a bidirectional decoding of each received block starting at both ends is presented. If about the same number of computations is required, this approach remains very attractive from a practical point of view as it roughly doubles the decoding speed. This fact is particularly interesting whenever the second half of the trellis is the mirror image of the first half, since the same decoder can be implemented for both parts.
Ghosh, J.; Harrison, C.G.
1990-01-01
The present conference discusses topics in the fields of VLSI-based and real-time image-processing systems, parallel architectures for image processing, image-processing algorithms, and image processing on the basis of artificial neural networks. Attention is given to a fixed-point VLSI architecture for high-speed image reconstruction, an orthogonal multiprocessor for image processing with neural networks, massively parallel processors in real-time applications, the use of the adiabatic approximation as a tool in image estimation, parallel algorithms for contour-extraction and coding, and a parallel architecture for multidimensional image processing. Also discussed are concurrent image-processing on hypercube multicomputers, neural-network simulation on a reduced-mesh-of-trees organization, and a goal-seeking neural net for recall and recognition.
Electrical Circuit Simulation Code
Wix, Steven D.; Waters, Arlon J.; Shirley, David
2001-08-09
Massively Parallel Electrical Circuit Simulation Code. CHILESPICE is a massively parallel distributed-memory electrical circuit simulation tool that contains many enhanced radiation, time-based, and thermal features and models. Large scale electronic circuit simulation. Shared memory, parallel processing, enhanced convergence. Sandia-specific device models.
FLY: MPI-2 High Resolution code for LSS Cosmological Simulations
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio, V.; Comparato, M.
2010-11-01
Cosmological simulations of structure and galaxy formation have played a fundamental role in the study of the origin, formation and evolution of the Universe. These studies improved enormously with the use of supercomputers and parallel systems and, recently, grid-based systems and Linux clusters. We present the new version of the tree N-body parallel code FLY, which runs on a PC Linux cluster using the one-sided communication paradigm of MPI-2, and we show the performance obtained. FLY is included in the Computer Physics Communications Program Library. This new version was developed using the Linux cluster of CINECA, an IBM cluster with 1024 Intel Xeon Pentium IV 3.0 GHz processors. The results show that it is possible to run a 64-million-particle simulation in less than 15 minutes per timestep, and that the code scales with the number of processors. This leads us to propose FLY as a code to run very large N-body simulations with more than 10^9 particles with the higher resolution of a pure tree code.
FLY: MPI-2 high resolution code for LSS cosmological simulations
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.; Comparato, M.
2007-02-01
Cosmological simulations of structure and galaxy formation have played a fundamental role in the study of the origin, formation and evolution of the Universe. These studies improved enormously with the use of supercomputers and parallel systems and, recently, grid-based systems and Linux clusters. We present the new version of the tree N-body parallel code FLY, which runs on a PC Linux cluster using the one-sided communication paradigm of MPI-2, and we show the performance obtained. FLY is included in the Computer Physics Communications Program Library. This new version was developed using the Linux cluster of CINECA, an IBM cluster with 1024 Intel Xeon Pentium IV 3.0 GHz processors. The results show that it is possible to run a 64-million-particle simulation in less than 15 minutes per time-step, and that the code scales with the number of processors. This leads us to propose FLY as a code to run very large N-body simulations with more than 10^9 particles with the higher resolution of a pure tree code. The new version of FLY is available at the CPC Program Library, http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0.html [U. Becciani, M. Comparato, V. Antonuccio-Delogu, Comput. Phys. Comm. 174 (2006) 605].
Parallelization of heterogeneous reactor calculations on a graphics processing unit
NASA Astrophysics Data System (ADS)
Malofeev, V. M.; Pal'shin, V. A.
2016-12-01
Parallelization is applied to the neutron calculations performed by the heterogeneous method on a graphics processing unit. The parallel algorithm of the modified TREC code is described. The efficiency of the parallel algorithm is evaluated.
Start/Pat; A parallel-programming toolkit
Appelbe, B.; Smith, K.; McDowell, C.
1989-07-01
How can you make Fortran code parallel without isolating the programmer from learning to understand and exploit parallelism effectively? With an interactive toolkit that automates parallelization as it educates. This paper discusses the Start/Pat toolkit.
Parallelization of heterogeneous reactor calculations on a graphics processing unit
Malofeev, V. M.; Pal'shin, V. A.
2016-12-15
Parallelization is applied to the neutron calculations performed by the heterogeneous method on a graphics processing unit. The parallel algorithm of the modified TREC code is described. The efficiency of the parallel algorithm is evaluated.
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
Simple, parallel virtual machines for extreme computations
NASA Astrophysics Data System (ADS)
Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jürgen
2015-11-01
We introduce a virtual machine (VM) written in a numerically fast language like Fortran or C for evaluating very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte-code from the optimal matrix element generator, O'MEGA. Furthermore, this approach allows one to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze the scaling behavior with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and in general has runtimes of the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.
NASA Astrophysics Data System (ADS)
Huberman, Bernardo A.
1989-11-01
This paper reviews three different aspects of parallel computation which are useful for physics. The first part deals with special architectures for parallel computing (SIMD and MIMD machines) and their differences, with examples of their uses. The second section discusses the speedup that can be achieved in parallel computation and the constraints generated by the issues of communication and synchrony. The third part describes computation by distributed networks of powerful workstations without global controls and the issues involved in understanding their behavior.
Multigrid on massively parallel architectures
Falgout, R D; Jones, J E
1999-09-17
The scalable implementation of multigrid methods for machines with several thousands of processors is investigated. Parallel performance models are presented for three different structured-grid multigrid algorithms, and a description is given of how these models can be used to guide implementation. Potential pitfalls are illustrated when moving from moderate-sized parallelism to large-scale parallelism, and results are given from existing multigrid codes to support the discussion. Finally, the use of mixed programming models is investigated for multigrid codes on clusters of SMPs.
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Kok, J.
1988-01-01
To the human programmer the ease of coding distributed computing is highly dependent on the suitability of the employed programming language. But with a particular language it is also important whether the possibilities of one or more parallel architectures can efficiently be addressed by available language constructs. In this paper the possibilities are discussed of the high-level language Ada and in particular of its tasking concept as a descriptional tool for the design and implementation of numerical and other algorithms that allow execution of parts in parallel. Language tools are explained and their use for common applications is shown. Conclusions are drawn about the usefulness of several Ada concepts.
Binary Trees and Parallel Scheduling Algorithms.
1980-09-01
in part by the National Science Foundation under grant MCS80-005856 and in part by the Office of Naval Research under contract N00014-80-C-0650. Offie...respectively. Since the weights w. play no part in the Lmax problem, we shall only consider triples (ri, di, pi) in these sub-sections. 2.i.1 p=i, 1<i... parts ). Recall that k denotes the number of distinct release times and that at each node at most one additional job split can occur. Because of the
Li, Lixin; Losser, Travis; Yorke, Charles; Piltner, Reinhard
2014-01-01
Epidemiological studies have identified associations between mortality and changes in concentration of particulate matter. These studies have highlighted the public concerns about health effects of particulate air pollution. Modeling fine particulate matter PM2.5 exposure risk and monitoring day-to-day changes in PM2.5 concentration is a critical step for understanding the pollution problem and embarking on the necessary remedy. This research designs, implements and compares two inverse distance weighting (IDW)-based spatiotemporal interpolation methods, in order to assess the trend of daily PM2.5 concentration for the contiguous United States over the year of 2009, at both the census block group level and county level. Traditionally, when handling spatiotemporal interpolation, researchers tend to treat space and time separately and reduce the spatiotemporal interpolation problems to a sequence of snapshots of spatial interpolations. In this paper, PM2.5 data interpolation is conducted in the continuous space-time domain by integrating space and time simultaneously, using the so-called extension approach. Time values are calculated with the help of a factor under the assumption that spatial and temporal dimensions are equally important when interpolating a continuous changing phenomenon in the space-time domain. Various IDW-based spatiotemporal interpolation methods with different parameter configurations are evaluated by cross-validation. In addition, this study explores computational issues (computer processing speed) faced during implementation of spatiotemporal interpolation for huge data sets. Parallel programming techniques and an advanced data structure, named k-d tree, are adapted in this paper to address the computational challenges. Significant computational improvement has been achieved. Finally, a web-based spatiotemporal IDW-based interpolation application is designed and implemented where users can visualize and animate spatiotemporal interpolation
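The extension approach described above, in which time is scaled by a factor and then treated as one more coordinate, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the scale factor `c`, the power `p`, the neighbor count `k`, and the tuple layout of the samples are all assumptions, and a brute-force neighbor search stands in for the k-d tree mentioned in the abstract.

```python
import math

def idw_spacetime(samples, query, c=1.0, p=2.0, k=3):
    """Estimate a value at query=(x, y, t) from samples [(x, y, t, value), ...].

    Time is multiplied by the hypothetical scale factor c so that it can be
    treated as a third spatial coordinate (the "extension approach").
    """
    qx, qy, qt = query
    dists = []
    for x, y, t, v in samples:
        d = math.sqrt((x - qx) ** 2 + (y - qy) ** 2 + (c * (t - qt)) ** 2)
        if d == 0.0:
            return v  # the query coincides with a sample point
        dists.append((d, v))
    # Brute-force k nearest neighbors; a k-d tree would replace this sort.
    nearest = sorted(dists)[:k]
    # Inverse distance weights: closer samples count more.
    weights = [1.0 / d ** p for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
```

A query midway between two samples at the same time receives equal weights and therefore the mean of the two values. Increasing `c` makes samples from other days count less relative to nearby samples from the same day.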
The language parallel Pascal and other aspects of the massively parallel processor
NASA Technical Reports Server (NTRS)
Reeves, A. P.; Bruner, J. D.
1982-01-01
A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
1988-03-01
List of Figures: Figure 2.1, Relational Algebra Tree; Figure 3.1, DAG Representation of Terms; Figure 3.2, Example of program
High Dimensional Trellis Coded Modulation
2002-03-01
popular recently for the decoding of turbo codes (or parallel concatenated codes) which require an iteration between two permuted code sequences. The...nonsystematic constituent codes) Published descriptions of the implementation of turbo decoders refer to the permuted “common” or “extrinsic” information...invented based on that condition. With the recent development of turbo codes [4] and the requirement of short frame transmission [5] [6], trellis
Progress in parallelizing XOOPIC
NASA Astrophysics Data System (ADS)
Mardahl, Peter; Verboncoeur, J. P.
1997-11-01
XOOPIC (Object-Oriented Particle-in-Cell code for X11-based Unix workstations) is presently a serial 2-D 3v particle-in-cell plasma simulation code (J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, "An object-oriented electromagnetic PIC code," Computer Physics Communications 87 (1995) 199-211). The present effort focuses on using parallel and distributed processing to optimize the simulation for large problems. The benefits include increased capacity for memory-intensive problems and improved performance for processor-intensive problems. The MPI library is used so that the parallel version can be easily ported to massively parallel, SMP, and distributed computers. The philosophy employed here is to spatially decompose the system into computational regions separated by 'virtual boundaries', objects which contain the local data and algorithms to perform the local field solve and particle communication between regions. This implementation will reduce the changes that parallelization requires in the rest of the program. Specific implementation details such as the hiding of communication latency behind local computation will also be discussed.
Implementation and performance of parallelized elegant.
Wang, Y.; Borland, M.; Accelerator Systems Division
2008-01-01
The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.
Parallel Processing of a Groundwater Contaminant Code
Arnett, Ronald Chester; Greenwade, Lance Eric
2000-05-01
The U.S. Department of Energy’s Idaho National Engineering and Environmental Laboratory (INEEL) is conducting a field test of experimental enhanced bioremediation of trichloroethylene (TCE) contaminated groundwater. TCE is a chlorinated organic substance that was used as a solvent in the early years of the INEEL and disposed in some cases to the aquifer. There is an effort underway to enhance the natural bioremediation of TCE by adding a non-toxic substance that serves as a feed material for the bacteria that can biologically degrade the TCE.
Parallel machines: Parallel machine languages
Iannucci, R.A.
1990-01-01
This book presents a framework for understanding the tradeoffs between the conventional view and the dataflow view with the objective of discovering the critical hardware structures which must be present in any scalable, general-purpose parallel computer to effectively tolerate latency and synchronization costs. The author presents an approach to scalable general-purpose parallel computation. Linguistic concerns, compiling issues, intermediate language issues, and hardware/technological constraints are presented as a combined approach to architectural development. This book presents the notion of a parallel machine language.
Joseph, D.D.; Bai, R.; Liao, T.Y.; Huang, A.; Hu, H.H.
1995-09-01
In this paper the authors introduce the idea of parallel pipelining for water lubricated transportation of oil (or other viscous material). A parallel system can have major advantages over a single pipe with respect to the cost of maintenance and continuous operation of the system, to the pressure gradients required to restart a stopped system and to the reduction and even elimination of the fouling of pipe walls in continuous operation. The authors show that the action of capillarity in small pipes is more favorable for restart than in large pipes. In a parallel pipeline system, they estimate the number of small pipes needed to deliver the same oil flux as in one larger pipe as N = (R/r)^α, where r and R are the radii of the small and large pipes, respectively, and α = 4 or 19/7 when the lubricating water flow is laminar or turbulent.
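The pipe-count estimate quoted in this abstract, N equal to (R/r) raised to the exponent 4 (laminar) or 19/7 (turbulent), is easy to evaluate directly. The short sketch below does so; the function name and the `turbulent` flag are illustrative choices, and the result is the raw estimate, which in practice would be rounded up to a whole number of pipes.

```python
def pipes_needed(R, r, turbulent=False):
    """Estimate how many small pipes of radius r deliver the same oil flux
    as one pipe of radius R, using N = (R / r) ** alpha from the abstract."""
    # Exponent quoted in the abstract: 4 for laminar, 19/7 for turbulent
    # lubricating water flow.
    alpha = 19.0 / 7.0 if turbulent else 4.0
    return (R / r) ** alpha
```

For example, halving the radius (R/r = 2) gives 16 pipes in the laminar case and between 6 and 7 in the turbulent case, so the penalty for using small pipes is much milder when the water flow is turbulent.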
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
On finding minimum-diameter clique trees
Blair, J.R.S. (Dept. of Computer Science); Peyton, B.W.
1991-08-01
It is well-known that any chordal graph can be represented as a clique tree (acyclic hypergraph, join tree). Since some chordal graphs have many distinct clique tree representations, it is interesting to consider which one is most desirable under various circumstances. A clique tree of minimum diameter (or height) is sometimes a natural candidate when choosing clique trees to be processed in a parallel computing environment. This paper introduces a linear time algorithm for computing a minimum-diameter clique tree. The new algorithm is an analogue of the natural greedy algorithm for rooting an ordinary tree in order to minimize its height. It has potential application in the development of parallel algorithms for both knowledge-based systems and the solution of sparse linear systems of equations. 31 refs., 7 figs.
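The abstract compares its method to the greedy algorithm for rooting an ordinary tree so as to minimize its height. A minimal sketch of that ordinary-tree algorithm (not the paper's clique-tree algorithm) is given below: leaves are stripped layer by layer, and the one or two vertices left at the end are the best roots. The function name and the edge-list input format are assumptions made for illustration.

```python
def min_height_roots(n, edges):
    """Return the one or two vertices of a tree on vertices 0..n-1 that
    minimize the tree's height when chosen as the root."""
    if n == 1:
        return [0]
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = [v for v in range(n) if len(adj[v]) == 1]
    remaining = n
    # Peel off the current layer of leaves until at most two vertices remain.
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            parent = adj[leaf].pop()
            adj[parent].remove(leaf)
            if len(adj[parent]) == 1:
                next_leaves.append(parent)
        leaves = next_leaves
    return sorted(leaves)
```

Each vertex and edge is handled a constant number of times, so the running time is linear in the size of the tree, which matches the complexity the abstract claims for its clique-tree analogue.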
NASA Technical Reports Server (NTRS)
Martensen, Anna L.; Butler, Ricky W.
1987-01-01
The Fault Tree Compiler Program is a new reliability tool used to predict the top event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N gates. The high level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precise (within the limits of double precision floating point arithmetic) to five digits. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Equipment Corporation VAX with the VMS operating system.
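To make the five gate types concrete, the sketch below evaluates the top event probability of a small fault tree. It is an illustration under stated assumptions, not the Fault Tree Compiler's solution technique or input language: the nested-tuple representation and the function name are hypothetical, all gate inputs are assumed statistically independent (which fails if one basic event feeds several branches), and the EXCLUSIVE OR gate is assumed to take exactly two inputs.

```python
def top_event_probability(node):
    """Evaluate a fault tree given as nested tuples, assuming independent inputs.

    A node is either a number (basic event probability), or
    ("AND" | "OR" | "XOR" | "INVERT", child, ...), or ("MOFN", m, child, ...).
    """
    if isinstance(node, (int, float)):
        return float(node)
    gate, args = node[0], node[1:]
    if gate == "MOFN":
        m, args = args[0], args[1:]
    probs = [top_event_probability(a) for a in args]
    if gate == "AND":  # all inputs occur
        result = 1.0
        for p in probs:
            result *= p
        return result
    if gate == "OR":  # at least one input occurs
        none = 1.0
        for p in probs:
            none *= 1.0 - p
        return 1.0 - none
    if gate == "XOR":  # exactly one of two inputs occurs
        a, b = probs
        return a * (1.0 - b) + b * (1.0 - a)
    if gate == "INVERT":  # the single input does not occur
        (a,) = probs
        return 1.0 - a
    if gate == "MOFN":
        # dist[k] = probability that exactly k of the inputs seen so far occur
        dist = [1.0]
        for p in probs:
            new = [0.0] * (len(dist) + 1)
            for k, q in enumerate(dist):
                new[k] += q * (1.0 - p)
                new[k + 1] += q * p
            dist = new
        return sum(dist[m:])  # at least m inputs occur
    raise ValueError(f"unknown gate: {gate}")
```

A tree is written bottom-up, for instance `("OR", ("AND", 0.5, 0.5), 0.5)` for a top event that occurs if both of two components fail or a third one fails.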
Parallel Power Grid Simulation Toolkit
Smith, Steve; Kelley, Brian; Banks, Lawrence; Top, Philip; Woodward, Carol
2015-09-14
ParGrid is a 'wrapper' that integrates a coupled Power Grid Simulation toolkit consisting of a library to manage the synchronization and communication of independent simulations. The library code included in ParGrid, named FSKIT, is intended to support the coupling of multiple continuous and discrete-event parallel simulations. The code is designed using modern object-oriented C++ methods, utilizing C++11 and current Boost libraries to ensure compatibility with multiple operating systems and environments.
Parallel contingency statistics with Titan.
Thompson, David C.; Pebay, Philippe Pierre
2009-09-01
This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09], which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engine is illustrated by means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevents this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and studies performance with up to 200 processors.
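The scaling limitation mentioned in this abstract can be illustrated with a generic sketch of how a contingency table is typically built in parallel: each process counts the value pairs in its own share of the data, and the local tables are then merged by summing counts. This is not the VTK/Titan API; the function names are hypothetical and the "processes" are simulated serially with plain lists.

```python
from collections import Counter
from functools import reduce

def local_table(pairs):
    """Count the (x, y) value pairs seen by one process."""
    return Counter(pairs)

def merge_tables(tables):
    """Combine per-process tables into the global contingency table."""
    return reduce(lambda a, b: a + b, tables, Counter())

# Hypothetical data split across three processes.
chunks = [
    [("a", 1), ("b", 2)],
    [("a", 1), ("a", 2)],
    [("b", 2)],
]
global_table = merge_tables(local_table(chunk) for chunk in chunks)
```

Unlike a mean or a covariance matrix, whose size is fixed, the merged table grows with the number of distinct value pairs, so the data exchanged in the merge step can become large. That is one plausible reading of why such an engine scales less well than the descriptive and correlative ones.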
Parallelized modelling and solution scheme for hierarchically scaled simulations
NASA Technical Reports Server (NTRS)
Padovan, Joe
1995-01-01
This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers is capable of solving are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that, by combining several of these techniques, a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.
Foster, I.; Tuecke, S.
1993-01-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.
Morozov, Dmitriy; Weber, Gunther
2013-01-08
Improved simulations and sensors are producing datasets whose increasing complexity exhausts our ability to visualize and comprehend them directly. To cope with this problem, we can detect and extract significant features in the data and use them as the basis for subsequent analysis. Topological methods are valuable in this context because they provide robust and general feature definitions. As the growth of serial computational power has stalled, data analysis is becoming increasingly dependent on massively parallel machines. To satisfy the computational demand created by complex datasets, algorithms need to effectively utilize these computer architectures. The main strength of topological methods, their emphasis on global information, turns into an obstacle during parallelization. We present two approaches to alleviate this problem. We develop a distributed representation of the merge tree that avoids computing the global tree on a single processor and lets us parallelize subsequent queries. To account for the increasing number of cores per processor, we develop a new data structure that lets us take advantage of multiple shared-memory cores to parallelize the work on a single node. Finally, we present experiments that illustrate the strengths of our approach as well as help identify future challenges.
Fault trees and sequence dependencies
NASA Technical Reports Server (NTRS)
Dugan, Joanne Bechta; Boyd, Mark A.; Bavuso, Salvatore J.
1990-01-01
One of the frequently cited shortcomings of fault-tree models, their inability to model so-called sequence dependencies, is discussed. Several sources of such sequence dependencies are discussed, and new fault-tree gates to capture this behavior are defined. These complex behaviors can be included in present fault-tree models because they utilize a Markov solution. The utility of the new gates is demonstrated by presenting several models of the fault-tolerant parallel processor, which include both hot and cold spares.
DeHart, Mark D; Williams, Mark L; Bowman, Stephen M
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
Parallel adaptive wavelet collocation method for PDEs
Nejadmalayeri, Alireza; Vezolainen, Alexei; Brown-Dymkoski, Eric; Vasilyev, Oleg V.
2015-10-01
A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allows fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048^3 using as many as 2048 CPU cores.
Parallel adaptive wavelet collocation method for PDEs
NASA Astrophysics Data System (ADS)
Nejadmalayeri, Alireza; Vezolainen, Alexei; Brown-Dymkoski, Eric; Vasilyev, Oleg V.
2015-10-01
A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allows fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048^3 using as many as 2048 CPU cores.
Clinical coding. Code breakers.
Mathieson, Steve
2005-02-24
--The advent of payment by results has seen the role of the clinical coder pushed to the fore in England. --Examinations for a clinical coding qualification began in 1999. In 2004, approximately 200 people took the qualification. --Trusts are attracting people to the role by offering training from scratch or through modern apprenticeships.
Parallel implicit Monte Carlo in C++
Urbatsch, T.J.; Evans, T.M.
1998-12-31
The authors are developing a parallel C++ Implicit Monte Carlo code in the Draco framework. As a background and motivation for the parallelization strategy, they first present three basic parallelization schemes. They use three hypothetical examples, mimicking the memory constraints of the real world, to examine characteristics of the basic schemes. Next, they present a two-step scheme proposed by Lawrence Livermore National Laboratory (LLNL). The two-step parallelization scheme they develop is based upon LLNL's two-step scheme. The two-step scheme appears to have greater potential compared to the basic schemes and LLNL's two-step scheme. Lastly, they explain the code design and describe how the functionality of C++ and the Draco framework assist the development of a parallel code.
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2015-11-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
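For readers unfamiliar with the inverse iterations algorithm mentioned in this abstract, a minimal serial sketch follows. It is not MARS code: it assumes a small real symmetric matrix and a standard (not generalized) eigenvalue problem, and the names `solve` and `inverse_iteration` and the test matrix are hypothetical. Each iteration solves a shifted linear system, which is the step a parallel solver library would take over in a parallel version.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def inverse_iteration(a, shift, iters=50):
    """Return (eigenvalue, eigenvector) of symmetric `a` nearest to `shift`."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = solve(shifted, v)                         # w = (A - shift I)^-1 v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, av)), v       # Rayleigh quotient

a = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
lam, vec = inverse_iteration(a, shift=1.0)
print(lam)
```

The eigenvalues of this test matrix are 3 - sqrt(3), 3 and 3 + sqrt(3), so with a shift of 1.0 the iteration converges to 3 - sqrt(3).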
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2014-10-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2013-10-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Preliminary results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
Computer-Aided Parallelizer and Optimizer
NASA Technical Reports Server (NTRS)
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
Treating a User-Defined Parallel Library as a Domain-Specific Language
Quinlan, D; Miller, B; Schordan, M; Philip, B
2001-11-19
An important purpose of a programming language is to insulate the programmer from low level details and provide a high enough level of abstraction to be productive and develop reasonably portable application codes. For these reasons scientific programming is no longer done using assembly language. But high performance of scientific applications often requires that critical sections of code be expressed at a particularly low level to avoid inefficiencies introduced by the compiler (function call overhead, poor cache use, etc.). The use of high-level abstractions exacerbates this problem since the compiler is often unable to generate the equivalent low-level code required for good performance. The result is often significantly degraded performance. Libraries provide a way for domain specific knowledge to be developed for large numbers of users. Libraries thus simplify the development of many application codes and the work spent building libraries can be amortized across large numbers of applications and application developers. Such a hierarchy puts languages and compilers at the root of a tree of abstractions developed within numerous libraries at one level and numerous applications at a second level. Libraries provide a way to define high-level abstractions. We have developed specific libraries to simplify the development of serial and parallel scientific applications. The A++/P++ libraries provide an essential array abstraction for C++ scientific applications. The effect is to provide a single array abstraction that permits the development of serial code (using A++). The serial application code using the array abstractions need only be recompiled (using P++) to run on parallel distributed memory machines. The resulting abstractions are simple and powerful since they simplify serial application code and even completely hide parallel details. But since it operates as a library the compiler is oblivious to its semantics and likewise the library is oblivious to the context
ERIC Educational Resources Information Center
Tolman, Marvin
2005-01-01
Students love outdoor activities and will love them even more when they build confidence in their tree identification and measurement skills. Through these activities, students will learn to identify the major characteristics of trees and discover how the pace--a nonstandard measuring unit--can be used to estimate not only distances but also the…
ERIC Educational Resources Information Center
Center for Environmental Study, Grand Rapids, MI.
Tree Amigos is a special cross-cultural program that uses trees as a common bond to bring the people of the Americas together in unique partnerships to preserve and protect the shared global environment. It is a tangible program that embodies the philosophy that individuals, acting together, can make a difference. This resource book contains…
Parallel evolution of KCNQ4 in echolocating bats.
Liu, Zhen; Li, Shude; Wang, Wei; Xu, Dongming; Murphy, Robert W; Shi, Peng
2011-01-01
High-frequency hearing is required for echolocating bats to locate, range and identify objects, yet little is known about its molecular basis. The discovery of a high-frequency hearing-related gene, KCNQ4, provides an opportunity to address this question. Here, we obtain the coding regions of KCNQ4 from 15 species of bats, including echolocating bats that have higher frequency hearing and non-echolocating bats that have the same ability as most other species of mammals. The strongly supported protein-tree resolves a monophyletic group containing all bats with higher frequency hearing and this arrangement conflicts with the phylogeny of bats in which these species are paraphyletic. We identify five parallel evolved sites in echolocating bats belonging to both suborders. The evolutionary trajectories of the parallel sites suggest the independent gain of higher frequency hearing ability in echolocating bats. This study highlights the usefulness of convergent or parallel evolutionary studies for finding phenotype-related genes and contributing to the resolution of evolutionary problems.
REBOUND: an open-source multi-purpose N-body code for collisional dynamics
NASA Astrophysics Data System (ADS)
Rein, H.; Liu, S.-F.
2012-01-01
REBOUND is a new multi-purpose N-body code which is freely available under an open-source license. It was designed for collisional dynamics such as planetary rings but can also solve the classical N-body problem. It is highly modular and can be customized easily to work on a wide variety of different problems in astrophysics and beyond. REBOUND comes with three symplectic integrators: leap-frog, the symplectic epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open, periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut tree to calculate both self-gravity and collisions. These modules are fully parallelized with MPI as well as OpenMP. The former makes use of a static domain decomposition and a distributed essential tree. Two new collision detection modules based on a plane-sweep algorithm are also implemented. The performance of the plane-sweep algorithm is superior to a tree code for simulations in which one dimension is much longer than the other two and in simulations which are quasi-two dimensional with less than one million particles. In this work, we discuss the different algorithms implemented in REBOUND, the philosophy behind the code's structure as well as implementation specific details of the different modules. We present results of accuracy and scaling tests which show that the code can run efficiently on both desktop machines and large computing clusters.
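As an illustration of the simplest of the symplectic integrators named in this abstract, the sketch below implements a generic leap-frog step for a test particle orbiting a unit point mass. It is not REBOUND code and does not use the REBOUND API: the kick-drift-kick ordering, the units (G = M = 1), the step size, and all function names are assumptions chosen for the example.

```python
import math

def acceleration(pos):
    """Acceleration of a test particle around a unit point mass (G = M = 1)."""
    x, y = pos
    r3 = (x * x + y * y) ** 1.5
    return (-x / r3, -y / r3)

def leapfrog_step(pos, vel, dt):
    """One kick-drift-kick leap-frog step."""
    ax, ay = acceleration(pos)
    vx, vy = vel[0] + 0.5 * dt * ax, vel[1] + 0.5 * dt * ay  # half kick
    x, y = pos[0] + dt * vx, pos[1] + dt * vy                # full drift
    ax, ay = acceleration((x, y))
    return (x, y), (vx + 0.5 * dt * ax, vy + 0.5 * dt * ay)  # half kick

def energy(pos, vel):
    """Specific orbital energy: kinetic plus potential."""
    return 0.5 * (vel[0] ** 2 + vel[1] ** 2) - 1.0 / math.hypot(*pos)

pos, vel = (1.0, 0.0), (0.0, 1.0)  # circular orbit of radius 1
e0 = energy(pos, vel)
for _ in range(10000):
    pos, vel = leapfrog_step(pos, vel, 0.01)
e1 = energy(pos, vel)
print(e0, e1)
```

The property of interest is that the energy error stays bounded over many orbits instead of drifting, which is why symplectic schemes are the usual choice for long N-body integrations.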
Parallel programming of industrial applications
Heroux, M; Koniges, A; Simon, H
1998-07-21
In the introductory material, we overview the typical MPP environment for real application computing and the special tools available such as parallel debuggers and performance analyzers. Next, we draw from a series of real applications codes and discuss the specific challenges and problems that are encountered in parallelizing these individual applications. The application areas drawn from include biomedical sciences, materials processing and design, plasma and fluid dynamics, and others. We show how it was possible to get a particular application to run efficiently and what steps were necessary. Finally we end with a summary of the lessons learned from these applications and predictions for the future of industrial parallel computing. This tutorial is based on material from a forthcoming book entitled: "Industrial Strength Parallel Computing" to be published by Morgan Kaufmann Publishers (ISBN l-55860-54).
Efficiency of parallel direct optimization
NASA Technical Reports Server (NTRS)
Janies, D. A.; Wheeler, W. C.
2001-01-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.
Parallel processing of natural language
Chang, H.O.
1986-01-01
Two types of parallel natural language processing are studied in this work: (1) the parallelism between syntactic and nonsyntactic processing and (2) the parallelism within syntactic processing. It is recognized that a syntactic category can potentially be attached to more than one node in the syntactic tree of a sentence. Even if all the attachments are syntactically well-formed, nonsyntactic factors such as semantic and pragmatic consideration may require one particular attachment. Syntactic processing must synchronize and communicate with nonsyntactic processing. Two syntactic processing algorithms are proposed for use in a parallel environment: Earley's algorithm and the LR(k) algorithm. Conditions are identified to detect the syntactic ambiguity and the algorithms are augmented accordingly. It is shown that by using nonsyntactic information during syntactic processing, backtracking can be reduced, and the performance of the syntactic processor is improved. For the second type of parallelism, it is recognized that one portion of a grammar can be isolated from the rest of the grammar and be processed by a separate processor. A partial grammar of a larger grammar is defined. Parallel syntactic processing is achieved by using two processors concurrently: the main processor (mp) and the auxiliary processor (ap).
Efficiency of parallel direct optimization.
Janies, D A; Wheeler, W C
2001-03-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size.
Hybrid parallel programming with MPI and Unified Parallel C.
Dinan, J.; Balaji, P.; Lusk, E.; Sadayappan, P.; Thakur, R.; Mathematics and Computer Science; The Ohio State Univ.
2010-01-01
The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33 and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
Parallel NPARC: Implementation and Performance
NASA Technical Reports Server (NTRS)
Townsend, S. E.
1996-01-01
Version 3 of the NPARC Navier-Stokes code includes support for large-grain (block level) parallelism using explicit message passing between a heterogeneous collection of computers. This capability has the potential for significant performance gains, depending upon the block data distribution. The parallel implementation uses a master/worker arrangement of processes. The master process assigns blocks to workers, controls worker actions, and provides remote file access for the workers. The processes communicate via explicit message passing using an interface library which provides portability to a number of message passing libraries, such as PVM (Parallel Virtual Machine). A Bourne shell script is used to simplify the task of selecting hosts, starting processes, retrieving remote files, and terminating a computation. This script also provides a simple form of fault tolerance. An analysis of the computational performance of NPARC is presented, using data sets from an F/A-18 inlet study and a Rocket Based Combined Cycle Engine analysis. Parallel speedup and overall computational efficiency were obtained for various NPARC run parameters on a cluster of IBM RS6000 workstations. The data show that although NPARC performance compares favorably with the estimated potential parallelism, typical data sets used with previous versions of NPARC will often need to be reblocked for optimum parallel performance. In one of the cases studied, reblocking increased peak parallel speedup from 3.2 to 11.8.
Parallel multiscale simulations of a brain aneurysm.
Grinberg, Leopold; Fedosov, Dmitry A; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2012-01-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future
Parallel multiscale simulations of a brain aneurysm
NASA Astrophysics Data System (ADS)
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver NɛκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NɛκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future
Phonological coding during reading
Leinenger, Mallorie
2014-01-01
The exact role that phonological coding (the recoding of written, orthographic information into a sound based code) plays during silent reading has been extensively studied for more than a century. Despite the large body of research surrounding the topic, varying theories as to the time course and function of this recoding still exist. The present review synthesizes this body of research, addressing the topics of time course and function in tandem. The varying theories surrounding the function of phonological coding (e.g., that phonological codes aid lexical access, that phonological codes aid comprehension and bolster short-term memory, or that phonological codes are largely epiphenomenal in skilled readers) are first outlined, and the time courses that each maps onto (e.g., that phonological codes come online early (pre-lexical) or that phonological codes come online late (post-lexical)) are discussed. Next the research relevant to each of these proposed functions is reviewed, discussing the varying methodologies that have been used to investigate phonological coding (e.g., response time methods, reading while eyetracking or recording EEG and MEG, concurrent articulation) and highlighting the advantages and limitations of each with respect to the study of phonological coding. In response to the view that phonological coding is largely epiphenomenal in skilled readers, research on the use of phonological codes in prelingually, profoundly deaf readers is reviewed. Finally, implications for current models of word identification (activation-verification model (Van Orden, 1987), dual-route model (e.g., Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001), parallel distributed processing model (Seidenberg & McClelland, 1989)) are discussed. PMID:25150679
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1994-01-01
The multiblock reacting Navier-Stokes flow-solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1997-01-01
The multiblock reacting Navier-Stokes flow solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Embedded foveation image coding.
Wang, Z; Bovik, A C
2001-01-01
The human visual system (HVS) is highly space-variant in sampling, coding, processing, and understanding. The spatial resolution of the HVS is highest around the point of fixation (foveation point) and decreases rapidly with increasing eccentricity. By taking advantage of this fact, it is possible to remove considerable high-frequency information redundancy from the peripheral regions and still reconstruct a perceptually good quality image. Great success has been obtained previously by a class of embedded wavelet image coding algorithms, such as the embedded zerotree wavelet (EZW) and the set partitioning in hierarchical trees (SPIHT) algorithms. Embedded wavelet coding not only provides very good compression performance, but also has the property that the bitstream can be truncated at any point and still be decoded to recreate a reasonably good quality image. In this paper, we propose an embedded foveation image coding (EFIC) algorithm, which orders the encoded bitstream to optimize foveated visual quality at arbitrary bit-rates. A foveation-based image quality metric, namely, foveated wavelet image quality index (FWQI), plays an important role in the EFIC system. We also developed a modified SPIHT algorithm to improve the coding efficiency. Experiments show that EFIC integrates foveation filtering with foveated image coding and demonstrates very good coding performance and scalability in terms of foveated image quality measurement.
Code Disentanglement: Initial Plan
Wohlbier, John Greaton; Kelley, Timothy M.; Rockefeller, Gabriel M.; Calef, Matthew Thomas
2015-01-27
The first step to making more ambitious changes in the EAP code base is to disentangle the code into a set of independent, levelized packages. We define a package as a collection of code, most often across a set of files, that provides a defined set of functionality; a package a) can be built and tested as an entity and b) fits within an overall levelization design. Each package contributes one or more libraries, or an application that uses the other libraries. A package set is levelized if the relationships between packages form a directed, acyclic graph and each package uses only packages at lower levels of the diagram (in Fortran this relationship is often describable by the use relationship between modules). Independent packages permit independent, and therefore parallel, development. The packages form separable units for the purposes of development and testing. This is a proven path for enabling finer-grained changes to a complex code.
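The levelization condition in the abstract above (a directed acyclic "uses" graph in which each package depends only on lower levels) can be checked mechanically. The following is a minimal Python sketch, not the EAP team's tooling: the function name `levelize` and the package names in the usage note are hypothetical, and the level rule (one above the highest package used) is an assumed reading of "lower levels of the diagram".

```python
from graphlib import TopologicalSorter, CycleError


def levelize(uses):
    """Assign a level to every package, or raise ValueError on a cycle.

    `uses` maps a package name to the set of packages it uses directly.
    Packages that use nothing sit at level 0; every other package sits one
    level above the highest package it uses (an assumed convention).
    """
    try:
        # static_order() yields each package after all packages it uses.
        order = list(TopologicalSorter(uses).static_order())
    except CycleError as err:
        raise ValueError(f"not levelizable, cycle found: {err.args[1]}") from err

    levels = {}
    for pkg in order:
        deps = uses.get(pkg, ())
        levels[pkg] = 1 + max((levels[d] for d in deps), default=-1)
    return levels
```

For example, with hypothetical packages, `levelize({"app": {"hydro", "io"}, "hydro": {"mesh"}, "io": {"mesh"}, "mesh": set()})` places `mesh` at level 0, `hydro` and `io` at level 1, and `app` at level 2, while a mutual dependency between two packages is rejected.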
Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study
Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...
2015-01-01
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
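To illustrate why swapping a sequential summation for a binary tree pattern helps scaling, here is a small serial Python simulation. It is only a sketch of the general technique, not the authors' coarray Fortran code; the function name `tree_sum` is hypothetical, and each list element stands in for the partial sum held by one image.

```python
def tree_sum(partials):
    """Sum one partial value per image using a binary-tree combine pattern.

    Returns (total, rounds). In each round, image i adds in the value held
    by image i + stride. All additions within a round touch disjoint pairs,
    so on real hardware they could proceed concurrently.
    """
    vals = list(partials)
    n = len(vals)
    stride, rounds = 1, 0
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        rounds += 1
    return (vals[0] if vals else 0), rounds
```

With 32 partial sums, the tree pattern finishes in 5 rounds, whereas a sequential accumulation onto one image needs 31 dependent additions; the round count grows as the base-2 logarithm of the number of images.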
Parallel Implicit Algorithms for CFD
NASA Technical Reports Server (NTRS)
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD). "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in the Message Passing Interface (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.
2011-01-01
Background There are two components to the clinical efficacy of pediculicides: (i) efficacy against the crawling-stages (lousicidal efficacy); and (ii) efficacy against the eggs (ovicidal efficacy). Lousicidal efficacy and ovicidal efficacy are confounded in clinical trials. Here we report on a trial that was specially designed to rank the clinical ovicidal efficacy of pediculicides. Eggs were collected, pre-treatment and post-treatment, from subjects with different types of hair, different coloured hair and hair of different length. Method Subjects with at least 20 live eggs of Pediculus capitis (head lice) were randomised to one of three treatment-groups: a melaleuca oil (commonly called tea tree oil) and lavender oil pediculicide (TTO/LO); a eucalyptus oil and lemon tea tree oil pediculicide (EO/LTTO); or a "suffocation" pediculicide. Pre-treatment: 10 to 22 live eggs were taken from the head by cutting the single hair with the live egg attached, before the treatment (total of 1,062 eggs). Treatment: The subjects then received a single treatment of one of the three pediculicides, according to the manufacturers' instructions. Post-treatment: 10 to 41 treated live eggs were taken from the head by cutting the single hair with the egg attached (total of 1,183 eggs). Eggs were incubated for 14 days. The proportion of eggs that had hatched after 14 days in the pre-treatment group was compared with the proportion of eggs that hatched in the post-treatment group. The primary outcome measure was % ovicidal efficacy for each of the three pediculicides. Results 722 subjects were examined for the presence of eggs of head lice. 92 of these subjects were recruited and randomly assigned to: the "suffocation" pediculicide (n = 31); the melaleuca oil and lavender oil pediculicide (n = 31); and the eucalyptus oil and lemon tea tree oil pediculicide (n = 30 subjects). The group treated with eucalyptus oil and lemon tea tree oil had an ovicidal efficacy of 3.3% (SD 16%) whereas the
Barker, Stephen C; Altman, Phillip M
2011-08-24
There are two components to the clinical efficacy of pediculicides: (i) efficacy against the crawling-stages (lousicidal efficacy); and (ii) efficacy against the eggs (ovicidal efficacy). Lousicidal efficacy and ovicidal efficacy are confounded in clinical trials. Here we report on a trial that was specially designed to rank the clinical ovicidal efficacy of pediculicides. Eggs were collected, pre-treatment and post-treatment, from subjects with different types of hair, different coloured hair and hair of different length. Subjects with at least 20 live eggs of Pediculus capitis (head lice) were randomised to one of three treatment-groups: a melaleuca oil (commonly called tea tree oil) and lavender oil pediculicide (TTO/LO); a eucalyptus oil and lemon tea tree oil pediculicide (EO/LTTO); or a "suffocation" pediculicide. Pre-treatment: 10 to 22 live eggs were taken from the head by cutting the single hair with the live egg attached, before the treatment (total of 1,062 eggs). The subjects then received a single treatment of one of the three pediculicides, according to the manufacturers' instructions. Post-treatment: 10 to 41 treated live eggs were taken from the head by cutting the single hair with the egg attached (total of 1,183 eggs). Eggs were incubated for 14 days. The proportion of eggs that had hatched after 14 days in the pre-treatment group was compared with the proportion of eggs that hatched in the post-treatment group. The primary outcome measure was % ovicidal efficacy for each of the three pediculicides. 722 subjects were examined for the presence of eggs of head lice. 92 of these subjects were recruited and randomly assigned to: the "suffocation" pediculicide (n = 31); the melaleuca oil and lavender oil pediculicide (n = 31); and the eucalyptus oil and lemon tea tree oil pediculicide (n = 30 subjects). The group treated with eucalyptus oil and lemon tea tree oil had an ovicidal efficacy of 3.3% (SD 16%) whereas the group treated with melaleuca oil and
Parallel auto-correlative statistics with VTK.
Pebay, Philippe Pierre; Bennett, Janine Camille
2013-08-01
This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by means of C++ code snippets, and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the auto-correlative statistics engine.
High performance parallel implicit CFD.
Gropp, W. D.; Kaushik, D. K.; Keyes, D. E.; Smith, B. F.; Mathematics and Computer Science; Old Dominion Univ.
2001-03-01
Fluid dynamical simulations based on finite discretizations on (quasi-)static grids scale well in parallel, but execute at a disappointing percentage of per-processor peak floating point operation rates without special attention to layout and access ordering of data. We document both claims from our experience with an unstructured grid CFD code that is typical of the state of the practice at NASA. These basic performance characteristics of PDE-based codes can be understood with surprisingly simple models, for which we quote earlier work, presenting primarily experimental results. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per node performance. This snapshot of ongoing work updates our 1999 Bell Prize-winning simulation on ASCI computers.
Low Density Parity Check Codes: Bandwidth Efficient Channel Coding
NASA Technical Reports Server (NTRS)
Fong, Wai; Lin, Shu; Maki, Gary; Yeh, Pen-Shu
2003-01-01
Low Density Parity Check (LDPC) Codes provide near-Shannon Capacity performance for NASA Missions. These codes have high coding rates R=0.82 and 0.875 with moderate code lengths, n=4096 and 8176. Their decoders have inherently parallel structures which allow for high-speed implementation. Two codes based on Euclidean Geometry (EG) were selected for flight ASIC implementation. These codes are cyclic and quasi-cyclic in nature and therefore have a simple encoder structure. This results in power and size benefits. These codes also have a large minimum distance, as much as d_min = 65, giving them powerful error correcting capabilities and error floors less than 10^-[exponent missing] BER. This paper will present development of the LDPC flight encoder and decoder, its applications and status.
Parallel pivoting combined with parallel reduction
NASA Technical Reports Server (NTRS)
Alaghband, Gita
1987-01-01
Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.
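The Markowitz number mentioned in the abstract above is commonly defined, for a candidate pivot at row i and column j, as (r_i - 1)(c_j - 1), where r_i and c_j are the nonzero counts of that row and column in the active submatrix; it bounds the fill-in the pivot can create. The Python sketch below uses that textbook definition, which may differ in detail from the paper's variant. It looks only at the sparsity pattern and ignores the numerical stability check and the pivot compatibility test that the abstract also describes. The function names are hypothetical.

```python
from collections import Counter


def markowitz_counts(pattern):
    """Return {(i, j): (r_i - 1) * (c_j - 1)} for every nonzero position.

    `pattern` is a set of (row, column) index pairs marking the nonzeros
    of the active submatrix; numerical values are deliberately ignored.
    """
    rows = Counter(i for i, _ in pattern)
    cols = Counter(j for _, j in pattern)
    return {(i, j): (rows[i] - 1) * (cols[j] - 1) for (i, j) in pattern}


def best_pivot(pattern):
    """Pick the position with the smallest Markowitz count (ties by index)."""
    counts = markowitz_counts(pattern)
    return min(counts, key=lambda ij: (counts[ij], ij))
```

On a small arrow-shaped pattern with a full first row and first column, the dense corner entry gets the largest count, so the sketch prefers a pivot from the sparse part of the matrix, which is the behaviour that keeps fill-in low.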
Springer, Mark S; Gatesy, John
2016-01-01
Higher-level relationships among placental mammals are mostly resolved, but several polytomies remain contentious. Song et al. (2012) claimed to have resolved three of these using shortcut coalescence methods (MP-EST, STAR) and further concluded that these methods, which assume no within-locus recombination, are required to unravel deep-level phylogenetic problems that have stymied concatenation. Here, we reanalyze Song et al.'s (2012) data and leverage these re-analyses to explore key issues in systematics including the recombination ratchet, gene tree stoichiometry, the proportion of gene tree incongruence that results from deep coalescence versus other factors, and simulations that compare the performance of coalescence and concatenation methods in species tree estimation. Song et al. (2012) reported an average locus length of 3.1 kb for the 447 protein-coding genes in their phylogenomic dataset, but the true mean length of these loci (start codon to stop codon) is 139.6 kb. Empirical estimates of recombination breakpoints in primates, coupled with consideration of the recombination ratchet, suggest that individual coalescence genes (c-genes) approach ∼12 bp or less for Song et al.'s (2012) dataset, three to four orders of magnitude shorter than the c-genes reported by these authors. This result has general implications for the application of coalescence methods in species tree estimation. We contend that it is illogical to apply coalescence methods to complete protein-coding sequences. Such analyses amalgamate c-genes with different evolutionary histories (i.e., exons separated by >100,000 bp), distort true gene tree stoichiometry that is required for accurate species tree inference, and contradict the central rationale for applying coalescence methods to difficult phylogenetic problems. In addition, Song et al.'s (2012) dataset of 447 genes includes 21 loci with switched taxonomic names, eight duplicated loci, 26 loci with non-homologous sequences that are
Data-Parallel Halo Finder Operator in PISTON
Widanagamaachchi, W. N.
2012-08-01
PISTON is a portable framework which supports the development of visualization and analysis operators using a platform-independent, data-parallel programming model. Operators such as isosurface, cut-surface and threshold have been implemented in this framework, with the exact same operator code achieving good parallel performance on different architectures. An important analysis operator in cosmology is the halo finder. A halo is a cluster of particles and is considered a common feature of interest found in cosmology data. As the number of cosmological simulations carried out in the recent past has increased, the resultant data of these simulations and the required analysis tasks have increased as well. As a consequence, there is a need to develop scalable and efficient tools to carry out the needed analysis. Therefore, we are currently implementing a halo finder operator using PISTON. Researchers have developed a wide variety of techniques to identify halos in raw particle data. The most basic algorithm is the friend-of-friends (FOF) halo finder, where the particles are clustered based on two parameters: linking length and halo size. In a FOF halo finder, all particles which lie within the linking length are considered as one halo and the halos are filtered based on the halo size parameter. A naive implementation of a FOF halo finder compares each and every particle pair, requiring O(n^2) operations. Our data-parallel halo finder operator uses a balanced k-d tree to reduce this number of operations in the average case, and implements the algorithm using only the data-parallel primitives in order to achieve portability and performance.
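As a point of reference for the naive friend-of-friends approach described above, the following Python sketch compares every particle pair and merges linked particles with a union-find structure. It is a serial baseline under stated assumptions, not the PISTON operator: it does not use a k-d tree or data-parallel primitives, and the function name `fof_halos` and its parameter names are hypothetical.

```python
from itertools import combinations
from math import dist


def fof_halos(points, linking_length, min_size):
    """Naive friends-of-friends: all-pairs comparison plus union-find.

    Particles closer than `linking_length` are linked; linked groups with
    at least `min_size` members are returned as sorted lists of indices.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # The quadratic pair loop is the cost a spatial tree is meant to reduce.
    for a, b in combinations(range(len(points)), 2):
        if dist(points[a], points[b]) <= linking_length:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= min_size)
```

Note that particles 0 and 2 in a chain 0-1-2 end up in the same halo even if they are farther apart than the linking length, because membership is transitive through particle 1.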
ERIC Educational Resources Information Center
National Audubon Society, New York, NY.
Included are an illustrated student reader, "The Story of Trees," a leaders' guide, and a large tree chart with 37 colored pictures. The student reader reviews several aspects of trees: a definition of a tree; where and how trees grow; flowers, pollination and seed production; how trees make their food; how to recognize trees; seasonal changes;…
Parallelization of ARC3D with Computer-Aided Tools
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, CAPTools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably good performance, for example, a speedup of 30 on 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts of the automated parallelization process, although user intervention in many of these parts is still necessary. Nevertheless, development and improvement of useful software tools, such as CAPTools, can help trim down many tedious parallelization details and improve the processing efficiency.
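The reported figure of a speedup of 30 on 36 processors can be put in context with two standard measures: parallel efficiency (speedup divided by processor count) and the Karp-Flatt metric, an experimentally determined serial fraction. The Python sketch below is an illustration added here, not an analysis from the report; the function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def karp_flatt(speedup, processors):
    """Experimentally determined serial fraction (Karp-Flatt metric).

    e = (1/S - 1/P) / (1 - 1/P), defined for P > 1.
    """
    if processors <= 1:
        raise ValueError("Karp-Flatt metric needs more than one processor")
    return (1 / speedup - 1 / processors) / (1 - 1 / processors)
```

For a speedup of 30 on 36 processors these give an efficiency of 5/6 (about 83%) and a serial fraction of 1/175 (about 0.6%), so under this simple model even a small non-parallel share accounts for the gap from ideal scaling.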
Locating hardware faults in a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-04-13
Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
Resnik, Barry I
2009-01-01
It is ethical, legal, and proper for a dermatologist to maximize income through proper coding of patient encounters and procedures. The overzealous physician can misinterpret reimbursement requirements or receive bad advice from other physicians and cross the line from aggressive coding to coding fraud. Several of the more common problem areas are discussed.
An interactive programme for weighted Steiner trees
NASA Astrophysics Data System (ADS)
Zanchetta do Nascimento, Marcelo; Ramos Batista, Valério; Raffa Coimbra, Wendhel
2015-01-01
We introduce a fully written programmed code with a supervised method for generating weighted Steiner trees. Our choice of the programming language, and the use of well-known theorems from Geometry and Complex Analysis, allowed this method to be implemented with only 764 lines of effective source code. This eases the understanding and the handling of this beta version for future developments.
The Xyce Parallel Electronic Simulator - An Overview
HUTCHINSON,SCOTT A.; KEITER,ERIC R.; HOEKSTRA,ROBERT J.; WATTS,HERMAN A.; WATERS,ARLON J.; SCHELLS,REGINA L.; WIX,STEVEN D.
2000-12-08
The Xyce™ Parallel Electronic Simulator has been written to support the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on providing the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). In addition, they are providing improved performance for numerical kernels using state-of-the-art algorithms, support for modeling circuit phenomena at a variety of abstraction levels and using object-oriented and modern coding practices that ensure the code will be maintainable and extensible far into the future. The code is a parallel code in the most general sense of the phrase (a message passing parallel implementation), which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows.
Badger, P.C.
1995-12-31
Short rotation intensive culture tree plantations have been a major part of biomass energy concepts since the beginning. One aspect receiving less attention than it deserves is harvesting. This article describes a method of harvesting somewhere between agricultural mowing machines and huge feller-bunchers of the pulpwood and lumber industries.
Mark J. Ambrose
2012-01-01
Tree mortality is a natural process in all forest ecosystems. However, extremely high mortality also can be an indicator of forest health issues. On a regional scale, high mortality levels may indicate widespread insect or disease problems. High mortality may also occur if a large proportion of the forest in a particular region is made up of older, senescent stands....
Mark J. Ambrose
2013-01-01
Tree mortality is a natural process in all forest ecosystems. However, extremely high mortality can also be an indicator of forest health issues. On a regional scale, high mortality levels may indicate widespread insect or disease problems. High mortality may also occur if a large proportion of the forest in a particular region is made up of older, senescent stands....
Mark J. Ambrose
2013-01-01
Tree mortality is a natural process in all forest ecosystems. However, extremely high mortality also can be an indicator of forest health issues. On a regional scale, high mortality levels may indicate widespread insect or disease problems. High mortality may also occur if a large proportion of the forests in a region is made up of older, senescent stands.
NASA Technical Reports Server (NTRS)
Pollara, Fabrizio; Hamkins, Jon; Dolinar, Sam; Andrews, Ken; Divsalar, Dariush
2006-01-01
This viewgraph presentation reviews uplink coding. The purpose and goals of the briefing are to: (1) show a plan for using uplink coding and describe benefits; (2) define possible solutions and their applicability to different types of uplink, including emergency uplink; (3) concur with our conclusions so we can embark on a plan to use proposed uplink system; (4) identify the need for the development of appropriate technology and infusion in the DSN; and (5) gain advocacy to implement uplink coding in flight projects. Action Item EMB04-1-14: Show a plan for using uplink coding, including showing where it is useful or not (include discussion of emergency uplink coding).
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism, to keep a large number of processors busy, and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large-scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Special parallel processing workshop
1994-12-01
This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concepts dealing with parallel processing.
Parallel Eclipse Project Checkout
NASA Technical Reports Server (NTRS)
Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.
2011-01-01
Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (xml file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to check out the plug-ins, this program makes the plug-in checkouts parallel. After parsing the feature, a checkout request for each plug-in in the feature is inserted. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to check out now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer's home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer's home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. It can be applied to any
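The workflow described above (parse a feature description, queue one checkout request per plug-in, and serve the queue with a thread pool of configurable size) can be sketched in a few lines. The Python below is an illustration only, not the PEPC implementation, which targets Eclipse. It assumes the feature file lists plug-ins as `plugin` elements with an `id` attribute, and the `checkout` callable is a hypothetical stand-in for whatever version-control operation fetches one plug-in.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def plugin_ids(feature_xml):
    """Extract plug-in ids from a feature description given as an XML string."""
    root = ET.fromstring(feature_xml)
    return [p.get("id") for p in root.iter("plugin")]


def checkout_all(feature_xml, checkout, max_workers=4):
    """Run `checkout(plugin_id)` for every listed plug-in on a thread pool.

    `checkout` is a caller-supplied function (hypothetical here); the pool
    size is configurable through `max_workers`. Returns {id: result}.
    """
    ids = plugin_ids(feature_xml)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so ids and results line up.
        return dict(zip(ids, pool.map(checkout, ids)))
```

Threads suit this workload because each checkout spends most of its time waiting on the network, so several transfers can overlap and use bandwidth that a single sequential checkout would leave idle.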
Integrated Task and Data Parallel Programming
NASA Technical Reports Server (NTRS)
Grimshaw, A. S.
1998-01-01
This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated
Utilizing GPUs to Accelerate Turbomachinery CFD Codes
NASA Technical Reports Server (NTRS)
MacCalla, Weylin; Kulkarni, Sameer
2016-01-01
GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.
Global tree network for computing structures enabling global processing operations
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.
2010-01-19
A system and method for enabling high-speed, low-latency global tree network communications among processing nodes interconnected according to a tree network structure. The global tree network enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations performed include one or more of: broadcast operations downstream from a root node to leaf nodes of a virtual tree, reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node. The global tree network is configurable to provide global barrier and interrupt functionality in asynchronous or synchronized manner, and, is physically and logically partitionable.
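The upstream reduction and downstream broadcast named in this abstract can be illustrated in software. The following is only a minimal sketch, not the patented hardware design: it assumes a binary tree stored in heap order (node i has children 2*i+1 and 2*i+2), and the names `tree_reduce`, `tree_broadcast`, and `tree_depth` are hypothetical.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_reduce(values: List[T], op: Callable[[T, T], T]) -> T:
    """Combine one value per node upstream, from the leaves to the root.

    Assumes heap-order numbering (children of i are 2*i+1 and 2*i+2)
    and an associative, commutative operator such as sum or max.
    """
    if not values:
        raise ValueError("the tree needs at least one node")
    partial = list(values)
    # Children always have larger indices than their parent, so walking
    # the indices backwards finishes every subtree before its parent.
    for node in range(len(partial) - 1, 0, -1):
        parent = (node - 1) // 2
        partial[parent] = op(partial[parent], partial[node])
    return partial[0]


def tree_broadcast(root_value: T, n_nodes: int) -> List[T]:
    """Send the root's value downstream until every node holds a copy."""
    received: List[T] = [root_value] * 0 + [None] * n_nodes  # type: ignore
    received[0] = root_value
    for node in range(n_nodes):
        for child in (2 * node + 1, 2 * node + 2):
            if child < n_nodes:
                received[child] = received[node]
    return received


def tree_depth(n_nodes: int) -> int:
    """Number of hops from the root to the deepest node."""
    return n_nodes.bit_length() - 1


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    print(tree_reduce(data, lambda a, b: a + b))
    print(tree_broadcast("go", len(data)))
    print(tree_depth(len(data)))
```

The point of the tree layout is that both operations need a number of hops that grows with the tree depth (roughly the base-2 logarithm of the node count) rather than with the node count itself.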
Functional Data Analysis of Tree Data Objects.
Shen, Dan; Shen, Haipeng; Bhamidi, Shankar; Maldonado, Yolanda Muñoz; Kim, Yongdai; Marron, J S
2014-01-01
Data analysis on non-Euclidean spaces, such as tree spaces, can be challenging. The main contribution of this paper is establishment of a connection between tree data spaces and the well developed area of Functional Data Analysis (FDA), where the data objects are curves. This connection comes through two tree representation approaches, the Dyck path representation and the branch length representation. These representations of trees in Euclidean spaces enable us to exploit the power of FDA to explore statistical properties of tree data objects. A major challenge in the analysis is the sparsity of tree branches in a sample of trees. We overcome this issue by using a tree pruning technique that focuses the analysis on important underlying population structures. This method parallels scale-space analysis in the sense that it reveals statistical properties of tree structured data over a range of scales. The effectiveness of these new approaches is demonstrated by some novel results obtained in the analysis of brain artery trees. The scale space analysis reveals a deeper relationship between structure and age. These methods are the first to find a statistically significant gender difference.
ERIC Educational Resources Information Center
Jenkins, Peter
Tree climbing offers a safe, inexpensive adventure sport that can be performed almost anywhere. Using standard procedures practiced in tree surgery or rock climbing, almost any tree can be climbed. Tree climbing provides challenge and adventure as well as a vigorous upper-body workout. Tree Climbers International classifies trees using a system…
Real-time SHVC software decoding with multi-threaded parallel processing
NASA Astrophysics Data System (ADS)
Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu
2014-09-01
This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.
An experimental APL compiler for a distributed memory parallel machine
Ching, W.M.; Katz, A.
1994-12-31
The authors developed an experimental APL compiler for the IBM SP1 distributed memory parallel machine. It accepts classical APL programs, without additional directives, and generates parallelized C code for execution on the SP1 machine. The compiler exploits data parallelism in APL programs based on parallel high level primitives. Program variables are either replicated or partitioned. They also present performance data for five moderate size programs running on the SP1.
Burda, Z; Erdmann, J; Petersson, B; Wattenberg, M
2003-02-01
We discuss the scaling properties of free branched polymers. The scaling behavior of the model is classified by the Hausdorff dimensions for the internal geometry, d(L) and d(H), and for the external one, D(L) and D(H). The dimensions d(H) and D(H) characterize the behavior for long distances, while d(L) and D(L) for short distances. We show that the internal Hausdorff dimension is d(L)=2 for generic and scale-free trees, contrary to d(H), which is known to be equal to 2 for generic trees and to vary between 2 and infinity for scale-free trees. We show that the external Hausdorff dimension D(H) is directly related to the internal one as D(H) = alpha d(H), where alpha is the stability index of the embedding weights for the nearest-vertex interactions. The index is alpha=2 for weights from the Gaussian domain of attraction and 0
Kubilius, Jonas
2014-01-01
Sharing code is becoming increasingly important in the wake of Open Science. In this review I describe and compare two popular code-sharing utilities, GitHub and Open Science Framework (OSF). GitHub is a mature, industry-standard tool but lacks focus towards researchers. In comparison, OSF offers a one-stop solution for researchers but a lot of functionality is still under development. I conclude by listing alternative lesser-known tools for code and materials sharing.
Computational electromagnetics and parallel dense matrix computations
Forsman, K.; Kettunen, L.; Gropp, W.; Levine, D.
1995-06-01
We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.
Computational electromagnetics and parallel dense matrix computations
Forsman, K.; Kettunen, L.; Gropp, W.
1995-12-01
We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.
HOPSPACK: Hybrid Optimization Parallel Search Package.
Gray, Genetha Anne.; Kolda, Tamara G.; Griffin, Joshua; Taddy, Matt; Martinez-Canales, Monica L.
2008-12-01
In this paper, we describe the technical details of HOPSPACK (Hybrid Optimization Parallel Search Package), a new software platform which facilitates combining multiple optimization routines into a single, tightly-coupled, hybrid algorithm that supports parallel function evaluations. The framework is designed such that existing optimization source code can be easily incorporated with minimal code modification. By maintaining the integrity of each individual solver, the strengths and code sophistication of the original optimization package are retained and exploited.
M-Code Benefits and Availability
2015-04-29
M-Code Increased Power: operate closer to jammer, under trees. M-Code Cryptography: More...significantly improved warfighter benefits - Increased power - Jamming resistance - BFEA compatibility - More secure and flexible cryptography
Parallel rendering techniques for massively parallel visualization
Hansen, C.; Krogh, M.; Painter, J.
1995-07-01
As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP's abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper will describe recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, sphere, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume renderers use a MIMD approach. Implementations for these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.
Interfacing Computer Aided Parallelization and Performance Analysis
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)
2003-01-01
When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
ERIC Educational Resources Information Center
NatureScope, 1986
1986-01-01
Provides: (1) background information on trees, focusing on the parts of trees and how they differ from other plants; (2) eight activities; and (3) ready-to-copy pages dealing with tree identification and tree rings. Activities include objective(s), recommended age level(s), subject area(s), list of materials needed, and procedures. (JN)
PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit
NASA Technical Reports Server (NTRS)
MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles
1999-01-01
In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.
Method of moment solutions to scattering problems in a parallel processing environment
NASA Technical Reports Server (NTRS)
Cwik, Tom; Partee, Jonathan; Patterson, Jean
1991-01-01
This paper describes the implementation of a parallelized method of moments (MOM) code into an interactive workstation environment. The workstation allows interactive solid body modeling and mesh generation, MOM analysis, and the graphical display of results. After describing the parallel computing environment, the implementation and results of parallelizing a general MOM code are presented in detail.
Xyce parallel electronic simulator : users' guide.
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.; Santarelli, Keith R.; Fixel, Deborah A.; Coffey, Todd Stirling; Russo, Thomas V.; Schiek, Richard Louis; Warrender, Christina E.; Keiter, Eric Richard; Pawlowski, Roger Patrick
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
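To make the A* tree search over rotamer assignments concrete, here is a small serial sketch. It is not the authors' GPU algorithm and it omits DEE pruning entirely; the energy tables (`self_e`, `pair_e`), the function names, and the particular lower bound are illustrative assumptions. The bound sums, for each unassigned position, the best rotamer's self energy plus its interactions with assigned positions and the smallest possible interaction with each later position, which never overestimates the remaining energy.

```python
import heapq
from typing import Dict, List, Tuple

Assign = Tuple[int, ...]
PairE = Dict[Tuple[int, int], List[List[float]]]


def energy(assign: Assign, self_e: List[List[float]], pair_e: PairE) -> float:
    """Energy of a (possibly partial) assignment of rotamers to positions."""
    total = sum(self_e[i][r] for i, r in enumerate(assign))
    for i in range(len(assign)):
        for j in range(i + 1, len(assign)):
            total += pair_e[(i, j)][assign[i]][assign[j]]
    return total


def lower_bound(assign: Assign, self_e: List[List[float]], pair_e: PairE) -> float:
    """Admissible estimate of the energy the unassigned positions will add."""
    n, k = len(self_e), len(assign)
    bound = 0.0
    for j in range(k, n):
        best = float("inf")
        for s in range(len(self_e[j])):
            e = self_e[j][s]
            # Exact interaction with positions that are already fixed.
            e += sum(pair_e[(i, j)][assign[i]][s] for i in range(k))
            # Optimistic interaction with each later, still-open position.
            e += sum(min(pair_e[(j, m)][s]) for m in range(j + 1, n))
            best = min(best, e)
        bound += best
    return bound


def astar_gmec(self_e: List[List[float]], pair_e: PairE) -> Tuple[Assign, float]:
    """Return a minimum-energy full assignment and its energy."""
    n = len(self_e)
    heap = [(lower_bound((), self_e, pair_e), ())]
    while heap:
        f, assign = heapq.heappop(heap)
        if len(assign) == n:
            return assign, f  # bound is zero here, so f is the true energy
        for r in range(len(self_e[len(assign)])):
            child = assign + (r,)
            g = energy(child, self_e, pair_e)
            heapq.heappush(heap, (g + lower_bound(child, self_e, pair_e), child))
    raise ValueError("search space is empty")


if __name__ == "__main__":
    # Hypothetical toy problem: 3 positions, 2 rotamers each.
    self_e = [[0, 1], [2, 0], [1, 1]]
    pair_e = {
        (0, 1): [[3, 0], [0, 2]],
        (0, 2): [[0, 1], [1, 0]],
        (1, 2): [[0, 2], [1, 0]],
    }
    print(astar_gmec(self_e, pair_e))
```

Because the bound is admissible, the first complete assignment popped from the priority queue is a global minimum; the cost of evaluating the bound at every node is the part the abstract describes accelerating.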
Parallelism In Rule-Based Systems
NASA Astrophysics Data System (ADS)
Sabharwal, Arvind; Iyengar, S. Sitharama; de Saussure, G.; Weisbin, C. R.
1988-03-01
Rule-based systems, which have proven to be extremely useful for several Artificial Intelligence and Expert Systems applications, currently face severe limitations due to the slow speed of their execution. To achieve the desired speed-up, this paper addresses the problem of parallelization of production systems and explores the various architectural and algorithmic possibilities. The inherent sources of parallelism in the production system structure are analyzed and the trade-offs, limitations and feasibility of exploitation of these sources of parallelism are presented. Based on this analysis, we propose a dedicated, coarse-grained, n-ary tree multiprocessor architecture for the parallel implementation of rule-based systems and then present algorithms for partitioning of rules in this architecture.
The dynamics of strangling among forest trees.
Okamoto, Kenichi W
2015-11-07
Strangler trees germinate and grow on other trees, eventually enveloping and potentially even girdling their hosts. This allows them to mitigate fitness costs otherwise incurred by germinating and competing with other trees on the forest floor, as well as minimize risks associated with host tree-fall. If stranglers can themselves host other strangler trees, they may not even seem to need non-stranglers to persist. Yet despite their high fitness potential, strangler trees neither dominate the communities in which they occur nor is the strategy particularly common outside of figs (genus Ficus). Here we analyze how dynamic interactions between strangling and non-strangling trees can shape the adaptive landscape for strangling mutants and mutant trees that have lost the ability to strangle. We find a threshold which strangler germination rates must exceed for selection to favor the evolution of strangling, regardless of how effectively hemiepiphytic stranglers may subsequently replace their hosts. This condition describes the magnitude of the phenotypic displacement in the ability to germinate on other trees necessary for invasion by a mutant tree that could potentially strangle its host following establishment as an epiphyte. We show how the relative abilities of strangling and non-strangling trees to occupy empty sites can govern whether strangling is an evolutionarily stable strategy, and obtain the conditions for strangler coexistence with non-stranglers. We then elucidate when the evolution of strangling can disrupt stable coexistence between commensal epiphytic ancestors and their non-strangling host trees. This allows us to highlight parallels between the invasion fitness of strangler trees arising from commensalist ancestors, and cases where strangling can arise in concert with the evolution of hemiepiphytism among free-standing ancestors. Finally, we discuss how our results can inform the evolutionary ecology of antagonistic interactions more generally.
Status and Verification of Edge Plasma Turbulence Code BOUT
Umansky, M V; Xu, X Q; Dudson, B; LoDestro, L L; Myra, J R
2009-01-08
The BOUT code is a detailed numerical model of tokamak edge turbulence based on collisional plasma fluid equations. BOUT solves for time evolution of plasma fluid variables: plasma density N_i, parallel ion velocity V_∥i, electron temperature T_e, ion temperature T_i, electric potential φ, parallel current j_∥, and parallel vector potential A_∥, in realistic 3D divertor tokamak geometry. The current status of the code, physics model, algorithms, and implementation is described. Results of verification testing are presented along with illustrative applications to tokamak edge turbulence.
E.G. McPherson; F. Ferrini
2010-01-01
We know that “trees are good,” and most people believe this to be true. But if this is so, why are so many trees neglected, and so many tree wells empty? An individual’s attitude toward trees may result from their firsthand encounters with specific trees. Understanding how attitudes about trees are shaped, particularly aversion to trees, is critical to the business of...
Parallelization of a Compositional Reservoir Simulator
NASA Astrophysics Data System (ADS)
Reme, Hilde; Åge Øye, Geir; Espedal, Magne S.; Fladmark, Gunnar E.
A finite volume discretization has been used to solve compositional flow in porous media. Secondary migration in fractured rocks has been the main motivation for the work. Multipoint flux approximation has been implemented and adaptive local grid refinement, based on domain decomposition, is used at fractures and faults. The parallelization method, which is described in this paper, strongly promotes code reuse and gives a very high level of parallelization despite low implementation costs. The programming framework is also portable to other platforms or other applications. We have presented computer experiments to examine the parallel efficiency of the implemented parallel simulator with respect to scalability and speedup. Keywords: porous media, multipoint flux approximation, domain decomposition, parallelization
Parallel-In-Time For Moving Meshes
Falgout, R. D.; Manteuffel, T. A.; Southworth, B.; Schroder, J. B.
2016-02-04
With steadily growing computational resources available, scientists must develop effective ways to utilize the increased resources. High performance, highly parallel software has become a standard. However, until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial differential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing sequential codes with only minor modifications. In this work, a rezoning-type moving mesh is applied to a diffusion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.
Torney, D. C.
2001-01-01
We have begun to characterize a variety of codes, motivated by potential implementation as (quaternary) DNA n-sequences, with letters denoted A, C, G, and T. The first codes we studied are the most reminiscent of conventional group codes. For these codes, Hamming similarity was generalized so that the score for matched letters takes more than one value, depending upon which letters are matched [2]. These codes consist of n-sequences satisfying an upper bound on the similarities, summed over the letter positions, of distinct codewords. We chose similarity 2 for matches of letters A and T and 3 for matches of the letters C and G, providing a rough approximation to double-strand bond energies in DNA. An inherent novelty of DNA codes is 'reverse complementation'. The latter may be defined, as follows, not only for alphabets of size four, but, more generally, for any even-size alphabet. All that is required is a matching of the letters of the alphabet: a partition into pairs. Then, the reverse complement of a codeword is obtained by reversing the order of its letters and replacing each letter by its match. For DNA, the matching is AT/CG because these are the Watson-Crick bonding pairs. Reversal arises because two DNA sequences form a double strand with opposite relative orientations. Thus, as will be described in detail, because in vitro decoding involves the formation of double-stranded DNA from two codewords, it is reasonable to assume - for universal applicability - that the reverse complement of any codeword is also a codeword. In particular, self-reverse complementary codewords are expressly forbidden in reverse-complement codes. Thus, an appropriate distance between all pairs of codewords must, when large, effectively prohibit binding between the respective codewords: to form a double strand. Only reverse-complement pairs of codewords should be able to bind. For most applications, a DNA code is to be bi-partitioned, such that the reverse-complementary pairs are separated
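The definitions in this abstract (reverse complement via the AT/CG matching, and a weighted Hamming similarity) are simple enough to sketch. The code below is an illustration under stated assumptions, not the author's construction: it reads "matches" as identical letters at the same position, scoring 2 for A or T and 3 for C or G, and the function names and the `max_similarity` parameter are hypothetical.

```python
from typing import Iterable

# Watson-Crick matching of the alphabet into pairs: A-T and C-G.
MATCH = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Assumed scores for a position where both words carry the same letter.
SCORE = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(word: str) -> str:
    """Reverse the letter order and replace each letter by its match."""
    return "".join(MATCH[c] for c in reversed(word))


def similarity(u: str, v: str) -> int:
    """Weighted Hamming similarity, summed over letter positions."""
    if len(u) != len(v):
        raise ValueError("codewords must have equal length")
    return sum(SCORE[a] for a, b in zip(u, v) if a == b)


def is_reverse_complement_code(code: Iterable[str], max_similarity: int) -> bool:
    """Check closure under reverse complement, absence of self-reverse
    complementary words, and the similarity bound on distinct codewords."""
    words = set(code)
    for w in words:
        rc = reverse_complement(w)
        if rc == w or rc not in words:
            return False
    return all(
        similarity(u, v) <= max_similarity
        for u in words
        for v in words
        if u != v
    )


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(similarity("AACG", "AACT"))
    print(is_reverse_complement_code({"AACG", "CGTT"}, max_similarity=7))
```

A word such as ACGT is its own reverse complement, so any candidate code containing it is rejected by the check above, in line with the abstract's statement that such codewords are forbidden.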
Applications of Parallel Processing in Configuration Analyses
NASA Technical Reports Server (NTRS)
Sundaram, Ppchuraman; Hager, James O.; Biedron, Robert T.
1999-01-01
The paper presents the recent progress made towards developing an efficient and user-friendly parallel environment for routine analysis of large CFD problems. The coarse-grain parallel version of the CFL3D Euler/Navier-Stokes analysis code, CFL3Dhp, has been ported onto most available parallel platforms. The CFL3Dhp solution accuracy on these parallel platforms has been verified with the CFL3D sequential analyses. User-friendly pre- and post-processing tools that enable a seamless transfer from sequential to parallel processing have been written. A static load balancing tool for CFL3Dhp analysis has also been implemented for achieving good parallel efficiency. For large problems, load balancing efficiency as high as 95% can be achieved even when a large number of processors are used. Linear scalability of the CFL3Dhp code with increasing number of processors has also been shown using a large installed transonic nozzle boattail analysis. To highlight the fast turn-around time of parallel processing, the TCA full configuration in sideslip Navier-Stokes drag polar at supersonic cruise has been obtained in a day. CFL3Dhp is currently being used as a production analysis tool.
On implementing large binary tree architectures in VLSI and WSI
Youn, H.Y.; Singh, A.D.
1989-04-01
The complete binary tree is known to support the parallel execution of important algorithms, which has given rise to much interest in implementing such architectures in VLSI and WSI. For large trees, the classical H-tree layout approach suffers from area inefficiency and long interconnects. Other proposed schemes are not well suited for the implementation of defect-tolerant designs. This paper presents an efficient scheme for the layout of large binary tree architectures by embedding the complete binary tree in a two-dimensional array of processing elements.
Performance issues for engineering analysis on MIMD parallel computers
Fang, H.E.; Vaughan, C.T.; Gardner, D.R.
1994-08-01
We discuss how engineering analysts can obtain greater computational resolution in a more timely manner from applications codes running on MIMD parallel computers. Both processor speed and memory capacity are important to achieving better performance than a serial vector supercomputer. To obtain good performance, a parallel applications code must be scalable. In addition, the aspect ratios of the subdomains in the decomposition of the simulation domain onto the parallel computer should be of order 1. We demonstrate these conclusions using simulations conducted with the PCTH shock wave physics code running on a Cray Y-MP, a 1024-node nCUBE 2, and an 1840-node Paragon.
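The remark about subdomain aspect ratios can be illustrated with a little arithmetic. The sketch below is not taken from the PCTH study: it assumes a regular 3D grid split evenly over a processor grid, and it uses the surface-to-volume ratio of a subdomain as a rough, assumed proxy for communication relative to computation. The function name `subdomain_metrics` is hypothetical.

```python
from typing import Dict, Tuple


def subdomain_metrics(
    cells: Tuple[int, int, int], procs: Tuple[int, int, int]
) -> Dict[str, object]:
    """Shape statistics for one subdomain of an evenly divided 3D grid.

    cells: number of grid cells along x, y, z.
    procs: number of processors along x, y, z.
    """
    edges = []
    for n, p in zip(cells, procs):
        if p <= 0 or n % p != 0:
            raise ValueError("each processor count must evenly divide the cells")
        edges.append(n // p)
    ex, ey, ez = edges
    volume = ex * ey * ez                        # cells updated locally
    surface = 2 * (ex * ey + ey * ez + ex * ez)  # faces shared with neighbours
    return {
        "edges": (ex, ey, ez),
        "aspect_ratio": max(edges) / min(edges),
        "surface_to_volume": surface / volume,
    }


if __name__ == "__main__":
    grid = (64, 64, 64)
    print(subdomain_metrics(grid, (4, 4, 4)))   # cubic subdomains
    print(subdomain_metrics(grid, (64, 1, 1)))  # thin slabs
```

Both decompositions use 64 processors with the same number of cells each, but the cubic subdomains have aspect ratio 1 and a much smaller surface-to-volume ratio than the slabs, which is one way to see why an aspect ratio of order 1 is recommended.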
Parallel distributed computing using Python
NASA Astrophysics Data System (ADS)
Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro
2011-09-01
This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
Parallel Computational Protein Design
Zhou, Yichao; Donald, Bruce R.; Zeng, Jianyang
2016-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that is guaranteed to find the global minimum energy solution (GMEC) is to combine both dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computation bottleneck of the large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab [1] to implement a GPU-based massively parallel A* algorithm for improving the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedups in large protein design cases with a small memory overhead compared to the traditional A* search algorithm implementation, while still guaranteeing the optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle the problems in which the conformation space is too large and the global optimal solution could not previously be computed. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with the state-of-the-art rotamer pruning algorithms such as iMinDEE [2] and DEEPer [3] to also consider continuous backbone and side-chain flexibility. PMID:27914056
Aerodynamic simulation on massively parallel systems
NASA Technical Reports Server (NTRS)
Haeuser, Jochem; Simon, Horst D.
1992-01-01
This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative that is both cost effective and able to provide the TeraFlops needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations, which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary-fitted grids and fully unstructured tetrahedral grids. A fully structured grid best represents the flow physics, while an unstructured grid gives the best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above-mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple data stream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two-dimensional grid generation module of Grid, which is a general two-dimensional and three-dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good test case for complex fluid dynamics codes, since the same data structures are used. All systems provided good speedups, but
A distributed particle simulation code in C++
Forslund, D.W.; Wingate, C.A.; Ford, P.S.; Junkins, J.S.; Pope, S.C.
1992-03-01
Although C++ has been successfully used in a variety of computer science applications, it has just recently begun to be used in scientific applications. We have found that the object-oriented properties of C++ lend themselves well to scientific computations by making maintenance of the code easier, by making the code easier to understand, and by providing a better paradigm for distributed memory parallel codes. We describe here aspects of developing a particle plasma simulation code using object-oriented techniques for use in a distributed computing environment. We initially designed and implemented the code for serial computation and then used the distributed programming toolkit ISIS to run it in parallel. In this connection we describe some of the difficulties presented by using C++ for doing parallel and scientific computation.
A distributed particle simulation code in C++
Forslund, D.W.; Wingate, C.A.; Ford, P.S.; Junkins, J.S.; Pope, S.C.
1992-01-01
Although C++ has been successfully used in a variety of computer science applications, it has just recently begun to be used in scientific applications. We have found that the object-oriented properties of C++ lend themselves well to scientific computations by making maintenance of the code easier, by making the code easier to understand, and by providing a better paradigm for distributed memory parallel codes. We describe here aspects of developing a particle plasma simulation code using object-oriented techniques for use in a distributed computing environment. We initially designed and implemented the code for serial computation and then used the distributed programming toolkit ISIS to run it in parallel. In this connection we describe some of the difficulties presented by using C++ for doing parallel and scientific computation.
Kubilius, Jonas
2014-01-01
Sharing code is becoming increasingly important in the wake of Open Science. In this review I describe and compare two popular code-sharing utilities, GitHub and Open Science Framework (OSF). GitHub is a mature, industry-standard tool but lacks focus towards researchers. In comparison, OSF offers a one-stop solution for researchers but a lot of functionality is still under development. I conclude by listing alternative lesser-known tools for code and materials sharing. PMID:25165519
REBOUND: Multi-purpose N-body code for collisional dynamics
NASA Astrophysics Data System (ADS)
Rein, Hanno; Liu, Shang-Fei
2011-10-01
REBOUND is a multi-purpose N-body code which is freely available under an open-source license. It was designed for collisional dynamics such as planetary rings but can also solve the classical N-body problem. It is highly modular and can be customized easily to work on a wide variety of different problems in astrophysics and beyond. REBOUND comes with three symplectic integrators: leap-frog, the symplectic epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open, periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut tree to calculate both self-gravity and collisions. These modules are fully parallelized with MPI as well as OpenMP. The former makes use of a static domain decomposition and a distributed essential tree. Two new collision detection modules based on a plane-sweep algorithm are also implemented. The performance of the plane-sweep algorithm is superior to a tree code for simulations in which one dimension is much longer than the other two and in simulations which are quasi-two dimensional with less than one million particles.
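As an illustration of the first integrator named in the abstract above, the following is a minimal sketch of a leap-frog step. It is not taken from REBOUND: the kick-drift-kick ordering, the function names, and the one-dimensional harmonic test force are all assumptions chosen for brevity.

```python
def leapfrog_step(x, v, accel, dt):
    """Advance position x and velocity v by one kick-drift-kick leap-frog step."""
    v_half = v + 0.5 * dt * accel(x)          # half kick with the old acceleration
    x_new = x + dt * v_half                   # full drift
    v_new = v_half + 0.5 * dt * accel(x_new)  # half kick with the new acceleration
    return x_new, v_new


def harmonic_accel(x):
    """Hypothetical test force: unit-frequency harmonic oscillator."""
    return -x


def integrate(x, v, accel, dt, n_steps):
    """Apply n_steps leap-frog steps and return the final (x, v)."""
    for _ in range(n_steps):
        x, v = leapfrog_step(x, v, accel, dt)
    return x, v


def energy(x, v):
    """Total energy of the unit-mass, unit-frequency oscillator."""
    return 0.5 * (x * x + v * v)
```

Because the scheme is symplectic, the energy error of the test oscillator stays bounded over long runs rather than drifting, which is the usual reason for preferring such integrators in N-body work.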
Parallel BLAST on split databases.
Mathog, David R
2003-09-22
BLAST programs often run on large SMP machines where multiple threads can work simultaneously and there is enough memory to cache the databases between program runs. A group of programs is described which allows comparable performance to be achieved with a Beowulf configuration in which no node has enough memory to cache a database but the cluster as an aggregate does. To achieve this result, databases are split into equal-sized pieces and stored locally on each node. Each query is run on all nodes in parallel, and the resultant BLAST output files from all nodes are merged to yield the final output. Source code is available from ftp://saf.bio.caltech.edu/
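The split, search, and merge pattern described in the abstract above can be sketched as follows. This is a toy illustration, not the described programs: the function names are hypothetical, the per-node "search" is a stand-in scoring function rather than BLAST, and the nodes are simulated by a plain loop.

```python
def split_evenly(records, n_nodes):
    """Split records into n_nodes contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(records), n_nodes)
    pieces, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        pieces.append(records[start:start + size])
        start += size
    return pieces


def search_piece(query, piece):
    """Toy stand-in for a per-node search: score is the number of shared characters."""
    return [(len(set(query) & set(seq)), seq) for seq in piece]


def merge_hits(per_node_hits, top):
    """Merge the hit lists from all nodes and keep the best-scoring entries."""
    merged = [hit for hits in per_node_hits for hit in hits]
    return sorted(merged, key=lambda h: (-h[0], h[1]))[:top]


def split_search(query, database, n_nodes, top):
    """Run the toy search on each piece (sequentially here) and merge the results."""
    pieces = split_evenly(database, n_nodes)
    return merge_hits([search_piece(query, p) for p in pieces], top)
```

With this toy score, the merged result equals a search over the undivided database. Real BLAST statistics such as E-values depend on database size, so a faithful merge needs more care than this sketch shows.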
ERIC Educational Resources Information Center
Webb, Richard; Forbatha, Ann
1982-01-01
Strategies for using trees in classroom instruction are provided. Includes: (1) activities (such as tree identification, mapping, measuring tree height/width); (2) list of aesthetic, architectural, engineering, climate, and wildlife functions of trees; (3) tree discussion questions; and (4) references. (JN)
David J. Nowak; Jeffrey T. Walton; James Baldwin; Jerry. Bond
2015-01-01
Information on street trees is critical for management of this important resource. Sampling of street tree populations provides an efficient means to obtain street tree population information. Long-term repeat measures of street tree samples supply additional information on street tree changes and can be used to report damages from catastrophic events. Analyses of...
Joe R. McBride; David J. Nowak
1989-01-01
A survey of published reports on urban park tree inventories in the United States and the United Kingdom reveals two types of inventories: (1) Tree Location Inventories and (2) Generalized Information Inventories. Tree location inventories permit managers to relocate specific park trees, along with providing individual tree characteristics and condition data. In...
Katie Himanga; Douglas Jones; Jean Miller; Janette Monear; Gail Steinman; Katherine Widin
2001-01-01
Tree Trust has been helping people plant trees in their communities since 1976. Our goal is to educate people about the importance of trees in their community and guide them through the process of successful tree-planting projects. Franklin Delano Roosevelt once said "to exist as a nation, to prosper as a state, and to live as a people, we must have trees"....
An Integrated Procedure for Tree N-body Simulations: FLY and AstroMD
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.; Buonomo, F.; Gheller, C.
We present a new code for evolving three-dimensional self-gravitating collisionless systems with a large number of particles N >= 10^7. FLY (Fast Level-based N-bodY code) is a fully parallel code based on a tree algorithm. It adopts periodic boundary conditions implemented by means of the Ewald summation technique. FLY is based on the one-sided communication paradigm for sharing data among the processors that access remote private data, avoiding any kind of synchronization. The code was originally developed on a CRAY T3E system using the SHMEM library and it was ported to SGI ORIGIN 2000 and IBM SP (on the latter making use of the LAPI library). FLY version 1.1 is open source, freely available code. FLY output data can be analysed with AstroMD, an analysis and visualization tool specifically designed for astrophysical data. AstroMD can manage different physical quantities. It can find structures without well defined shape or symmetries, and perform quantitative calculations on selected regions. AstroMD is freely available.
On-Line Construction of Parameterized Suffix Trees
NASA Astrophysics Data System (ADS)
Lee, Taehyung; Na, Joong Chae; Park, Kunsoo
We consider on-line construction of a suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.
Parallel community climate model: Description and user's guide
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H.
1996-07-15
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
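The patch decomposition described in the abstract above can be illustrated with a small sketch. This is not the PCCM2 scheme: the row-major rank layout, the one-patch-per-processor mapping, the near-equal block splitting, and all names here are assumptions made for illustration.

```python
def block_range(n, parts, k):
    """Half-open index range [start, stop) of block k when n points are split into parts blocks."""
    base, extra = divmod(n, parts)
    start = k * base + min(k, extra)
    stop = start + base + (1 if k < extra else 0)
    return start, stop


def patch_for_rank(rank, nlat, nlon, p_lat, p_lon):
    """Latitude and longitude index ranges owned by a processor rank.

    Assumes a p_lat x p_lon processor grid with ranks numbered row by row,
    and one rectangular patch per processor.
    """
    i, j = divmod(rank, p_lon)
    return block_range(nlat, p_lat, i), block_range(nlon, p_lon, j)
```

For example, on a hypothetical 64 x 128 grid with a 4 x 8 processor layout, `patch_for_rank(0, 64, 128, 4, 8)` gives the latitude range (0, 16) and longitude range (0, 16). Column physics on those points needs no data from other ranks, which is the locality property the abstract relies on.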
CodedStream: live media streaming with overlay coded multicast
NASA Astrophysics Data System (ADS)
Guo, Jiang; Zhu, Ying; Li, Baochun
2003-12-01
Multicasting is a natural paradigm for streaming live multimedia to multiple end receivers. Since IP multicast is not widely deployed, many application-layer multicast protocols have been proposed. However, all of these schemes focus on the construction of multicast trees, where a relatively small number of links carry the multicast streaming load, while the capacity of most of the other links in the overlay network remains unused. In this paper, we propose CodedStream, a high-bandwidth live media distribution system based on end-system overlay multicast. In CodedStream, we construct a k-redundant multicast graph (a directed acyclic graph) as the multicast topology, on which network coding is applied to work around bottlenecks. Simulation results have shown that the combination of k-redundant multicast graph and network coding may indeed bring significant benefits with respect to improving the quality of live media at the end receivers.
Massively-Parallel Dislocation Dynamics Simulations
Cai, W; Bulatov, V V; Pierce, T G; Hiratani, M; Rhee, M; Bartelt, M; Tang, M
2003-06-18
Prediction of the plastic strength of single crystals based on the collective dynamics of dislocations has been a challenge for computational materials science for a number of years. The difficulty lies in the inability of the existing dislocation dynamics (DD) codes to handle a sufficiently large number of dislocation lines, in order to be statistically representative and to reproduce experimentally observed microstructures. A new massively-parallel DD code is developed that is capable of modeling million-dislocation systems by employing thousands of processors. We discuss the general aspects of this code that make such large scale simulations possible, as well as a few initial simulation results.
Payne, J.L.; Hassan, B.
1998-09-01
Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.
Epetra developers coding guidelines.
Heroux, Michael Allen; Sexton, Paul Michael
2003-12-01
Epetra is a package of classes for the construction and use of serial and distributed parallel linear algebra objects. It is one of the base packages in Trilinos. This document describes guidelines for Epetra coding style. The issues discussed here go beyond correct C++ syntax to address issues that make code more readable and self-consistent. The guidelines presented here are intended to aid current and future development of Epetra specifically. They reflect design decisions that were made in the early development stages of Epetra. Some of the guidelines are contrary to more commonly used conventions, but we choose to continue these practices for the purposes of self-consistency. These guidelines are intended to be complementary to policies established in the Trilinos Developers Guide.
NASA Astrophysics Data System (ADS)
Schnack, D. D.; Glasser, A. H.
1996-11-01
NIMROD is a new code system that is being developed for the analysis of modern fusion experiments. It is being designed from the beginning to make the maximum use of massively parallel computer architectures and computer graphics. The NIMROD physics kernel solves the three-dimensional, time-dependent two-fluid equations with neo-classical effects in toroidal geometry of arbitrary poloidal cross section. The NIMROD system also includes a pre-processor, a grid generator, and a post processor. User interaction with NIMROD is facilitated by a modern graphical user interface (GUI). The NIMROD project is using Quality Function Deployment (QFD) team management techniques to minimize re-engineering and reduce code development time. This paper gives an overview of the NIMROD project. Operation of the GUI is demonstrated, and the first results from the physics kernel are given.
Parallel flow diffusion battery
Yeh, Hsu-Chi; Cheng, Yung-Sung
1984-08-07
A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.
Parallel flow diffusion battery
Yeh, H.C.; Cheng, Y.S.
1984-01-01
A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
NASA Technical Reports Server (NTRS)
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.