Sample records for parallel tree code

  1. An implementation of a tree code on a SIMD, parallel computer

    NASA Technical Reports Server (NTRS)

    Olson, Kevin M.; Dorband, John E.

    1994-01-01

    We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally, interacting, disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 make them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.

  2. PENTACLE: Parallelized particle-particle particle-tree code for planet formation

    NASA Astrophysics Data System (ADS)

    Iwasawa, Masaki; Oshino, Shoichi; Fujii, Michiko S.; Hori, Yasunori

    2017-10-01

    We have newly developed a parallelized particle-particle particle-tree code for planet formation, PENTACLE, which is a parallelized hybrid N-body integrator executed on a CPU-based (super)computer. PENTACLE uses a fourth-order Hermite algorithm to calculate gravitational interactions between particles within a cut-off radius and a Barnes-Hut tree method for gravity from particles beyond. It also implements an open-source library designed for full automatic parallelization of particle simulations, FDPS (Framework for Developing Particle Simulator), to parallelize a Barnes-Hut tree algorithm for a memory-distributed supercomputer. These allow us to handle 1-10 million particles in a high-resolution N-body simulation on CPU clusters for collisional dynamics, including physical collisions in a planetesimal disc. In this paper, we show the performance and the accuracy of PENTACLE in terms of \\tilde{R}_cut and a time-step Δt. It turns out that the accuracy of a hybrid N-body simulation is controlled through Δ t / \\tilde{R}_cut and Δ t / \\tilde{R}_cut ˜ 0.1 is necessary to simulate accurately the accretion process of a planet for ≥106 yr. For all those interested in large-scale particle simulations, PENTACLE, customized for planet formation, will be freely available from https://github.com/PENTACLE-Team/PENTACLE under the MIT licence.

  3. FLY MPI-2: a parallel tree code for LSS

    NASA Astrophysics Data System (ADS)

    Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

    2006-04-01

    New version program summaryProgram title: FLY 3.1 Catalogue identifier: ADSC_v2_0 Licensing provisions: yes Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0 Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland No. of lines in distributed program, including test data, etc.: 158 172 No. of bytes in distributed program, including test data, etc.: 4 719 953 Distribution format: tar.gz Programming language: Fortran 90, C Computer: Beowulf cluster, PC, MPP systems Operating system: Linux, Aix RAM: 100M words Catalogue identifier of previous version: ADSC_v1_0 Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159 Does the new version supersede the previous version?: yes Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986) Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block

  4. Hybrid Parallel Contour Trees, Version 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sewell, Christopher; Fasel, Patricia; Carr, Hamish

    A common operation in scientific visualization is to compute and render a contour of a data set. Given a function of the form f : R^d -> R, a level set is defined as an inverse image f^-1(h) for an isovalue h, and a contour is a single connected component of a level set. The Reeb graph can then be defined to be the result of contracting each contour to a single point, and is well defined for Euclidean spaces or for general manifolds. For simple domains, the graph is guaranteed to be a tree, and is called the contourmore » tree. Analysis can then be performed on the contour tree in order to identify isovalues of particular interest, based on various metrics, and render the corresponding contours, without having to know such isovalues a priori. This code is intended to be the first data-parallel algorithm for computing contour trees. Our implementation will use the portable data-parallel primitives provided by Nvidia’s Thrust library, allowing us to compile our same code for both GPUs and multi-core CPUs. Native OpenMP and purely serial versions of the code will likely also be included. It will also be extended to provide a hybrid data-parallel / distributed algorithm, allowing scaling beyond a single GPU or CPU.« less

  5. Reconstructing evolutionary trees in parallel for massive sequences.

    PubMed

    Zou, Quan; Wan, Shixiang; Zeng, Xiangxiang; Ma, Zhanshan Sam

    2017-12-14

    Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .

  6. GRADSPMHD: A parallel MHD code based on the SPH formalism

    NASA Astrophysics Data System (ADS)

    Vanaverbeke, S.; Keppens, R.; Poedts, S.

    2014-03-01

    We present GRADSPMHD, a completely Lagrangian parallel magnetohydrodynamics code based on the SPH formalism. The implementation of the equations of SPMHD in the “GRAD-h” formalism assembles known results, including the derivation of the discretized MHD equations from a variational principle, the inclusion of time-dependent artificial viscosity, resistivity and conductivity terms, as well as the inclusion of a mixed hyperbolic/parabolic correction scheme for satisfying the ∇ṡB→ constraint on the magnetic field. The code uses a tree-based formalism for neighbor finding and can optionally use the tree code for computing the self-gravity of the plasma. The structure of the code closely follows the framework of our parallel GRADSPH FORTRAN 90 code which we added previously to the CPC program library. We demonstrate the capabilities of GRADSPMHD by running 1, 2, and 3 dimensional standard benchmark tests and we find good agreement with previous work done by other researchers. The code is also applied to the problem of simulating the magnetorotational instability in 2.5D shearing box tests as well as in global simulations of magnetized accretion disks. We find good agreement with available results on this subject in the literature. Finally, we discuss the performance of the code on a parallel supercomputer with distributed memory architecture. Catalogue identifier: AERP_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERP_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 620503 No. of bytes in distributed program, including test data, etc.: 19837671 Distribution format: tar.gz Programming language: FORTRAN 90/MPI. Computer: HPC cluster. Operating system: Unix. Has the code been vectorized or parallelized?: Yes, parallelized using MPI. RAM: ˜30 MB for a

  7. Gravitational tree-code on graphics processing units: implementation in CUDA

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-05-01

    We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way we achieve a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html

  8. PARAVT: Parallel Voronoi tessellation code

    NASA Astrophysics Data System (ADS)

    González, R. E.

    2016-10-01

    In this study, we present a new open source code for massive parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.

  9. Advances in Parallelization for Large Scale Oct-Tree Mesh Generation

    NASA Technical Reports Server (NTRS)

    O'Connell, Matthew; Karman, Steve L.

    2015-01-01

    Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large scale off-body hierarchical meshes is presented here. This work enables large scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" oct-tree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results are shown including a one billion cell mesh generated around a Predator Unmanned Aerial Vehicle geometry, which was generated on 64 processors in under 45 minutes.

  10. Computational efficiency of parallel combinatorial OR-tree searches

    NASA Technical Reports Server (NTRS)

    Li, Guo-Jie; Wah, Benjamin W.

    1990-01-01

    The performance of parallel combinatorial OR-tree searches is analytically evaluated. This performance depends on the complexity of the problem to be solved, the error allowance function, the dominance relation, and the search strategies. The exact performance may be difficult to predict due to the nondeterminism and anomalies of parallelism. The authors derive the performance bounds of parallel OR-tree searches with respect to the best-first, depth-first, and breadth-first strategies, and verify these bounds by simulation. They show that a near-linear speedup can be achieved with respect to a large number of processors for parallel OR-tree searches. Using the bounds developed, the authors derive sufficient conditions for assuring that parallelism will not degrade performance and necessary conditions for allowing parallelism to have a speedup greater than the ratio of the numbers of processors. These bounds and conditions provide the theoretical foundation for determining the number of processors required to assure a near-linear speedup.

  11. National Combustion Code Parallel Performance Enhancements

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Benyo, Theresa (Technical Monitor)

    2002-01-01

    The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.

  12. Code Parallelization with CAPO: A User Manual

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

    2001-01-01

    A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.

  13. The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation

    NASA Astrophysics Data System (ADS)

    Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru

    1999-09-01

    We have developed a new parallel tree method which will be called the forest method hereafter. This new method uses the sectional Voronoi tessellation (SVT) for the domain decomposition. The SVT decomposes a whole space into polyhedra and allows their flat borders to move by assigning different weights. The forest method determines these weights based on the load balancing among processors by means of the overload diffusion (OLD). Moreover, since all the borders are flat, before receiving the data from other processors, each processor can collect enough data to calculate the gravity force with precision. Both the SVT and the OLD are coded in a highly vectorizable manner to accommodate on vector parallel processors. The parallel code based on the forest method with the Message Passing Interface is run on various platforms so that a wide portability is guaranteed. Extensive calculations with 15 processors of Fujitsu VPP300/16R indicate that the code can calculate the gravity force exerted on 105 particles in each second for some ideal dark halo. This code is found to enable an N-body simulation with 107 or more particles for a wide dynamic range and is therefore a very powerful tool for the study of galaxy formation and large-scale structure in the universe.

  14. A parallel and modular deformable cell Car-Parrinello code

    NASA Astrophysics Data System (ADS)

    Cavazzoni, Carlo; Chiarotti, Guido L.

    1999-12-01

    We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree like dependences among modules. The modules include not only the variables but also the methods acting on them, in an object oriented fashion. The modular structure allows easier code maintenance, develop and debugging procedures, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up in a wide range of number of processors independently of the architecture. Super-linear speed up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ˜50 ps.

  15. Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes

    PubMed Central

    Farreras, Montse

    2014-01-01

    Abstract The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. PMID:24597675

  16. National Combustion Code: Parallel Implementation and Performance

    NASA Technical Reports Server (NTRS)

    Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.

    2000-01-01

    The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.

  17. Parallel Adaptive Mesh Refinement Library

    NASA Technical Reports Server (NTRS)

    Mac-Neice, Peter; Olson, Kevin

    2005-01-01

    Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.

  18. Parallel peak pruning for scalable SMP contour tree computation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carr, Hamish A.; Weber, Gunther H.; Sewell, Christopher M.

    As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this formmore » of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.« less

  19. Tree-based solvers for adaptive mesh refinement code FLASH - I: gravity and optical depths

    NASA Astrophysics Data System (ADS)

    Wünsch, R.; Walch, S.; Dinnbier, F.; Whitworth, A.

    2018-04-01

    We describe an OctTree algorithm for the MPI parallel, adaptive mesh refinement code FLASH, which can be used to calculate the gas self-gravity, and also the angle-averaged local optical depth, for treating ambient diffuse radiation. The algorithm communicates to the different processors only those parts of the tree that are needed to perform the tree-walk locally. The advantage of this approach is a relatively low memory requirement, important in particular for the optical depth calculation, which needs to process information from many different directions. This feature also enables a general tree-based radiation transport algorithm that will be described in a subsequent paper, and delivers excellent scaling up to at least 1500 cores. Boundary conditions for gravity can be either isolated or periodic, and they can be specified in each direction independently, using a newly developed generalization of the Ewald method. The gravity calculation can be accelerated with the adaptive block update technique by partially re-using the solution from the previous time-step. Comparison with the FLASH internal multigrid gravity solver shows that tree-based methods provide a competitive alternative, particularly for problems with isolated or mixed boundary conditions. We evaluate several multipole acceptance criteria (MACs) and identify a relatively simple approximate partial error MAC which provides high accuracy at low computational cost. The optical depth estimates are found to agree very well with those of the RADMC-3D radiation transport code, with the tree-solver being much faster. Our algorithm is available in the standard release of the FLASH code in version 4.0 and later.

  20. A Data Parallel Multizone Navier-Stokes Code

    NASA Technical Reports Server (NTRS)

    Jespersen, Dennis C.; Levit, Creon; Kwak, Dochan (Technical Monitor)

    1995-01-01

    We have developed a data parallel multizone compressible Navier-Stokes code on the Connection Machine CM-5. The code is set up for implicit time-stepping on single or multiple structured grids. For multiple grids and geometrically complex problems, we follow the "chimera" approach, where flow data on one zone is interpolated onto another in the region of overlap. We will describe our design philosophy and give some timing results for the current code. The design choices can be summarized as: 1. finite differences on structured grids; 2. implicit time-stepping with either distributed solves or data motion and local solves; 3. sequential stepping through multiple zones with interzone data transfer via a distributed data structure. We have implemented these ideas on the CM-5 using CMF (Connection Machine Fortran), a data parallel language which combines elements of Fortran 90 and certain extensions, and which bears a strong similarity to High Performance Fortran (HPF). One interesting feature is the issue of turbulence modeling, where the architecture of a parallel machine makes the use of an algebraic turbulence model awkward, whereas models based on transport equations are more natural. We will present some performance figures for the code on the CM-5, and consider the issues involved in transitioning the code to HPF for portability to other parallel platforms.

  1. A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

    PubMed

    Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

    2018-03-01

    Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.

  2. National Combustion Code: Parallel Performance

    NASA Technical Reports Server (NTRS)

    Babrauckas, Theresa

    2001-01-01

    This report discusses the National Combustion Code (NCC). The NCC is an integrated system of codes for the design and analysis of combustion systems. The advanced features of the NCC meet designers' requirements for model accuracy and turn-around time. The fundamental features at the inception of the NCC were parallel processing and unstructured mesh. The design and performance of the NCC are discussed.

  3. RY-Coding and Non-Homogeneous Models Can Ameliorate the Maximum-Likelihood Inferences From Nucleotide Sequence Data with Parallel Compositional Heterogeneity.

    PubMed

    Ishikawa, Sohta A; Inagaki, Yuji; Hashimoto, Tetsuo

    2012-01-01

    In phylogenetic analyses of nucleotide sequences, 'homogeneous' substitution models, which assume the stationarity of base composition across a tree, are widely used, albeit individual sequences may bear distinctive base frequencies. In the worst-case scenario, a homogeneous model-based analysis can yield an artifactual union of two distantly related sequences that achieved similar base frequencies in parallel. Such potential difficulty can be countered by two approaches, 'RY-coding' and 'non-homogeneous' models. The former approach converts four bases into purine and pyrimidine to normalize base frequencies across a tree, while the heterogeneity in base frequency is explicitly incorporated in the latter approach. The two approaches have been applied to real-world sequence data; however, their basic properties have not been fully examined by pioneering simulation studies. Here, we assessed the performances of the maximum-likelihood analyses incorporating RY-coding and a non-homogeneous model (RY-coding and non-homogeneous analyses) on simulated data with parallel convergence to similar base composition. Both RY-coding and non-homogeneous analyses showed superior performances compared with homogeneous model-based analyses. Curiously, the performance of RY-coding analysis appeared to be significantly affected by a setting of the substitution process for sequence simulation relative to that of non-homogeneous analysis. The performance of a non-homogeneous analysis was also validated by analyzing a real-world sequence data set with significant base heterogeneity.

  4. New Parallel computing framework for radiation transport codes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kostin, M.A.; /Michigan State U., NSCL; Mokhov, N.V.

    A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility canmore » be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.« less

  5. Capabilities of Fully Parallelized MHD Stability Code MARS

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2016-10-01

    Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.

  6. Image compression using quad-tree coding with morphological dilation

    NASA Astrophysics Data System (ADS)

    Wu, Jiaji; Jiang, Weiwei; Jiao, Licheng; Wang, Lei

    2007-11-01

    In this paper, we propose a new algorithm which integrates morphological dilation operation to quad-tree coding, the purpose of doing this is to compensate each other's drawback by using quad-tree coding and morphological dilation operation respectively. New algorithm can not only quickly find the seed significant coefficient of dilation but also break the limit of block boundary of quad-tree coding. We also make a full use of both within-subband and cross-subband correlation to avoid the expensive cost of representing insignificant coefficients. Experimental results show that our algorithm outperforms SPECK and SPIHT. Without using any arithmetic coding, our algorithm can achieve good performance with low computational cost and it's more suitable to mobile devices or scenarios with a strict real-time requirement.

  7. Identifying failure in a tree network of a parallel computer

    DOEpatents

    Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.

    2010-08-24

    Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.

  8. Parallelization of a Monte Carlo particle transport simulation code

    NASA Astrophysics Data System (ADS)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.

  9. Parallel Scaling Characteristics of Selected NERSC User ProjectCodes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Skinner, David; Verdier, Francesca; Anand, Harsh

    This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems.more » An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.« less

  10. Parallel CARLOS-3D code development

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Putnam, J.M.; Kotulski, J.D.

    1996-02-01

    CARLOS-3D is a three-dimensional scattering code which was developed under the sponsorship of the Electromagnetic Code Consortium, and is currently used by over 80 aerospace companies and government agencies. The code has been extensively validated and runs on both serial workstations and parallel super computers such as the Intel Paragon. CARLOS-3D is a three-dimensional surface integral equation scattering code based on a Galerkin method of moments formulation employing Rao- Wilton-Glisson roof-top basis for triangular faceted surfaces. Fully arbitrary 3D geometries composed of multiple conducting and homogeneous bulk dielectric materials can be modeled. This presentation describes some of the extensions tomore » the CARLOS-3D code, and how the operator structure of the code facilitated these improvements. Body of revolution (BOR) and two-dimensional geometries were incorporated by simply including new input routines, and the appropriate Galerkin matrix operator routines. Some additional modifications were required in the combined field integral equation matrix generation routine due to the symmetric nature of the BOR and 2D operators. Quadrilateral patched surfaces with linear roof-top basis functions were also implemented in the same manner. Quadrilateral facets and triangular facets can be used in combination to more efficiently model geometries with both large smooth surfaces and surfaces with fine detail such as gaps and cracks. Since the parallel implementation in CARLOS-3D is at high level, these changes were independent of the computer platform being used. This approach minimizes code maintenance, while providing capabilities with little additional effort. Results are presented showing the performance and accuracy of the code for some large scattering problems. Comparisons between triangular faceted and quadrilateral faceted geometry representations will be shown for some complex scatterers.« less

  11. A Parallel Decoding Algorithm for Short Polar Codes Based on Error Checking and Correcting

    PubMed Central

    Pan, Xiaofei; Pan, Kegang; Ye, Zhan; Gong, Chao

    2014-01-01

    We propose a parallel decoding algorithm based on error checking and correcting to improve the performance of the short polar codes. In order to enhance the error-correcting capacity of the decoding algorithm, we first derive the error-checking equations generated on the basis of the frozen nodes, and then we introduce the method to check the errors in the input nodes of the decoder by the solutions of these equations. In order to further correct those checked errors, we adopt the method of modifying the probability messages of the error nodes with constant values according to the maximization principle. Due to the existence of multiple solutions of the error-checking equations, we formulate a CRC-aided optimization problem of finding the optimal solution with three different target functions, so as to improve the accuracy of error checking. Besides, in order to increase the throughput of decoding, we use a parallel method based on the decoding tree to calculate probability messages of all the nodes in the decoder. Numerical results show that the proposed decoding algorithm achieves better performance than that of some existing decoding algorithms with the same code length. PMID:25540813

  12. Parallelization Issues and Particle-In Codes.

    NASA Astrophysics Data System (ADS)

    Elster, Anne Cathrine

    1994-01-01

    "Everything should be made as simple as possible, but not simpler." Albert Einstein. The field of parallel scientific computing has concentrated on parallelization of individual modules such as matrix solvers and factorizers. However, many applications involve several interacting modules. Our analyses of a particle-in-cell code modeling charged particles in an electric field, show that these accompanying dependencies affect data partitioning and lead to new parallelization strategies concerning processor, memory and cache utilization. Our test-bed, a KSR1, is a distributed memory machine with a globally shared addressing space. However, most of the new methods presented hold generally for hierarchical and/or distributed memory systems. We introduce a novel approach that uses dual pointers on the local particle arrays to keep the particle locations automatically partially sorted. Complexity and performance analyses with accompanying KSR benchmarks, have been included for both this scheme and for the traditional replicated grids approach. The latter approach maintains load-balance with respect to particles. However, our results demonstrate it fails to scale properly for problems with large grids (say, greater than 128-by-128) running on as few as 15 KSR nodes, since the extra storage and computation time associated with adding the grid copies, becomes significant. Our grid partitioning scheme, although harder to implement, does not need to replicate the whole grid. Consequently, it scales well for large problems on highly parallel systems. It may, however, require load balancing schemes for non-uniform particle distributions. Our dual pointer approach may facilitate this through dynamically partitioned grids. We also introduce hierarchical data structures that store neighboring grid-points within the same cache -line by reordering the grid indexing. This alignment produces a 25% savings in cache-hits for a 4-by-4 cache. A consideration of the input data's effect on

  13. Implementation of a 3D mixing layer code on parallel computers

    NASA Technical Reports Server (NTRS)

    Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.

    1995-01-01

    This paper summarizes our progress and experience in the development of a Computational-Fluid-Dynamics code on parallel computers to simulate three-dimensional spatially-developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers. The code was then converted for use on parallel computers using the conventional message-passing technique, while we have not been able to compile the code with the present version of HPF compilers.

  14. New Bandwidth Efficient Parallel Concatenated Coding Schemes

    NASA Technical Reports Server (NTRS)

    Denedetto, S.; Divsalar, D.; Montorsi, G.; Pollara, F.

    1996-01-01

    We propose a new solution to parallel concatenation of trellis codes with multilevel amplitude/phase modulations and a suitable iterative decoding structure. Examples are given for throughputs 2 bits/sec/Hz with 8PSK and 16QAM signal constellations.

  15. Parallelization of KENO-Va Monte Carlo code

    NASA Astrophysics Data System (ADS)

    Ramón, Javier; Peña, Jorge

    1995-07-01

    KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code: one for shared memory machines and other for distributed memory systems using the message-passing interface PVM have been generated. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. A FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.

  16. OCTGRAV: Sparse Octree Gravitational N-body Code on Graphics Processing Units

    NASA Astrophysics Data System (ADS)

    Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon

    2010-10-01

    Octgrav is a very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The algorithms are based on parallel-scan and sort methods. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force calculation which consists of tree walks and evaluation of interaction list is carried out on the GPU. In this way, a sustained performance of about 100GFLOP/s and data transfer rates of about 50GB/s is achieved. It takes about a second to compute forces on a million particles with an opening angle of heta approx 0.5. To test the performance and feasibility, we implemented the algorithms in CUDA in the form of a gravitational tree-code which completely runs on the GPU. The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second. The code has a convenient user interface and is freely available for use.

  17. Parallel family trees for transfer matrices in the Potts model

    NASA Astrophysics Data System (ADS)

    Navarro, Cristobal A.; Canfora, Fabrizio; Hitschfeld, Nancy; Navarro, Gonzalo

    2015-02-01

    The computational cost of transfer matrix methods for the Potts model is related to the question in how many ways can two layers of a lattice be connected? Answering the question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q , v) transfer matrix methods for strips is in the order of the Catalan numbers, which grows asymptotically as O(4m) where m is the width of the strip. Other transfer matrix methods with a smaller configuration space indeed exist but they make assumptions on the temperature, number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3m) to build the generic (q , v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is now computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while still highly parallel. The resulting matrix is stored in a compressed form using O(3m ×4m) of space, making numerical evaluation and decompression to be faster than evaluating the matrix in its O(4m ×4m) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strong-scaling speedups of up to 5.7 × when running on an 8-core shared memory machine and 28 × for a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved when using p = 4 processors, while for the cluster

  18. Performance Analysis and Optimization on the UCLA Parallel Atmospheric General Circulation Model Code

    NASA Technical Reports Server (NTRS)

    Lou, John; Ferraro, Robert; Farrara, John; Mechoso, Carlos

    1996-01-01

    An analysis is presented of several factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on massively parallel computer systems. Several modificaitons to the original parallel AGCM code aimed at improving its numerical efficiency, interprocessor communication cost, load-balance and issues affecting single-node code performance are discussed.

  19. ANNarchy: a code generation approach to neural simulations on parallel hardware

    PubMed Central

    Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.

    2015-01-01

    Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which allows to easily define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into an efficient C++code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957

  20. Performance of a parallel thermal-hydraulics code TEMPEST

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fann, G.I.; Trent, D.S.

    The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scalingmore » performance of 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) are presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on an 128 processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.« less

  1. Scalability study of parallel spatial direct numerical simulation code on IBM SP1 parallel supercomputer

    NASA Technical Reports Server (NTRS)

    Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad

    1994-01-01

    The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent in three-dimensional boundary-layer flows are computed with the PS-DNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.

  2. Coding hazardous tree failures for a data management system

    Treesearch

    Lee A. Paine

    1978-01-01

    Codes for automatic data processing (ADP) are provided for hazardous tree failure data submitted on Report of Tree Failure forms. Definitions of data items and suggestions for interpreting ambiguously worded reports are also included. The manual is intended to insure the production of accurate and consistent punched ADP cards which are used in transfer of the data to...

  3. Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

    NASA Technical Reports Server (NTRS)

    OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

    1998-01-01

    This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).

  4. Composing Data Parallel Code for a SPARQL Graph Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste

    Big data analytics process large amount of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph based representation provides several benefits, such as the possibility to perform in memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which requires more than basicmore » graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.« less

  5. Parallel processing a three-dimensional free-lagrange code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mandell, D.A.; Trease, H.E.

    1989-01-01

    A three-dimensional, time-dependent free-Lagrange hydrodynamics code has been multitasked and autotasked on a CRAY X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the CRAY multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The three-dimensional algorithm has presented a number of problems that simpler algorithms, such as those for one-dimensional hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a CRAY-1, to a multitasking code are discussed. Autotasking of a rewritten versionmore » of the code is discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given.« less

  6. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azad, Ariful; Buluc, Aydn; Pothen, Alex

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less

  7. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    DOE PAGES

    Azad, Ariful; Buluc, Aydn; Pothen, Alex

    2016-03-24

    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less

  8. Parallel Grand Canonical Monte Carlo (ParaGrandMC) Simulation Code

    NASA Technical Reports Server (NTRS)

    Yamakov, Vesselin I.

    2016-01-01

    This report provides an overview of the Parallel Grand Canonical Monte Carlo (ParaGrandMC) simulation code. This is a highly scalable parallel FORTRAN code for simulating the thermodynamic evolution of metal alloy systems at the atomic level, and predicting the thermodynamic state, phase diagram, chemical composition and mechanical properties. The code is designed to simulate multi-component alloy systems, predict solid-state phase transformations such as austenite-martensite transformations, precipitate formation, recrystallization, capillary effects at interfaces, surface absorption, etc., which can aid the design of novel metallic alloys. While the software is mainly tailored for modeling metal alloys, it can also be used for other types of solid-state systems, and to some degree for liquid or gaseous systems, including multiphase systems forming solid-liquid-gas interfaces.

  9. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

    PubMed

    Wan, Shixiang; Zou, Quan

    2017-01-01

    Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.

  10. Performance of a parallel code for the Euler equations on hypercube computers

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Chan, Tony F.; Jesperson, Dennis C.; Tuminaro, Raymond S.

    1990-01-01

    The performance of hypercubes were evaluated on a computational fluid dynamics problem and the parallel environment issues were considered that must be addressed, such as algorithm changes, implementation choices, programming effort, and programming environment. The evaluation focuses on a widely used fluid dynamics code, FLO52, which solves the two dimensional steady Euler equations describing flow around the airfoil. The code development experience is described, including interacting with the operating system, utilizing the message-passing communication system, and code modifications necessary to increase parallel efficiency. Results from two hypercube parallel computers (a 16-node iPSC/2, and a 512-node NCUBE/ten) are discussed and compared. In addition, a mathematical model of the execution time was developed as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained and is used to explore the performance of the code in interesting but yet physically realizable regions of the parameter space. Based on this model, predictions about future hypercubes are made.

  11. Parallel processing a real code: A case history

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mandell, D.A.; Trease, H.E.

    1988-01-01

    A three-dimensional, time-dependent Free-Lagrange hydrodynamics code has been multitasked and autotasked on a Cray X-MP/416. The multitasking was done by using the Los Alamos Multitasking Control Library, which is a superset of the Cray multitasking library. Autotasking is done by using constructs which are only comment cards if the source code is not run through a preprocessor. The 3-D algorithm has presented a number of problems that simpler algorithms, such as 1-D hydrodynamics, did not exhibit. Problems in converting the serial code, originally written for a Cray 1, to a multitasking code are discussed, Autotasking of a rewritten version ofmore » the code is discussed. Timing results for subroutines and hot spots in the serial code are presented and suggestions for additional tools and debugging aids are given. Theoretical speedup results obtained from Amdahl's law and actual speedup results obtained on a dedicated machine are presented. Suggestions for designing large parallel codes are given. 8 refs., 13 figs.« less

  12. The implementation of an aeronautical CFD flow code onto distributed memory parallel systems

    NASA Astrophysics Data System (ADS)

    Ierotheou, C. S.; Forsey, C. R.; Leatham, M.

    2000-04-01

    The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations. Copyright

  13. Boltzmann Transport Code Update: Parallelization and Integrated Design Updates

    NASA Technical Reports Server (NTRS)

    Heinbockel, J. H.; Nealy, J. E.; DeAngelis, G.; Feldman, G. A.; Chokshi, S.

    2003-01-01

    The on going efforts at developing a web site for radiation analysis is expected to result in an increased usage of the High Charge and Energy Transport Code HZETRN. It would be nice to be able to do the requested calculations quickly and efficiently. Therefore the question arose, "Could the implementation of parallel processing speed up the calculations required?" To answer this question two modifications of the HZETRN computer code were created. The first modification selected the shield material of Al(2219) , then polyethylene and then Al(2219). The modified Fortran code was labeled 1SSTRN.F. The second modification considered the shield material of CO2 and Martian regolith. This modified Fortran code was labeled MARSTRN.F.

  14. Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1

    NASA Technical Reports Server (NTRS)

    Wright, Michael J.; White, Todd; Mangini, Nancy

    2009-01-01

    Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamic (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.

  15. A Parallel Numerical Micromagnetic Code Using FEniCS

    NASA Astrophysics Data System (ADS)

    Nagy, L.; Williams, W.; Mitchell, L.

    2013-12-01

    Many problems in the geosciences depend on understanding the ability of magnetic minerals to provide stable paleomagnetic recordings. Numerical micromagnetic modelling allows us to calculate the domain structures found in naturally occurring magnetic materials. However the computational cost rises exceedingly quickly with respect to the size and complexity of the geometries that we wish to model. This problem is compounded by the fact that the modern processor design no longer focuses on the speed at which calculations are performed, but rather on the number of computational units amongst which we may distribute our calculations. Consequently to better exploit modern computational resources our micromagnetic simulations must "go parallel". We present a parallel and scalable micromagnetics code written using FEniCS. FEniCS is a multinational collaboration involving several institutions (University of Cambridge, University of Chicago, The Simula Research Laboratory, etc.) that aims to provide a set of tools for writing scientific software; in particular software that employs the finite element method. The advantages of this approach are the leveraging of pre-existing projects from the world of scientific computing (PETSc, Trilinos, Metis/Parmetis, etc.) and exposing these so that researchers may pose problems in a manner closer to the mathematical language of their domain. Our code provides a scriptable interface (in Python) that allows users to not only run micromagnetic models in parallel, but also to perform pre/post processing of data.

  16. CUBE: Information-optimized parallel cosmological N-body simulation code

    NASA Astrophysics Data System (ADS)

    Yu, Hao-Ran; Pen, Ue-Li; Wang, Xin

    2018-05-01

    CUBE, written in Coarray Fortran, is a particle-mesh based parallel cosmological N-body simulation code. The memory usage of CUBE can approach as low as 6 bytes per particle. Particle pairwise (PP) force, cosmological neutrinos, spherical overdensity (SO) halofinder are included.

  17. An Expert System for the Development of Efficient Parallel Code

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.

  18. Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Rizk, Yehia M.

    1999-01-01

    The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable on distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of the "Gauss-Sidel" fashion. The programming efforts for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each

  19. Development of Parallel Code for the Alaska Tsunami Forecast Model

    NASA Astrophysics Data System (ADS)

    Bahng, B.; Knight, W. R.; Whitmore, P.

    2014-12-01

    The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs; and, will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.

  20. Parallelization of PANDA discrete ordinates code using spatial decomposition

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Humbert, P.

    2006-07-01

    We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete Ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by directions and octants pipelining. The implementation of the algorithm is straightforward using MPI blocking point to point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7more » benchmark of the OECD/NEA. (authors)« less

  1. Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing

    NASA Technical Reports Server (NTRS)

    Ozguner, Fusun

    1996-01-01

    Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.

  2. User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Earth Sciences Division; Zhang, Keni; Zhang, Keni

    TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator ismore » to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4. This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code, The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical

  3. Adaptive zero-tree structure for curved wavelet image coding

    NASA Astrophysics Data System (ADS)

    Zhang, Liang; Wang, Demin; Vincent, André

    2006-02-01

    We investigate the issue of efficient data organization and representation of the curved wavelet coefficients [curved wavelet transform (WT)]. We present an adaptive zero-tree structure that exploits the cross-subband similarity of the curved wavelet transform. In the embedded zero-tree wavelet (EZW) and the set partitioning in hierarchical trees (SPIHT), the parent-child relationship is defined in such a way that a parent has four children, restricted to a square of 2×2 pixels, the parent-child relationship in the adaptive zero-tree structure varies according to the curves along which the curved WT is performed. Five child patterns were determined based on different combinations of curve orientation. A new image coder was then developed based on this adaptive zero-tree structure and the set-partitioning technique. Experimental results using synthetic and natural images showed the effectiveness of the proposed adaptive zero-tree structure for encoding of the curved wavelet coefficients. The coding gain of the proposed coder can be up to 1.2 dB in terms of peak SNR (PSNR) compared to the SPIHT coder. Subjective evaluation shows that the proposed coder preserves lines and edges better than the SPIHT coder.

  4. A Very Fast and Angular Momentum Conserving Tree Code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Marcello, Dominic C., E-mail: dmarce504@gmail.com

    There are many methods used to compute the classical gravitational field in astrophysical simulation codes. With the exception of the typically impractical method of direct computation, none ensure conservation of angular momentum to machine precision. Under uniform time-stepping, the Cartesian fast multipole method of Dehnen (also known as the very fast tree code) conserves linear momentum to machine precision. We show that it is possible to modify this method in a way that conserves both angular and linear momenta.

  5. Code Optimization and Parallelization on the Origins: Looking from Users' Perspective

    NASA Technical Reports Server (NTRS)

    Chang, Yan-Tyng Sherry; Thigpen, William W. (Technical Monitor)

    2002-01-01

    Parallel machines are becoming the main compute engines for high performance computing. Despite their increasing popularity, it is still a challenge for most users to learn the basic techniques to optimize/parallelize their codes on such platforms. In this paper, we present some experiences on learning these techniques for the Origin systems at the NASA Advanced Supercomputing Division. Emphasis of this paper will be on a few essential issues (with examples) that general users should master when they work with the Origins as well as other parallel systems.

  6. A visual parallel-BCI speller based on the time-frequency coding strategy.

    PubMed

    Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

    2014-04-01

    Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper is to develop a visual parallel-BCI speller system based on the time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, thereby all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min(-1), with an average of 54.0 bit min(-1) and 43.0 bit min(-1) in the three rounds and five rounds, respectively. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.

  7. GSHR-Tree: a spatial index tree based on dynamic spatial slot and hash table in grid environments

    NASA Astrophysics Data System (ADS)

    Chen, Zhanlong; Wu, Xin-cai; Wu, Liang

    2008-12-01

    distributed operation, reduplication operation transfer operation of spatial index in the grid environment. The design of GSHR-Tree has ensured the performance of the load balance in the parallel computation. This tree structure is fit for the parallel process of the spatial information in the distributed network environments. Instead of spatial object's recursive comparison where original R tree has been used, the algorithm builds the spatial index by applying binary code operation in which computer runs more efficiently, and extended dynamic hash code for bit comparison. In GSHR-Tree, a new server is assigned to the network whenever a split of a full node is required. We describe a more flexible allocation protocol which copes with a temporary shortage of storage resources. It uses a distributed balanced binary spatial tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. The application manipulates the GSHR-Tree structure from a node in the grid environment. The node addresses the tree through its image that the splits can make outdated. This may generate addressing errors, solved by the forwarding among the servers. In this paper, a spatial index data distribution algorithm that limits the number of servers has been proposed. We improve the storage utilization at the cost of additional messages. The structure of GSHR-Tree is believed that the scheme of this grid spatial index should fit the needs of new applications using endlessly larger sets of spatial data. Our proposal constitutes a flexible storage allocation method for a distributed spatial index. The insertion policy can be tuned dynamically to cope with periods of storage shortage. In such cases storage balancing should be favored for better space utilization, at the price of extra message exchanges between servers. This structure makes a compromise in the updating of the duplicated index and the transformation of the spatial index data. Meeting the

  8. Use of Hilbert Curves in Parallelized CUDA code: Interaction of Interstellar Atoms with the Heliosphere

    NASA Astrophysics Data System (ADS)

    Destefano, Anthony; Heerikhuisen, Jacob

    2015-04-01

    Fully 3D particle simulations can be a computationally and memory expensive task, especially when high resolution grid cells are required. The problem becomes further complicated when parallelization is needed. In this work we focus on computational methods to solve these difficulties. Hilbert curves are used to map the 3D particle space to the 1D contiguous memory space. This method of organization allows for minimized cache misses on the GPU as well as a sorted structure that is equivalent to an octal tree data structure. This type of sorted structure is attractive for uses in adaptive mesh implementations due to the logarithm search time. Implementations using the Message Passing Interface (MPI) library and NVIDIA's parallel computing platform CUDA will be compared, as MPI is commonly used on server nodes with many CPU's. We will also compare static grid structures with those of adaptive mesh structures. The physical test bed will be simulating heavy interstellar atoms interacting with a background plasma, the heliosphere, simulated from fully consistent coupled MHD/kinetic particle code. It is known that charge exchange is an important factor in space plasmas, specifically it modifies the structure of the heliosphere itself. We would like to thank the Alabama Supercomputer Authority for the use of their computational resources.

  9. A visual parallel-BCI speller based on the time-frequency coding strategy

    NASA Astrophysics Data System (ADS)

    Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong

    2014-04-01

    Objective. Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper is to develop a visual parallel-BCI speller system based on the time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. Approach. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, thereby all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Main results. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min-1, with an average of 54.0 bit min-1 and 43.0 bit min-1 in the three rounds and five rounds, respectively. Significance. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.

  10. Competitive code-based fast palmprint identification using a set of cover trees

    NASA Astrophysics Data System (ADS)

    Yue, Feng; Zuo, Wangmeng; Zhang, David; Wang, Kuanquan

    2009-06-01

    A palmprint identification system recognizes a query palmprint image by searching for its nearest neighbor from among all the templates in a database. When applied on a large-scale identification system, it is often necessary to speed up the nearest-neighbor searching process. We use competitive code, which has very fast feature extraction and matching speed, for palmprint identification. To speed up the identification process, we extend the cover tree method and propose to use a set of cover trees to facilitate the fast and accurate nearest-neighbor searching. We can use the cover tree method because, as we show, the angular distance used in competitive code can be decomposed into a set of metrics. Using the Hong Kong PolyU palmprint database (version 2) and a large-scale palmprint database, our experimental results show that the proposed method searches for nearest neighbors faster than brute force searching.

  11. Development Of A Parallel Performance Model For The THOR Neutral Particle Transport Code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yessayan, Raffi; Azmy, Yousry; Schunert, Sebastian

    The THOR neutral particle transport code enables simulation of complex geometries for various problems from reactor simulations to nuclear non-proliferation. It is undergoing a thorough V&V requiring computational efficiency. This has motivated various improvements including angular parallelization, outer iteration acceleration, and development of peripheral tools. For guiding future improvements to the code’s efficiency, better characterization of its parallel performance is useful. A parallel performance model (PPM) can be used to evaluate the benefits of modifications and to identify performance bottlenecks. Using INL’s Falcon HPC, the PPM development incorporates an evaluation of network communication behavior over heterogeneous links and a functionalmore » characterization of the per-cell/angle/group runtime of each major code component. After evaluating several possible sources of variability, this resulted in a communication model and a parallel portion model. The former’s accuracy is bounded by the variability of communication on Falcon while the latter has an error on the order of 1%.« less

  12. GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling

    NASA Astrophysics Data System (ADS)

    Miki, Yohei; Umemura, Masayuki

    2017-04-01

    The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on-the-fly and minimizes the time-to-solution by balancing the measured time of multiple functions. Results of performance measurements with realistic particle distribution performed on NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, which are representative GPUs of the Fermi, Kepler, and Maxwell generation of GPUs, show that the hierarchical time step achieves a speedup by a factor of around 3-5 times compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 224 = 16,777,216 particles. The averaged performance of the code corresponds to 10-30% of the theoretical single precision peak performance of the GPU.

  13. Solar wind interaction with Venus and Mars in a parallel hybrid code

    NASA Astrophysics Data System (ADS)

    Jarvinen, Riku; Sandroos, Arto

    2013-04-01

    We discuss the development and applications of a new parallel hybrid simulation, where ions are treated as particles and electrons as a charge-neutralizing fluid, for the interaction between the solar wind and Venus and Mars. The new simulation code under construction is based on the algorithm of the sequential global planetary hybrid model developed at the Finnish Meteorological Institute (FMI) and on the Corsair parallel simulation platform also developed at the FMI. The FMI's sequential hybrid model has been used for studies of plasma interactions of several unmagnetized and weakly magnetized celestial bodies for more than a decade. Especially, the model has been used to interpret in situ particle and magnetic field observations from plasma environments of Mars, Venus and Titan. Further, Corsair is an open source MPI (Message Passing Interface) particle and mesh simulation platform, mainly aimed for simulations of diffusive shock acceleration in solar corona and interplanetary space, but which is now also being extended for global planetary hybrid simulations. In this presentation we discuss challenges and strategies of parallelizing a legacy simulation code as well as possible applications and prospects of a scalable parallel hybrid model for the solar wind interactions of Venus and Mars.

  14. VINE-A NUMERICAL CODE FOR SIMULATING ASTROPHYSICAL SYSTEMS USING PARTICLES. II. IMPLEMENTATION AND PERFORMANCE CHARACTERISTICS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nelson, Andrew F.; Wetzstein, M.; Naab, T.

    2009-10-01

    We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the codemore » necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE

  15. [Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].

    PubMed

    Furuta, Takuya; Sato, Tatsuhiko

    2015-01-01

    Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.

  16. Development of Parallel Computing Framework to Enhance Radiation Transport Code Capabilities for Rare Isotope Beam Facility Design

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kostin, Mikhail; Mokhov, Nikolai; Niita, Koji

    A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA andmore » MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.« less

  17. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  18. PIXIE3D: A Parallel, Implicit, eXtended MHD 3D Code.

    NASA Astrophysics Data System (ADS)

    Chacon, L.; Knoll, D. A.

    2004-11-01

    We report on the development of PIXIE3D, a 3D parallel, fully implicit Newton-Krylov extended primitive-variable MHD code in general curvilinear geometry. PIXIE3D employs a second-order, finite-volume-based spatial discretization that satisfies remarkable properties such as being conservative, solenoidal in the magnetic field, non-dissipative, and stable in the absence of physical dissipation.(L. Chacón , phComput. Phys. Comm.) submitted (2004) PIXIE3D employs fully-implicit Newton-Krylov methods for the time advance. Currently, first and second-order implicit schemes are available, although higher-order temporal implicit schemes can be effortlessly implemented within the Newton-Krylov framework. A successful, scalable, MG physics-based preconditioning strategy, similar in concept to previous 2D MHD efforts,(L. Chacón et al., phJ. Comput. Phys). 178 (1), 15- 36 (2002); phJ. Comput. Phys., 188 (2), 573-592 (2003) has been developed. We are currently in the process of parallelizing the code using the PETSc library, and a Newton-Krylov-Schwarz approach for the parallel treatment of the preconditioner. In this poster, we will report on both the serial and parallel performance of PIXIE3D, focusing primarily on scalability and CPU speedup vs. an explicit approach.

  19. Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry

    1998-01-01

    This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.

  20. Self-Scheduling Parallel Methods for Multiple Serial Codes with Application to WOPWOP

    NASA Technical Reports Server (NTRS)

    Long, Lyle N.; Brentner, Kenneth S.

    2000-01-01

    This paper presents a scheme for efficiently running a large number of serial jobs on parallel computers. Two examples are given of computer programs that run relatively quickly, but often they must be run numerous times to obtain all the results needed. It is very common in science and engineering to have codes that are not massive computing challenges in themselves, but due to the number of instances that must be run, they do become large-scale computing problems. The two examples given here represent common problems in aerospace engineering: aerodynamic panel methods and aeroacoustic integral methods. The first example simply solves many systems of linear equations. This is representative of an aerodynamic panel code where someone would like to solve for numerous angles of attack. The complete code for this first example is included in the appendix so that it can be readily used by others as a template. The second example is an aeroacoustics code (WOPWOP) that solves the Ffowcs Williams Hawkings equation to predict the far-field sound due to rotating blades. In this example, one quite often needs to compute the sound at numerous observer locations, hence parallelization is utilized to automate the noise computation for a large number of observers.

  1. Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

    NASA Astrophysics Data System (ADS)

    Boyko, Oleksiy; Zheleznyak, Mark

    2015-04-01

    The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.

  2. HBT+: an improved code for finding subhaloes and building merger trees in cosmological simulations

    NASA Astrophysics Data System (ADS)

    Han, Jiaxin; Cole, Shaun; Frenk, Carlos S.; Benitez-Llambay, Alejandro; Helly, John

    2018-02-01

    Dark matter subhalos are the remnants of (incomplete) halo mergers. Identifying them and establishing their evolutionary links in the form of merger trees is one of the most important applications of cosmological simulations. The HBT (Hierachical Bound-Tracing) code identifies haloes as they form and tracks their evolution as they merge, simultaneously detecting subhaloes and building their merger trees. Here we present a new implementation of this approach, HBT+ , that is much faster, more user friendly, and more physically complete than the original code. Applying HBT+ to cosmological simulations, we show that both the subhalo mass function and the peak-mass function are well fitted by similar double-Schechter functions. The ratio between the two is highest at the high-mass end, reflecting the resilience of massive subhaloes that experience substantial dynamical friction but limited tidal stripping. The radial distribution of the most-massive subhaloes is more concentrated than the universal radial distribution of lower mass subhaloes. Subhalo finders that work in configuration space tend to underestimate the masses of massive subhaloes, an effect that is stronger in the host centre. This may explain, at least in part, the excess of massive subhaloes in galaxy cluster centres inferred from recent lensing observations. We demonstrate that the peak-mass function is a powerful diagnostic of merger tree defects, and the merger trees constructed using HBT+ do not suffer from the missing or switched links that tend to afflict merger trees constructed from more conventional halo finders. We make the HBT+ code publicly available.

  3. CHOLLA: A New Massively Parallel Hydrodynamics Code for Astrophysical Simulation

    NASA Astrophysics Data System (ADS)

    Schneider, Evan E.; Robertson, Brant E.

    2015-04-01

    We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳2563) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.

  4. Parameters that affect parallel processing for computational electromagnetic simulation codes on high performance computing clusters

    NASA Astrophysics Data System (ADS)

    Moon, Hongsik

    What is the impact of multicore and associated advanced technologies on computational software for science? Most researchers and students have multicore laptops or desktops for their research and they need computing power to run computational software packages. Computing power was initially derived from Central Processing Unit (CPU) clock speed. That changed when increases in clock speed became constrained by power requirements. Chip manufacturers turned to multicore CPU architectures and associated technological advancements to create the CPUs for the future. Most software applications benefited by the increased computing power the same way that increases in clock speed helped applications run faster. However, for Computational ElectroMagnetics (CEM) software developers, this change was not an obvious benefit - it appeared to be a detriment. Developers were challenged to find a way to correctly utilize the advancements in hardware so that their codes could benefit. The solution was parallelization and this dissertation details the investigation to address these challenges. Prior to multicore CPUs, advanced computer technologies were compared with the performance using benchmark software and the metric was FLoting-point Operations Per Seconds (FLOPS) which indicates system performance for scientific applications that make heavy use of floating-point calculations. Is FLOPS an effective metric for parallelized CEM simulation tools on new multicore system? Parallel CEM software needs to be benchmarked not only by FLOPS but also by the performance of other parameters related to type and utilization of the hardware, such as CPU, Random Access Memory (RAM), hard disk, network, etc. The codes need to be optimized for more than just FLOPs and new parameters must be included in benchmarking. In this dissertation, the parallel CEM software named High Order Basis Based Integral Equation Solver (HOBBIES) is introduced. This code was developed to address the needs of the

  5. Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

    NASA Astrophysics Data System (ADS)

    Work, Paul R.

    1991-12-01

    This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.

  6. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    NASA Technical Reports Server (NTRS)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.

  7. On the error probability of general tree and trellis codes with applications to sequential decoding

    NASA Technical Reports Server (NTRS)

    Johannesson, R.

    1973-01-01

    An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random binary tree codes is derived and shown to be independent of the length of the tree. An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random L-branch binary trellis codes of rate R = 1/n is derived which separates the effects of the tail length T and the memory length M of the code. It is shown that the bound is independent of the length L of the information sequence. This implication is investigated by computer simulations of sequential decoding utilizing the stack algorithm. These simulations confirm the implication and further suggest an empirical formula for the true undetected decoding error probability with sequential decoding.

  8. Neptune: An astrophysical smooth particle hydrodynamics code for massively parallel computer architectures

    NASA Astrophysics Data System (ADS)

    Sandalski, Stou

    Smooth particle hydrodynamics is an efficient method for modeling the dynamics of fluids. It is commonly used to simulate astrophysical processes such as binary mergers. We present a newly developed GPU accelerated smooth particle hydrodynamics code for astrophysical simulations. The code is named neptune after the Roman god of water. It is written in OpenMP parallelized C++ and OpenCL and includes octree based hydrodynamic and gravitational acceleration. The design relies on object-oriented methodologies in order to provide a flexible and modular framework that can be easily extended and modified by the user. Several pre-built scenarios for simulating collisions of polytropes and black-hole accretion are provided. The code is released under the MIT Open Source license and publicly available at http://code.google.com/p/neptune-sph/.

  9. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  10. Parallel fast multipole boundary element method applied to computational homogenization

    NASA Astrophysics Data System (ADS)

    Ptaszny, Jacek

    2018-01-01

    In the present work, a fast multipole boundary element method (FMBEM) and a parallel computer code for 3D elasticity problem is developed and applied to the computational homogenization of a solid containing spherical voids. The system of equation is solved by using the GMRES iterative solver. The boundary of the body is dicretized by using the quadrilateral serendipity elements with an adaptive numerical integration. Operations related to a single GMRES iteration, performed by traversing the corresponding tree structure upwards and downwards, are parallelized by using the OpenMP standard. The assignment of tasks to threads is based on the assumption that the tree nodes at which the moment transformations are initialized can be partitioned into disjoint sets of equal or approximately equal size and assigned to the threads. The achieved speedup as a function of number of threads is examined.

  11. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  12. Schnek: A C++ library for the development of parallel simulation codes on regular grids

    NASA Astrophysics Data System (ADS)

    Schmitz, Holger

    2018-05-01

    A large number of algorithms across the field of computational physics are formulated on grids with a regular topology. We present Schnek, a library that enables fast development of parallel simulations on regular grids. Schnek contains a number of easy-to-use modules that greatly reduce the amount of administrative code for large-scale simulation codes. The library provides an interface for reading simulation setup files with a hierarchical structure. The structure of the setup file is translated into a hierarchy of simulation modules that the developer can specify. The reader parses and evaluates mathematical expressions and initialises variables or grid data. This enables developers to write modular and flexible simulation codes with minimal effort. Regular grids of arbitrary dimension are defined as well as mechanisms for defining physical domain sizes, grid staggering, and ghost cells on these grids. Ghost cells can be exchanged between neighbouring processes using MPI with a simple interface. The grid data can easily be written into HDF5 files using serial or parallel I/O.

  13. ALEGRA -- A massively parallel h-adaptive code for solid dynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Summers, R.M.; Wong, M.K.; Boucheron, E.A.

    1997-12-31

    ALEGRA is a multi-material, arbitrary-Lagrangian-Eulerian (ALE) code for solid dynamics designed to run on massively parallel (MP) computers. It combines the features of modern Eulerian shock codes, such as CTH, with modern Lagrangian structural analysis codes using an unstructured grid. ALEGRA is being developed for use on the teraflop supercomputers to conduct advanced three-dimensional (3D) simulations of shock phenomena important to a variety of systems. ALEGRA was designed with the Single Program Multiple Data (SPMD) paradigm, in which the mesh is decomposed into sub-meshes so that each processor gets a single sub-mesh with approximately the same number of elements. Usingmore » this approach the authors have been able to produce a single code that can scale from one processor to thousands of processors. A current major effort is to develop efficient, high precision simulation capabilities for ALEGRA, without the computational cost of using a global highly resolved mesh, through flexible, robust h-adaptivity of finite elements. H-adaptivity is the dynamic refinement of the mesh by subdividing elements, thus changing the characteristic element size and reducing numerical error. The authors are working on several major technical challenges that must be met to make effective use of HAMMER on MP computers.« less

  14. The UPSF code: a metaprogramming-based high-performance automatically parallelized plasma simulation framework

    NASA Astrophysics Data System (ADS)

    Gao, Xiatian; Wang, Xiaogang; Jiang, Binhao

    2017-10-01

    UPSF (Universal Plasma Simulation Framework) is a new plasma simulation code designed for maximum flexibility by using edge-cutting techniques supported by C++17 standard. Through use of metaprogramming technique, UPSF provides arbitrary dimensional data structures and methods to support various kinds of plasma simulation models, like, Vlasov, particle in cell (PIC), fluid, Fokker-Planck, and their variants and hybrid methods. Through C++ metaprogramming technique, a single code can be used to arbitrary dimensional systems with no loss of performance. UPSF can also automatically parallelize the distributed data structure and accelerate matrix and tensor operations by BLAS. A three-dimensional particle in cell code is developed based on UPSF. Two test cases, Landau damping and Weibel instability for electrostatic and electromagnetic situation respectively, are presented to show the validation and performance of the UPSF code.

  15. Distributed Contour Trees

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morozov, Dmitriy; Weber, Gunther H.

    2014-03-31

    Topological techniques provide robust tools for data analysis. They are used, for example, for feature extraction, for data de-noising, and for comparison of data sets. This chapter concerns contour trees, a topological descriptor that records the connectivity of the isosurfaces of scalar functions. These trees are fundamental to analysis and visualization of physical phenomena modeled by real-valued measurements. We study the parallel analysis of contour trees. After describing a particular representation of a contour tree, called local{global representation, we illustrate how di erent problems that rely on contour trees can be solved in parallel with minimal communication.

  16. SUPREM-DSMC: A New Scalable, Parallel, Reacting, Multidimensional Direct Simulation Monte Carlo Flow Code

    NASA Technical Reports Server (NTRS)

    Campbell, David; Wysong, Ingrid; Kaplan, Carolyn; Mott, David; Wadsworth, Dean; VanGilder, Douglas

    2000-01-01

    An AFRL/NRL team has recently been selected to develop a scalable, parallel, reacting, multidimensional (SUPREM) Direct Simulation Monte Carlo (DSMC) code for the DoD user community under the High Performance Computing Modernization Office (HPCMO) Common High Performance Computing Software Support Initiative (CHSSI). This paper will introduce the JANNAF Exhaust Plume community to this three-year development effort and present the overall goals, schedule, and current status of this new code.

  17. CRUNCH_PARALLEL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shumaker, Dana E.; Steefel, Carl I.

    The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL, (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusake (Steefel Yabsaki 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.

  18. A parallel Monte Carlo code for planar and SPECT imaging: implementation, verification and applications in (131)I SPECT.

    PubMed

    Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F

    2002-02-01

    This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.

  19. Dynamic Load Balancing Based on Constrained K-D Tree Decomposition for Parallel Particle Tracing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Jiang; Guo, Hanqi; Yuan, Xiaoru

    Particle tracing is a fundamental technique in flow field data visualization. In this work, we present a novel dynamic load balancing method for parallel particle tracing. Specifically, we employ a constrained k-d tree decomposition approach to dynamically redistribute tasks among processes. Each process is initially assigned a regularly partitioned block along with duplicated ghost layer under the memory limit. During particle tracing, the k-d tree decomposition is dynamically performed by constraining the cutting planes in the overlap range of duplicated data. This ensures that each process is reassigned particles as even as possible, and on the other hand the newmore » assigned particles for a process always locate in its block. Result shows good load balance and high efficiency of our method.« less

  20. Parallel coding of conjunctions in visual search.

    PubMed

    Found, A

    1998-10-01

    Two experiments investigated whether the conjunctive nature of nontarget items influenced search for a conjunction target. Each experiment consisted of two conditions. In both conditions, the target item was a red bar tilted to the right, among white tilted bars and vertical red bars. As well as color and orientation, display items also differed in terms of size. Size was irrelevant to search in that the size of the target varied randomly from trial to trial. In one condition, the size of items correlated with the other attributes of display items (e.g., all red items were big and all white items were small). In the other condition, the size of items varied randomly (i.e., some red items were small and some were big, and some white items were big and some were small). Search was more efficient in the size-correlated condition, consistent with the parallel coding of conjunctions in visual search.

  1. Performance analysis of parallel gravitational N-body codes on large GPU clusters

    NASA Astrophysics Data System (ADS)

    Huang, Si-Yi; Spurzem, Rainer; Berczik, Peter

    2016-01-01

    We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 - 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 - 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly.

  2. PIXIE3D: A Parallel, Implicit, eXtended MHD 3D Code

    NASA Astrophysics Data System (ADS)

    Chacon, Luis

    2006-10-01

    We report on the development of PIXIE3D, a 3D parallel, fully implicit Newton-Krylov extended MHD code in general curvilinear geometry. PIXIE3D employs a second-order, finite-volume-based spatial discretization that satisfies remarkable properties such as being conservative, solenoidal in the magnetic field to machine precision, non-dissipative, and linearly and nonlinearly stable in the absence of physical dissipation. PIXIE3D employs fully-implicit Newton-Krylov methods for the time advance. Currently, second-order implicit schemes such as Crank-Nicolson and BDF2 (2^nd order backward differentiation formula) are available. PIXIE3D is fully parallel (employs PETSc for parallelism), and exhibits excellent parallel scalability. A parallel, scalable, MG preconditioning strategy, based on physics-based preconditioning ideas, has been developed for resistive MHD, and is currently being extended to Hall MHD. In this poster, we will report on progress in the algorithmic formulation for extended MHD, as well as the the serial and parallel performance of PIXIE3D in a variety of problems and geometries. L. Chac'on, Comput. Phys. Comm., 163 (3), 143-171 (2004) L. Chac'on et al., J. Comput. Phys. 178 (1), 15- 36 (2002); J. Comput. Phys., 188 (2), 573-592 (2003) L. Chac'on, 32nd EPS Conf. Plasma Physics, Tarragona, Spain, 2005 L. Chac'on et al., 33rd EPS Conf. Plasma Physics, Rome, Italy, 2006

  3. PHoToNs–A parallel heterogeneous and threads oriented code for cosmological N-body simulation

    NASA Astrophysics Data System (ADS)

    Wang, Qiao; Cao, Zong-Yan; Gao, Liang; Chi, Xue-Bin; Meng, Chen; Wang, Jie; Wang, Long

    2018-06-01

    We introduce a new code for cosmological simulations, PHoToNs, which incorporates features for performing massive cosmological simulations on heterogeneous high performance computer (HPC) systems and threads oriented programming. PHoToNs adopts a hybrid scheme to compute gravitational force, with the conventional Particle-Mesh (PM) algorithm to compute the long-range force, the Tree algorithm to compute the short range force and the direct summation Particle-Particle (PP) algorithm to compute gravity from very close particles. A self-similar space filling a Peano-Hilbert curve is used to decompose the computing domain. Threads programming is advantageously used to more flexibly manage the domain communication, PM calculation and synchronization, as well as Dual Tree Traversal on the CPU+MIC platform. PHoToNs scales well and efficiency of the PP kernel achieves 68.6% of peak performance on MIC and 74.4% on CPU platforms. We also test the accuracy of the code against the much used Gadget-2 in the community and found excellent agreement.

  4. Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

    PubMed

    Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

    2004-09-01

    We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.

  5. Parallel diversifications of Cremastosperma and Mosannona (Annonaceae), tropical rainforest trees tracking Neogene upheaval of South America

    PubMed Central

    Maas, Paul J. M.; Melchers-Sharrott, Heleen

    2018-01-01

    Much of the immense present day biological diversity of Neotropical rainforests originated from the Miocene onwards, a period of geological and ecological upheaval in South America. We assess the impact of the Andean orogeny, drainage of Lake Pebas and closure of the Panama isthmus on two clades of tropical trees (Cremastosperma, ca 31 spp.; and Mosannona, ca 14 spp.; both Annonaceae). Phylogenetic inference revealed similar patterns of geographically restricted clades and molecular dating showed diversifications in the different areas occurred in parallel, with timing consistent with Andean vicariance and Central American geodispersal. Ecological niche modelling approaches show phylogenetically conserved niche differentiation, particularly within Cremastosperma. Niche similarity and recent common ancestry of Amazon and Guianan Mosannona species contrast with dissimilar niches and more distant ancestry of Amazon, Venezuelan and Guianan species of Cremastosperma, suggesting that this element of the similar patterns of disjunct distributions in the two genera is instead a biogeographic parallelism, with differing origins. The results provide further independent evidence for the importance of the Andean orogeny, the drainage of Lake Pebas, and the formation of links between South and Central America in the evolutionary history of Neotropical lowland rainforest trees. PMID:29410860

  6. Parallel diversifications of Cremastosperma and Mosannona (Annonaceae), tropical rainforest trees tracking Neogene upheaval of South America.

    PubMed

    Pirie, Michael D; Maas, Paul J M; Wilschut, Rutger A; Melchers-Sharrott, Heleen; Chatrou, Lars W

    2018-01-01

    Much of the immense present day biological diversity of Neotropical rainforests originated from the Miocene onwards, a period of geological and ecological upheaval in South America. We assess the impact of the Andean orogeny, drainage of Lake Pebas and closure of the Panama isthmus on two clades of tropical trees ( Cremastosperma , ca 31 spp.; and Mosannona , ca 14 spp.; both Annonaceae). Phylogenetic inference revealed similar patterns of geographically restricted clades and molecular dating showed diversifications in the different areas occurred in parallel, with timing consistent with Andean vicariance and Central American geodispersal. Ecological niche modelling approaches show phylogenetically conserved niche differentiation, particularly within Cremastosperma . Niche similarity and recent common ancestry of Amazon and Guianan Mosannona species contrast with dissimilar niches and more distant ancestry of Amazon, Venezuelan and Guianan species of Cremastosperma , suggesting that this element of the similar patterns of disjunct distributions in the two genera is instead a biogeographic parallelism, with differing origins. The results provide further independent evidence for the importance of the Andean orogeny, the drainage of Lake Pebas, and the formation of links between South and Central America in the evolutionary history of Neotropical lowland rainforest trees.

  7. Parallel object-oriented decision tree system

    DOEpatents

    Kamath,; Chandrika, Cantu-Paz [Dublin, CA; Erick, [Oakland, CA

    2006-02-28

    A data mining decision tree system that uncovers patterns, associations, anomalies, and other statistically significant structures in data by reading and displaying data files, extracting relevant features for each of the objects, and using a method of recognizing patterns among the objects based upon object features through a decision tree that reads the data, sorts the data if necessary, determines the best manner to split the data into subsets according to some criterion, and splits the data.

  8. Faster Bit-Parallel Algorithms for Unordered Pseudo-tree Matching and Tree Homeomorphism

    NASA Astrophysics Data System (ADS)

    Kaneta, Yusaku; Arimura, Hiroki

    In this paper, we consider the unordered pseudo-tree matching problem, which is a problem of, given two unordered labeled trees P and T, finding all occurrences of P in T via such many-one embeddings that preserve node labels and parent-child relationship. This problem is closely related to tree pattern matching problem for XPath queries with child axis only. If m > w , we present an efficient algorithm that solves the problem in O(nm log(w)/w) time using O(hm/w + mlog(w)/w) space and O(m log(w)) preprocessing on a unit-cost arithmetic RAM model with addition, where m is the number of nodes in P, n is the number of nodes in T, h is the height of T, and w is the word length. We also discuss a modification of our algorithm for the unordered tree homeomorphism problem, which corresponds to a tree pattern matching problem for XPath queries with descendant axis only.

  9. Implementation and Characterization of Three-Dimensional Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data Massively Parallel Supercomputers

    NASA Technical Reports Server (NTRS)

    Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.

    1995-01-01

    A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.

  10. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensionalmore » gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.« less

  11. Hybrid-view programming of nuclear fusion simulation code in the PGAS parallel programming language XcalableMP

    DOE PAGES

    Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...

    2016-06-01

    Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensionalmore » gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.« less

  12. Parallel adaptive wavelet collocation method for PDEs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nejadmalayeri, Alireza, E-mail: Alireza.Nejadmalayeri@gmail.com; Vezolainen, Alexei, E-mail: Alexei.Vezolainen@Colorado.edu; Brown-Dymkoski, Eric, E-mail: Eric.Browndymkoski@Colorado.edu

    2015-10-01

    A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allowsmore » fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.« less

  13. Concurrent computation of attribute filters on shared memory parallel machines.

    PubMed

    Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

    2008-10-01

    Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.

  14. Coding for Parallel Links to Maximize the Expected Value of Decodable Messages

    NASA Technical Reports Server (NTRS)

    Klimesh, Matthew A.; Chang, Christopher S.

    2011-01-01

    When multiple parallel communication links are available, it is useful to consider link-utilization strategies that provide tradeoffs between reliability and throughput. Interesting cases arise when there are three or more available links. Under the model considered, the links have known probabilities of being in working order, and each link has a known capacity. The sender has a number of messages to send to the receiver. Each message has a size and a value (i.e., a worth or priority). Messages may be divided into pieces arbitrarily, and the value of each piece is proportional to its size. The goal is to choose combinations of messages to send on the links so that the expected value of the messages decodable by the receiver is maximized. There are three parts to the innovation: (1) Applying coding to parallel links under the model; (2) Linear programming formulation for finding the optimal combinations of messages to send on the links; and (3) Algorithms for assisting in finding feasible combinations of messages, as support for the linear programming formulation. There are similarities between this innovation and methods developed in the field of network coding. However, network coding has generally been concerned with either maximizing throughput in a fixed network, or robust communication of a fixed volume of data. In contrast, under this model, the throughput is expected to vary depending on the state of the network. Examples of error-correcting codes that are useful under this model but which are not needed under previous models have been found. This model can represent either a one-shot communication attempt, or a stream of communications. Under the one-shot model, message sizes and link capacities are quantities of information (e.g., measured in bits), while under the communications stream model, message sizes and link capacities are information rates (e.g., measured in bits/second). This work has the potential to increase the value of data returned from

  15. SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80

    NASA Astrophysics Data System (ADS)

    Kamat, Manohar P.; Watson, Brian C.

    1992-02-01

    The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billions of floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization as on Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements.

  16. SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX/80

    NASA Technical Reports Server (NTRS)

    Kamat, Manohar P.; Watson, Brian C.

    1992-01-01

    The results of a research activity aimed at providing a finite element capability for analyzing turbo-machinery bladed-disk assemblies in a vector/parallel processing environment are summarized. Analysis of aircraft turbofan engines is very computationally intensive. The performance limit of modern day computers with a single processing unit was estimated at 3 billions of floating point operations per second (3 gigaflops). In view of this limit of a sequential unit, performance rates higher than 3 gigaflops can be achieved only through vectorization and/or parallelization as on Alliant FX/80. Accordingly, the efforts of this critically needed research were geared towards developing and evaluating parallel finite element methods for static and vibration analysis. A special purpose code, named with the acronym SAPNEW, performs static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements.

  17. Parallel evolution of chordate cis-regulatory code for development.

    PubMed

    Doglio, Laura; Goode, Debbie K; Pelleri, Maria C; Pauls, Stefan; Frabetti, Flavia; Shimeld, Sebastian M; Vavouri, Tanya; Elgar, Greg

    2013-11-01

    Urochordates are the closest relatives of vertebrates and at the larval stage, possess a characteristic bilateral chordate body plan. In vertebrates, the genes that orchestrate embryonic patterning are in part regulated by highly conserved non-coding elements (CNEs), yet these elements have not been identified in urochordate genomes. Consequently the evolution of the cis-regulatory code for urochordate development remains largely uncharacterised. Here, we use genome-wide comparisons between C. intestinalis and C. savignyi to identify putative urochordate cis-regulatory sequences. Ciona conserved non-coding elements (ciCNEs) are associated with largely the same key regulatory genes as vertebrate CNEs. Furthermore, some of the tested ciCNEs are able to activate reporter gene expression in both zebrafish and Ciona embryos, in a pattern that at least partially overlaps that of the gene they associate with, despite the absence of sequence identity. We also show that the ability of a ciCNE to up-regulate gene expression in vertebrate embryos can in some cases be localised to short sub-sequences, suggesting that functional cross-talk may be defined by small regions of ancestral regulatory logic, although functional sub-sequences may also be dispersed across the whole element. We conclude that the structure and organisation of cis-regulatory modules is very different between vertebrates and urochordates, reflecting their separate evolutionary histories. However, functional cross-talk still exists because the same repertoire of transcription factors has likely guided their parallel evolution, exploiting similar sets of binding sites but in different combinations.

  18. Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding.

    PubMed

    Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A

    2016-08-12

    With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for the real world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantages of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications.

  19. Jet formation and equatorial superrotation in Jupiter's atmosphere: Numerical modelling using a new efficient parallel code

    NASA Astrophysics Data System (ADS)

    Rivier, Leonard Gilles

    Using an efficient parallel code solving the primitive equations of atmospheric dynamics, the jet structure of a Jupiter like atmosphere is modeled. In the first part of this thesis, a parallel spectral code solving both the shallow water equations and the multi-level primitive equations of atmospheric dynamics is built. The implementation of this code called BOB is done so that it runs effectively on an inexpensive cluster of workstations. A one dimensional decomposition and transposition method insuring load balancing among processes is used. The Legendre transform is cache-blocked. A "compute on the fly" of the Legendre polynomials used in the spectral method produces a lower memory footprint and enables high resolution runs on relatively small memory machines. Performance studies are done using a cluster of workstations located at the National Center for Atmospheric Research (NCAR). BOB performances are compared to the parallel benchmark code PSTSWM and the dynamical core of NCAR's CCM3.6.6. In both cases, the comparison favors BOB. In the second part of this thesis, the primitive equation version of the code described in part I is used to study the formation of organized zonal jets and equatorial superrotation in a planetary atmosphere where the parameters are chosen to best model the upper atmosphere of Jupiter. Two levels are used in the vertical and only large scale forcing is present. The model is forced towards a baroclinically unstable flow, so that eddies are generated by baroclinic instability. We consider several types of forcing, acting on either the temperature or the momentum field. We show that only under very specific parametric conditions, zonally elongated structures form and persist resembling the jet structure observed near the cloud level top (1 bar) on Jupiter. We also study the effect of an equatorial heat source, meant to be a crude representation of the effect of the deep convective planetary interior onto the outer atmospheric layer. We

  20. Categorizing Ideas about Trees: A Tree of Trees

    PubMed Central

    Fisler, Marie; Lecointre, Guillaume

    2013-01-01

    The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a “tree of trees.” Then, we categorize schools of tree-representations. Classical schools like “cladists” and “pheneticists” are recovered but others are not: “gradists” are separated into two blocks, one of them being called here “grade theoreticians.” We propose new interesting categories like the “buffonian school,” the “metaphoricians,” and those using “strictly genealogical classifications.” We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization. PMID:23950877

  1. Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

    NASA Astrophysics Data System (ADS)

    Marx, Alain; Lütjens, Hinrich

    2017-03-01

    A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.

  2. OpenGeoSys-GEMS: Hybrid parallelization of a reactive transport code with MPI and threads

    NASA Astrophysics Data System (ADS)

    Kosakowski, G.; Kulik, D. A.; Shao, H.

    2012-04-01

    OpenGeoSys-GEMS is a generic purpose reactive transport code based on the operator splitting approach. The code couples the Finite-Element groundwater flow and multi-species transport modules of the OpenGeoSys (OGS) project (http://www.ufz.de/index.php?en=18345) with the GEM-Selektor research package to model thermodynamic equilibrium of aquatic (geo)chemical systems utilizing the Gibbs Energy Minimization approach (http://gems.web.psi.ch/). The combination of OGS and the GEM-Selektor kernel (GEMS3K) is highly flexible due to the object-oriented modular code structures and the well defined (memory based) data exchange modules. Like other reactive transport codes, the practical applicability of OGS-GEMS is often hampered by the long calculation time and large memory requirements. • For realistic geochemical systems which might include dozens of mineral phases and several (non-ideal) solid solutions the time needed to solve the chemical system with GEMS3K may increase exceptionally. • The codes are coupled in a sequential non-iterative loop. In order to keep the accuracy, the time step size is restricted. In combination with a fine spatial discretization the time step size may become very small which increases calculation times drastically even for small 1D problems. • The current version of OGS is not optimized for memory use and the MPI version of OGS does not distribute data between nodes. Even for moderately small 2D problems the number of MPI processes that fit into memory of up-to-date workstations or HPC hardware is limited. One strategy to overcome the above mentioned restrictions of OGS-GEMS is to parallelize the coupled code. For OGS a parallelized version already exists. It is based on a domain decomposition method implemented with MPI and provides a parallel solver for fluid and mass transport processes. In the coupled code, after solving fluid flow and solute transport, geochemical calculations are done in form of a central loop over all finite

  3. Fortran code for SU(3) lattice gauge theory with and without MPI checkerboard parallelization

    NASA Astrophysics Data System (ADS)

    Berg, Bernd A.; Wu, Hao

    2012-10-01

    We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions (depending on the parameters set). For updating, the Cabibbo-Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors. Program summary Program title: STMC2LSU3MPI Catalogue identifier: AEMJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 26666 No. of bytes in distributed program, including test data, etc.: 233126 Distribution format: tar.gz Programming language: Fortran 77 compatible with the use of Fortran 90/95 compilers, in part with MPI extensions. Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions. Operating system: Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0, Centos 5.3 with OpenMPI + gfortran 4.1.2, Cray XT4 with MPICH2 + pgf90 11.2-0. Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions. Number of processors used: 2 to 11664 RAM: 200 Mega bytes per process. Classification: 11

  4. Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding

    PubMed Central

    Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A.

    2016-01-01

    With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for the real world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantages of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications. PMID:27515908

  5. Parallel Monte Carlo transport modeling in the context of a time-dependent, three-dimensional multi-physics code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Procassini, R.J.

    1997-12-31

    The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution ofmore » particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.« less

  6. Hybrid MPI-OpenMP Parallelism in the ONETEP Linear-Scaling Electronic Structure Code: Application to the Delamination of Cellulose Nanofibrils.

    PubMed

    Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton

    2014-11-11

    We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.

  7. Efficient random access high resolution region-of-interest (ROI) image retrieval using backward coding of wavelet trees (BCWT)

    NASA Astrophysics Data System (ADS)

    Corona, Enrique; Nutter, Brian; Mitra, Sunanda; Guo, Jiangling; Karp, Tanja

    2008-03-01

    Efficient retrieval of high quality Regions-Of-Interest (ROI) from high resolution medical images is essential for reliable interpretation and accurate diagnosis. Random access to high quality ROI from codestreams is becoming an essential feature in many still image compression applications, particularly in viewing diseased areas from large medical images. This feature is easier to implement in block based codecs because of the inherent spatial independency of the code blocks. This independency implies that the decoding order of the blocks is unimportant as long as the position for each is properly identified. In contrast, wavelet-tree based codecs naturally use some interdependency that exploits the decaying spectrum model of the wavelet coefficients. Thus one must keep track of the decoding order from level to level with such codecs. We have developed an innovative multi-rate image subband coding scheme using "Backward Coding of Wavelet Trees (BCWT)" which is fast, memory efficient, and resolution scalable. It offers far less complexity than many other existing codecs including both, wavelet-tree, and block based algorithms. The ROI feature in BCWT is implemented through a transcoder stage that generates a new BCWT codestream containing only the information associated with the user-defined ROI. This paper presents an efficient technique that locates a particular ROI within the BCWT coded domain, and decodes it back to the spatial domain. This technique allows better access and proper identification of pathologies in high resolution images since only a small fraction of the codestream is required to be transmitted and analyzed.

  8. ls1 mardyn: The Massively Parallel Molecular Dynamics Code for Large Systems.

    PubMed

    Niethammer, Christoph; Becker, Stefan; Bernreuther, Martin; Buchholz, Martin; Eckhardt, Wolfgang; Heinecke, Alexander; Werth, Stephan; Bungartz, Hans-Joachim; Glass, Colin W; Hasse, Hans; Vrabec, Jadran; Horsch, Martin

    2014-10-14

    The molecular dynamics simulation code ls1 mardyn is presented. It is a highly scalable code, optimized for massively parallel execution on supercomputing architectures and currently holds the world record for the largest molecular simulation with over four trillion particles. It enables the application of pair potentials to length and time scales that were previously out of scope for molecular dynamics simulation. With an efficient dynamic load balancing scheme, it delivers high scalability even for challenging heterogeneous configurations. Presently, multicenter rigid potential models based on Lennard-Jones sites, point charges, and higher-order polarities are supported. Due to its modular design, ls1 mardyn can be extended to new physical models, methods, and algorithms, allowing future users to tailor it to suit their respective needs. Possible applications include scenarios with complex geometries, such as fluids at interfaces, as well as nonequilibrium molecular dynamics simulation of heat and mass transfer.

  9. Reactor Dosimetry Applications Using RAPTOR-M3G:. a New Parallel 3-D Radiation Transport Code

    NASA Astrophysics Data System (ADS)

    Longoni, Gianluca; Anderson, Stanwood L.

    2009-08-01

    The numerical solution of the Linearized Boltzmann Equation (LBE) via the Discrete Ordinates method (SN) requires extensive computational resources for large 3-D neutron and gamma transport applications due to the concurrent discretization of the angular, spatial, and energy domains. This paper will discuss the development RAPTOR-M3G (RApid Parallel Transport Of Radiation - Multiple 3D Geometries), a new 3-D parallel radiation transport code, and its application to the calculation of ex-vessel neutron dosimetry responses in the cavity of a commercial 2-loop Pressurized Water Reactor (PWR). RAPTOR-M3G is based domain decomposition algorithms, where the spatial and angular domains are allocated and processed on multi-processor computer architectures. As compared to traditional single-processor applications, this approach reduces the computational load as well as the memory requirement per processor, yielding an efficient solution methodology for large 3-D problems. Measured neutron dosimetry responses in the reactor cavity air gap will be compared to the RAPTOR-M3G predictions. This paper is organized as follows: Section 1 discusses the RAPTOR-M3G methodology; Section 2 describes the 2-loop PWR model and the numerical results obtained. Section 3 addresses the parallel performance of the code, and Section 4 concludes this paper with final remarks and future work.

  10. A parallel decision tree-based method for user authentication based on keystroke patterns.

    PubMed

    Sheng, Yong; Phoha, Vir V; Rovnyak, Steven M

    2005-08-01

    We propose a Monte Carlo approach to attain sufficient training data, a splitting method to improve effectiveness, and a system composed of parallel decision trees (DTs) to authenticate users based on keystroke patterns. For each user, approximately 19 times as much simulated data was generated to complement the 387 vectors of raw data. The training set, including raw and simulated data, is split into four subsets. For each subset, wavelet transforms are performed to obtain a total of eight training subsets for each user. Eight DTs are thus trained using the eight subsets. A parallel DT is constructed for each user, which contains all eight DTs with a criterion for its output that it authenticates the user if at least three DTs do so; otherwise it rejects the user. Training and testing data were collected from 43 users who typed the exact same string of length 37 nine consecutive times to provide data for training purposes. The users typed the same string at various times over a period from November through December 2002 to provide test data. The average false reject rate was 9.62% and the average false accept rate was 0.88%.

  11. High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

    ERIC Educational Resources Information Center

    von Davier, Matthias

    2016-01-01

    This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…

  12. Hybrid parallel code acceleration methods in full-core reactor physics calculations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Courau, T.; Plagne, L.; Ponicot, A.

    2012-07-01

    When dealing with nuclear reactor calculation schemes, the need for three dimensional (3D) transport-based reference solutions is essential for both validation and optimization purposes. Considering a benchmark problem, this work investigates the potential of discrete ordinates (Sn) transport methods applied to 3D pressurized water reactor (PWR) full-core calculations. First, the benchmark problem is described. It involves a pin-by-pin description of a 3D PWR first core, and uses a 8-group cross-section library prepared with the DRAGON cell code. Then, a convergence analysis is performed using the PENTRAN parallel Sn Cartesian code. It discusses the spatial refinement and the associated angular quadraturemore » required to properly describe the problem physics. It also shows that initializing the Sn solution with the EDF SPN solver COCAGNE reduces the number of iterations required to converge by nearly a factor of 6. Using a best estimate model, PENTRAN results are then compared to multigroup Monte Carlo results obtained with the MCNP5 code. Good consistency is observed between the two methods (Sn and Monte Carlo), with discrepancies that are less than 25 pcm for the k{sub eff}, and less than 2.1% and 1.6% for the flux at the pin-cell level and for the pin-power distribution, respectively. (authors)« less

  13. Tree ferns: monophyletic groups and their relationships as revealed by four protein-coding plastid loci.

    PubMed

    Korall, Petra; Pryer, Kathleen M; Metzgar, Jordan S; Schneider, Harald; Conant, David S

    2006-06-01

    Tree ferns are a well-established clade within leptosporangiate ferns. Most of the 600 species (in seven families and 13 genera) are arborescent, but considerable morphological variability exists, spanning the giant scaly tree ferns (Cyatheaceae), the low, erect plants (Plagiogyriaceae), and the diminutive endemics of the Guayana Highlands (Hymenophyllopsidaceae). In this study, we investigate phylogenetic relationships within tree ferns based on analyses of four protein-coding, plastid loci (atpA, atpB, rbcL, and rps4). Our results reveal four well-supported clades, with genera of Dicksoniaceae (sensu ) interspersed among them: (A) (Loxomataceae, (Culcita, Plagiogyriaceae)), (B) (Calochlaena, (Dicksonia, Lophosoriaceae)), (C) Cibotium, and (D) Cyatheaceae, with Hymenophyllopsidaceae nested within. How these four groups are related to one other, to Thyrsopteris, or to Metaxyaceae is weakly supported. Our results show that Dicksoniaceae and Cyatheaceae, as currently recognised, are not monophyletic and new circumscriptions for these families are needed.

  14. Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

    NASA Astrophysics Data System (ADS)

    Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro

    2016-08-01

    We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 107) to 300 ms (N = 109). These are currently limited by the time for the calculation of the domain decomposition and communication

  15. SAPNEW: Parallel finite element code for thin shell structures on the Alliant FX-80

    NASA Astrophysics Data System (ADS)

    Kamat, Manohar P.; Watson, Brian C.

    1992-11-01

    The finite element method has proven to be an invaluable tool for analysis and design of complex, high performance systems, such as bladed-disk assemblies in aircraft turbofan engines. However, as the problem size increase, the computation time required by conventional computers can be prohibitively high. Parallel processing computers provide the means to overcome these computation time limits. This report summarizes the results of a research activity aimed at providing a finite element capability for analyzing turbomachinery bladed-disk assemblies in a vector/parallel processing environment. A special purpose code, named with the acronym SAPNEW, has been developed to perform static and eigen analysis of multi-degree-of-freedom blade models built-up from flat thin shell elements. SAPNEW provides a stand alone capability for static and eigen analysis on the Alliant FX/80, a parallel processing computer. A preprocessor, named with the acronym NTOS, has been developed to accept NASTRAN input decks and convert them to the SAPNEW format to make SAPNEW more readily used by researchers at NASA Lewis Research Center.

  16. Temporal parallelization of edge plasma simulations using the parareal algorithm and the SOLPS code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samaddar, Debasmita; Coster, D. P.; Bonnin, X.

    We show that numerical modelling of edge plasma physics may be successfully parallelized in time. The parareal algorithm has been employed for this purpose and the SOLPS code package coupling the B2.5 finite-volume fluid plasma solver with the kinetic Monte-Carlo neutral code Eirene has been used as a test bed. The complex dynamics of the plasma and neutrals in the scrape-off layer (SOL) region makes this a unique application. It is demonstrated that a significant computational gain (more than an order of magnitude) may be obtained with this technique. The use of the IPS framework for event-based parareal implementation optimizesmore » resource utilization and has been shown to significantly contribute to the computational gain.« less

  17. Temporal parallelization of edge plasma simulations using the parareal algorithm and the SOLPS code

    DOE PAGES

    Samaddar, Debasmita; Coster, D. P.; Bonnin, X.; ...

    2017-07-31

    We show that numerical modelling of edge plasma physics may be successfully parallelized in time. The parareal algorithm has been employed for this purpose and the SOLPS code package coupling the B2.5 finite-volume fluid plasma solver with the kinetic Monte-Carlo neutral code Eirene has been used as a test bed. The complex dynamics of the plasma and neutrals in the scrape-off layer (SOL) region makes this a unique application. It is demonstrated that a significant computational gain (more than an order of magnitude) may be obtained with this technique. The use of the IPS framework for event-based parareal implementation optimizesmore » resource utilization and has been shown to significantly contribute to the computational gain.« less

  18. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2014-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  19. Bilingual parallel programming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Foster, I.; Overbeek, R.

    1990-01-01

    Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seem to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach providesmore » and effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.« less

  20. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  1. Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

    NASA Astrophysics Data System (ADS)

    Russkova, Tatiana V.

    2017-11-01

    One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is the parallel technology. A new algorithm oriented to parallel execution on the CUDA-enabled NVIDIA graphics processor is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both a vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions including continuous singlelayered and multilayered clouds, and selective molecular absorption are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.

  2. Legacy Code Modernization

    NASA Technical Reports Server (NTRS)

    Hribar, Michelle R.; Frumkin, Michael; Jin, Haoqiang; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    Over the past decade, high performance computing has evolved rapidly; systems based on commodity microprocessors have been introduced in quick succession from at least seven vendors/families. Porting codes to every new architecture is a difficult problem; in particular, here at NASA, there are many large CFD applications that are very costly to port to new machines by hand. The LCM ("Legacy Code Modernization") Project is the development of an integrated parallelization environment (IPE) which performs the automated mapping of legacy CFD (Fortran) applications to state-of-the-art high performance computers. While most projects to port codes focus on the parallelization of the code, we consider porting to be an iterative process consisting of several steps: 1) code cleanup, 2) serial optimization,3) parallelization, 4) performance monitoring and visualization, 5) intelligent tools for automated tuning using performance prediction and 6) machine specific optimization. The approach for building this parallelization environment is to build the components for each of the steps simultaneously and then integrate them together. The demonstration will exhibit our latest research in building this environment: 1. Parallelizing tools and compiler evaluation. 2. Code cleanup and serial optimization using automated scripts 3. Development of a code generator for performance prediction 4. Automated partitioning 5. Automated insertion of directives. These demonstrations will exhibit the effectiveness of an automated approach for all the steps involved with porting and tuning a legacy code application for a new architecture.

  3. Implementation, capabilities, and benchmarking of Shift, a massively parallel Monte Carlo radiation transport code

    DOE PAGES

    Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; ...

    2015-12-21

    This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Somemore » specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000 ® problems. These benchmark and scaling studies show promising results.« less

  4. OSIRIS - an object-oriented parallel 3D PIC code for modeling laser and particle beam-plasma interaction

    NASA Astrophysics Data System (ADS)

    Hemker, Roy

    1999-11-01

    The advances in computational speed make it now possible to do full 3D PIC simulations of laser plasma and beam plasma interactions, but at the same time the increased complexity of these problems makes it necessary to apply modern approaches like object oriented programming to the development of simulation codes. We report here on our progress in developing an object oriented parallel 3D PIC code using Fortran 90. In its current state the code contains algorithms for 1D, 2D, and 3D simulations in cartesian coordinates and for 2D cylindrically-symmetric geometry. For all of these algorithms the code allows for a moving simulation window and arbitrary domain decomposition for any number of dimensions. Recent 3D simulation results on the propagation of intense laser and electron beams through plasmas will be presented.

  5. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2015-11-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fix boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.

  6. Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Candel, A; Kabel, A.; Lee, L.

    In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.

  7. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele

    2001-01-01

    This viewgraph presentation provides information on support sources available for the automatic parallelization of computer program. CAPTools, a support tool developed at the University of Greenwich, transforms, with user guidance, existing sequential Fortran code into parallel message passing code. Comparison routines are then run for debugging purposes, in essence, ensuring that the code transformation was accurate.

  8. Multi-Zone Liquid Thrust Chamber Performance Code with Domain Decomposition for Parallel Processing

    NASA Technical Reports Server (NTRS)

    Navaz, Homayun K.

    2002-01-01

    -equation turbulence model, and two-phase flow. To overcome these limitations, the LTCP code is rewritten to include the multi-zone capability with domain decomposition that makes it suitable for parallel processing, i.e., enabling the code to run every zone or sub-domain on a separate processor. This can reduce the run time by a factor of 6 to 8, depending on the problem.

  9. Implementation of a flexible and scalable particle-in-cell method for massively parallel computations in the mantle convection code ASPECT

    NASA Astrophysics Data System (ADS)

    Gassmöller, Rene; Bangerth, Wolfgang

    2016-04-01

    Particle-in-cell methods have a long history and many applications in geodynamic modelling of mantle convection, lithospheric deformation and crustal dynamics. They are primarily used to track material information, the strain a material has undergone, the pressure-temperature history a certain material region has experienced, or the amount of volatiles or partial melt present in a region. However, their efficient parallel implementation - in particular combined with adaptive finite-element meshes - is complicated due to the complex communication patterns and frequent reassignment of particles to cells. Consequently, many current scientific software packages accomplish this efficient implementation by specifically designing particle methods for a single purpose, like the advection of scalar material properties that do not evolve over time (e.g., for chemical heterogeneities). Design choices for particle integration, data storage, and parallel communication are then optimized for this single purpose, making the code relatively rigid to changing requirements. Here, we present the implementation of a flexible, scalable and efficient particle-in-cell method for massively parallel finite-element codes with adaptively changing meshes. Using a modular plugin structure, we allow maximum flexibility of the generation of particles, the carried tracer properties, the advection and output algorithms, and the projection of properties to the finite-element mesh. We present scaling tests ranging up to tens of thousands of cores and tens of billions of particles. Additionally, we discuss efficient load-balancing strategies for particles in adaptive meshes with their strengths and weaknesses, local particle-transfer between parallel subdomains utilizing existing communication patterns from the finite element mesh, and the use of established parallel output algorithms like the HDF5 library. Finally, we show some relevant particle application cases, compare our implementation to a

  10. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining

  11. OpenSWPC: an open-source integrated parallel simulation code for modeling seismic wave propagation in 3D heterogeneous viscoelastic media

    NASA Astrophysics Data System (ADS)

    Maeda, Takuto; Takemura, Shunsuke; Furumura, Takashi

    2017-07-01

    We have developed an open-source software package, Open-source Seismic Wave Propagation Code (OpenSWPC), for parallel numerical simulations of seismic wave propagation in 3D and 2D (P-SV and SH) viscoelastic media based on the finite difference method in local-to-regional scales. This code is equipped with a frequency-independent attenuation model based on the generalized Zener body and an efficient perfectly matched layer for absorbing boundary condition. A hybrid-style programming using OpenMP and the Message Passing Interface (MPI) is adopted for efficient parallel computation. OpenSWPC has wide applicability for seismological studies and great portability to allowing excellent performance from PC clusters to supercomputers. Without modifying the code, users can conduct seismic wave propagation simulations using their own velocity structure models and the necessary source representations by specifying them in an input parameter file. The code has various modes for different types of velocity structure model input and different source representations such as single force, moment tensor and plane-wave incidence, which can easily be selected via the input parameters. Widely used binary data formats, the Network Common Data Form (NetCDF) and the Seismic Analysis Code (SAC) are adopted for the input of the heterogeneous structure model and the outputs of the simulation results, so users can easily handle the input/output datasets. All codes are written in Fortran 2003 and are available with detailed documents in a public repository.[Figure not available: see fulltext.

  12. DLRS: gene tree evolution in light of a species tree.

    PubMed

    Sjöstrand, Joel; Sennblad, Bengt; Arvestad, Lars; Lagergren, Jens

    2012-11-15

    PrIME-DLRS (or colloquially: 'Delirious') is a phylogenetic software tool to simultaneously infer and reconcile a gene tree given a species tree. It accounts for duplication and loss events, a relaxed molecular clock and is intended for the study of homologous gene families, for example in a comparative genomics setting involving multiple species. PrIME-DLRS uses a Bayesian MCMC framework, where the input is a known species tree with divergence times and a multiple sequence alignment, and the output is a posterior distribution over gene trees and model parameters. PrIME-DLRS is available for Java SE 6+ under the New BSD License, and JAR files and source code can be downloaded from http://code.google.com/p/jprime/. There is also a slightly older C++ version available as a binary package for Ubuntu, with download instructions at http://prime.sbc.su.se. The C++ source code is available upon request. joel.sjostrand@scilifelab.se or jens.lagergren@scilifelab.se. PrIME-DLRS is based on a sound probabilistic model (Åkerborg et al., 2009) and has been thoroughly validated on synthetic and biological datasets (Supplementary Material online).

  13. Parallelized modelling and solution scheme for hierarchically scaled simulations

    NASA Technical Reports Server (NTRS)

    Padovan, Joe

    1995-01-01

    This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The Tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers are capable of solving, are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that by combining several of these techniques that a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.

  14. Force user's manual: A portable, parallel FORTRAN

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

    1990-01-01

    The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed.

  15. Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

    NASA Technical Reports Server (NTRS)

    Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.

  16. Parallel Event Analysis Under Unix

    NASA Astrophysics Data System (ADS)

    Looney, S.; Nilsson, B. S.; Oest, T.; Pettersson, T.; Ranjard, F.; Thibonnier, J.-P.

    The ALEPH experiment at LEP, the CERN CN division and Digital Equipment Corp. have, in a joint project, developed a parallel event analysis system. The parallel physics code is identical to ALEPH's standard analysis code, ALPHA, only the organisation of input/output is changed. The user may switch between sequential and parallel processing by simply changing one input "card". The initial implementation runs on an 8-node DEC 3000/400 farm, using the PVM software, and exhibits a near-perfect speed-up linearity, reducing the turn-around time by a factor of 8.

  17. PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit

    NASA Technical Reports Server (NTRS)

    MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles

    1999-01-01

    In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.

  18. Parallelization of GeoClaw code for modeling geophysical flows with adaptive mesh refinement on many-core systems

    USGS Publications Warehouse

    Zhang, S.; Yuen, D.A.; Zhu, A.; Song, S.; George, D.L.

    2011-01-01

    We parallelized the GeoClaw code on one-level grid using OpenMP in March, 2011 to meet the urgent need of simulating tsunami waves at near-shore from Tohoku 2011 and achieved over 75% of the potential speed-up on an eight core Dell Precision T7500 workstation [1]. After submitting that work to SC11 - the International Conference for High Performance Computing, we obtained an unreleased OpenMP version of GeoClaw from David George, who developed the GeoClaw code as part of his PH.D thesis. In this paper, we will show the complementary characteristics of the two approaches used in parallelizing GeoClaw and the speed-up obtained by combining the advantage of each of the two individual approaches with adaptive mesh refinement (AMR), demonstrating the capabilities of running GeoClaw efficiently on many-core systems. We will also show a novel simulation of the Tohoku 2011 Tsunami waves inundating the Sendai airport and Fukushima Nuclear Power Plants, over which the finest grid distance of 20 meters is achieved through a 4-level AMR. This simulation yields quite good predictions about the wave-heights and travel time of the tsunami waves. ?? 2011 IEEE.

  19. Spin wave based parallel logic operations for binary data coded with domain walls

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Urazuka, Y.; Oyabu, S.; Chen, H.

    2014-05-07

    We numerically investigate the feasibility of spin wave (SW) based parallel logic operations, where the phase of SW packet (SWP) is exploited as a state variable and the phase shift caused by the interaction with domain wall (DW) is utilized as a logic inversion functionality. A designed functional element consists of parallel ferromagnetic nanowires (6 nm-thick, 36 nm-width, 5120 nm-length, and 200 nm separation) with the perpendicular magnetization and sub-μm scale overlaid conductors. The logic outputs for binary data, coded with the existence (“1”) or absence (“0”) of the DW, are inductively read out from interferometric aspect of the superposed SWPs, one of themmore » propagating through the stored data area. A practical exclusive-or operation, based on 2π periodicity in the phase logic, is demonstrated for the individual nanowire with an order of different output voltage V{sub out}, depending on the logic output for the stored data. The inductive output from the two nanowires exhibits well defined three different signal levels, corresponding to the information distance (Hamming distance) between 2-bit data stored in the multiple nanowires.« less

  20. MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions

    NASA Astrophysics Data System (ADS)

    Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.

    2003-10-01

    The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas require nonlinear kinetic models and massive parallelization. We use Massage Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov Poisson system of equations on an 8 node dual processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involves global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)

  1. Parallel community climate model: Description and user`s guide

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain intomore » geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.« less

  2. Breakdown of Spatial Parallel Coding in Children's Drawing

    ERIC Educational Resources Information Center

    De Bruyn, Bart; Davis, Alyson

    2005-01-01

    When drawing real scenes or copying simple geometric figures young children are highly sensitive to parallel cues and use them effectively. However, this sensitivity can break down in surprisingly simple tasks such as copying a single line where robust directional errors occur despite the presence of parallel cues. Before we can conclude that this…

  3. Global Magnetohydrodynamic Simulation Using High Performance FORTRAN on Parallel Computers

    NASA Astrophysics Data System (ADS)

    Ogino, T.

    High Performance Fortran (HPF) is one of modern and common techniques to achieve high performance parallel computation. We have translated a 3-dimensional magnetohydrodynamic (MHD) simulation code of the Earth's magnetosphere from VPP Fortran to HPF/JA on the Fujitsu VPP5000/56 vector-parallel supercomputer and the MHD code was fully vectorized and fully parallelized in VPP Fortran. The entire performance and capability of the HPF MHD code could be shown to be almost comparable to that of VPP Fortran. A 3-dimensional global MHD simulation of the earth's magnetosphere was performed at a speed of over 400 Gflops with an efficiency of 76.5 VPP5000/56 in vector and parallel computation that permitted comparison with catalog values. We have concluded that fluid and MHD codes that are fully vectorized and fully parallelized in VPP Fortran can be translated with relative ease to HPF/JA, and a code in HPF/JA may be expected to perform comparably to the same code written in VPP Fortran.

  4. Multitasking TORT under UNICOS: Parallel performance models and measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnett, A.; Azmy, Y.Y.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  5. Parallel VLSI architecture emulation and the organization of APSA/MPP

    NASA Technical Reports Server (NTRS)

    Odonnell, John T.

    1987-01-01

    The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

  6. Parallelization of ARC3D with Computer-Aided Tools

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SCSI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, Cesspools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably well performance, for example, having a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts for the automated parallelization process although user intervention in many of these parts are still necessary. Nevertheless, development and improvement of useful software tools, such as Cesspools, can help trim down many tedious parallelization details and improve the processing efficiency.

  7. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    DOE PAGES

    Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...

    2015-01-01

    This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were donemore » using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.« less

  8. Study of Linear and Nonlinear Waves in Plasma Crystals Using the Box_Tree Code

    NASA Astrophysics Data System (ADS)

    Qiao, K.; Hyde, T.; Barge, L.

    Dusty plasma systems play an important role in both astrophysical and planetary environments (protostellar clouds, planetary ring systems and magnetospheres, cometary environments) and laboratory settings (plasma processing or nanofabrication). Recent research has focussed on defining (both theoretically and experimentally) the different types of wave mode propagations, which are possible within plasma crystals. This is an important topic since several of the fundamental quantities for characterizing such crystals can be obtained directly from an analysis of the wave propagation/dispersion. This paper will discuss a num rical model fore 2D-monolayer plasma crystals, which was established using a modified box tree code. Different wave modes were examined by adding a time dependent potential to the code designed to simulate a laser radiation perturbation as has been applied in many experiments. Both linear waves (for example, longitudinal and transverse dust lattice waves) and nonlinear waves (solitary waves) are examined. The output data will also be compared with the results of corresponding experiments and discussed.

  9. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  10. Utilizing GPUs to Accelerate Turbomachinery CFD Codes

    NASA Technical Reports Server (NTRS)

    MacCalla, Weylin; Kulkarni, Sameer

    2016-01-01

    GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.

  11. Rubus: A compiler for seamless and extensible parallelism.

    PubMed

    Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor

    2017-01-01

    Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been

  12. Rubus: A compiler for seamless and extensible parallelism

    PubMed Central

    Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

    2017-01-01

    Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been

  13. Locating hardware faults in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-01-12

    Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

  14. TreSpEx—Detection of Misleading Signal in Phylogenetic Reconstructions Based on Tree Information

    PubMed Central

    Struck, Torsten H

    2014-01-01

    Phylogenies of species or genes are commonplace nowadays in many areas of comparative biological studies. However, for phylogenetic reconstructions one must refer to artificial signals such as paralogy, long-branch attraction, saturation, or conflict between different datasets. These signals might eventually mislead the reconstruction even in phylogenomic studies employing hundreds of genes. Unfortunately, there has been no program allowing the detection of such effects in combination with an implementation into automatic process pipelines. TreSpEx (Tree Space Explorer) now combines different approaches (including statistical tests), which utilize tree-based information like nodal support or patristic distances (PDs) to identify misleading signals. The program enables the parallel analysis of hundreds of trees and/or predefined gene partitions, and being command-line driven, it can be integrated into automatic process pipelines. TreSpEx is implemented in Perl and supported on Linux, Mac OS X, and MS Windows. Source code, binaries, and additional material are freely available at http://www.annelida.de/research/bioinformatics/software.html. PMID:24701118

  15. Highly fault-tolerant parallel computation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Spielman, D.A.

    We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations inmore » which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log{sup O(l)} w processors and time t log{sup O(l)} w. The failure probability of the computation will be at most t {center_dot} exp(-w{sup 1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log{sup O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.« less

  16. Coset Codes Viewed as Terminated Convolutional Codes

    NASA Technical Reports Server (NTRS)

    Fossorier, Marc P. C.; Lin, Shu

    1996-01-01

    In this paper, coset codes are considered as terminated convolutional codes. Based on this approach, three new general results are presented. First, it is shown that the iterative squaring construction can equivalently be defined from a convolutional code whose trellis terminates. This convolutional code determines a simple encoder for the coset code considered, and the state and branch labelings of the associated trellis diagram become straightforward. Also, from the generator matrix of the code in its convolutional code form, much information about the trade-off between the state connectivity and complexity at each section, and the parallel structure of the trellis, is directly available. Based on this generator matrix, it is shown that the parallel branches in the trellis diagram of the convolutional code represent the same coset code C(sub 1), of smaller dimension and shorter length. Utilizing this fact, a two-stage optimum trellis decoding method is devised. The first stage decodes C(sub 1), while the second stage decodes the associated convolutional code, using the branch metrics delivered by stage 1. Finally, a bidirectional decoding of each received block starting at both ends is presented. If about the same number of computations is required, this approach remains very attractive from a practical point of view as it roughly doubles the decoding speed. This fact is particularly interesting whenever the second half of the trellis is the mirror image of the first half, since the same decoder can be implemented for both parts.

  17. Parallel changes in mate-attracting calls and female preferences in autotriploid tree frogs

    PubMed Central

    Tucker, Mitch A.; Gerhardt, H. C.

    2012-01-01

    For polyploid species to persist, they must be reproductively isolated from their diploid parental species, which coexist at the same time and place at least initially. In a complex of biparentally reproducing tetraploid and diploid tree frogs in North America, selective phonotaxis—mediated by differences in the pulse-repetition (pulse rate) of their mate-attracting vocalizations—ensures assortative mating. We show that artificially produced autotriploid females of the diploid species (Hyla chrysoscelis) show a shift in pulse-rate preference in the direction of the pulse rate produced by males of the tetraploid species (Hyla versicolor). The estimated preference function is centred near the mean pulse rate of the calls of artificially produced male autotriploids. Such a parallel shift, which is caused by polyploidy per se and whose magnitude is expected to be greater in autotetraploids, may have facilitated sympatric speciation by promoting reproductive isolation of the initially formed polyploids from their diploid parental forms. This process also helps to explain why tetraploid lineages with different origins have similar advertisement calls and freely interbreed. PMID:22113033

  18. Parallel optical image addition and subtraction in a dynamic photorefractive memory by phase-code multiplexing

    NASA Astrophysics Data System (ADS)

    Denz, Cornelia; Dellwig, Thilo; Lembcke, Jan; Tschudi, Theo

    1996-02-01

    We propose and demonstrate experimentally a method for utilizing a dynamic phase-encoded photorefractive memory to realize parallel optical addition, subtraction, and inversion operations of stored images. The phase-encoded holographic memory is realized in photorefractive BaTiO3, storing eight images using WalshHadamard binary phase codes and an incremental recording procedure. By subsampling the set of reference beams during the recall operation, the selectivity of the phase address is decreased, allowing one to combine images in such a way that different linear combination of the images can be realized at the output of the memory.

  19. VINE-A NUMERICAL CODE FOR SIMULATING ASTROPHYSICAL SYSTEMS USING PARTICLES. I. DESCRIPTION OF THE PHYSICS AND THE NUMERICAL METHODS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wetzstein, M.; Nelson, Andrew F.; Naab, T.

    2009-10-01

    We present a numerical code for simulating the evolution of astrophysical systems using particles to represent the underlying fluid flow. The code is written in Fortran 95 and is designed to be versatile, flexible, and extensible, with modular options that can be selected either at the time the code is compiled or at run time through a text input file. We include a number of general purpose modules describing a variety of physical processes commonly required in the astrophysical community and we expect that the effort required to integrate additional or alternate modules into the code will be small. Inmore » its simplest form the code can evolve the dynamical trajectories of a set of particles in two or three dimensions using a module which implements either a Leapfrog or Runge-Kutta-Fehlberg integrator, selected by the user at compile time. The user may choose to allow the integrator to evolve the system using individual time steps for each particle or with a single, global time step for all. Particles may interact gravitationally as N-body particles, and all or any subset may also interact hydrodynamically, using the smoothed particle hydrodynamic (SPH) method by selecting the SPH module. A third particle species can be included with a module to model massive point particles which may accrete nearby SPH or N-body particles. Such particles may be used to model, e.g., stars in a molecular cloud. Free boundary conditions are implemented by default, and a module may be selected to include periodic boundary conditions. We use a binary 'Press' tree to organize particles for rapid access in gravity and SPH calculations. Modules implementing an interface with special purpose 'GRAPE' hardware may also be selected to accelerate the gravity calculations. If available, forces obtained from the GRAPE coprocessors may be transparently substituted for those obtained from the tree, or both tree and GRAPE may be used as a combination GRAPE/tree code. The code may be run without

  20. Vine—A Numerical Code for Simulating Astrophysical Systems Using Particles. I. Description of the Physics and the Numerical Methods

    NASA Astrophysics Data System (ADS)

    Wetzstein, M.; Nelson, Andrew F.; Naab, T.; Burkert, A.

    2009-10-01

    We present a numerical code for simulating the evolution of astrophysical systems using particles to represent the underlying fluid flow. The code is written in Fortran 95 and is designed to be versatile, flexible, and extensible, with modular options that can be selected either at the time the code is compiled or at run time through a text input file. We include a number of general purpose modules describing a variety of physical processes commonly required in the astrophysical community and we expect that the effort required to integrate additional or alternate modules into the code will be small. In its simplest form the code can evolve the dynamical trajectories of a set of particles in two or three dimensions using a module which implements either a Leapfrog or Runge-Kutta-Fehlberg integrator, selected by the user at compile time. The user may choose to allow the integrator to evolve the system using individual time steps for each particle or with a single, global time step for all. Particles may interact gravitationally as N-body particles, and all or any subset may also interact hydrodynamically, using the smoothed particle hydrodynamic (SPH) method by selecting the SPH module. A third particle species can be included with a module to model massive point particles which may accrete nearby SPH or N-body particles. Such particles may be used to model, e.g., stars in a molecular cloud. Free boundary conditions are implemented by default, and a module may be selected to include periodic boundary conditions. We use a binary "Press" tree to organize particles for rapid access in gravity and SPH calculations. Modules implementing an interface with special purpose "GRAPE" hardware may also be selected to accelerate the gravity calculations. If available, forces obtained from the GRAPE coprocessors may be transparently substituted for those obtained from the tree, or both tree and GRAPE may be used as a combination GRAPE/tree code. The code may be run without

  1. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  2. Implementing Shared Memory Parallelism in MCBEND

    NASA Astrophysics Data System (ADS)

    Bird, Adam; Long, David; Dobson, Geoff

    2017-09-01

    MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.

  3. Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

    NASA Technical Reports Server (NTRS)

    Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

    2017-01-01

    Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.

  4. Parallel tempering simulation of the three-dimensional Edwards-Anderson model with compact asynchronous multispin coding on GPU

    NASA Astrophysics Data System (ADS)

    Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark

    2014-10-01

    Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.

  5. Fault trees and sequence dependencies

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne Bechta; Boyd, Mark A.; Bavuso, Salvatore J.

    1990-01-01

    One of the frequently cited shortcomings of fault-tree models, their inability to model so-called sequence dependencies, is discussed. Several sources of such sequence dependencies are discussed, and new fault-tree gates to capture this behavior are defined. These complex behaviors can be included in present fault-tree models because they utilize a Markov solution. The utility of the new gates is demonstrated by presenting several models of the fault-tolerant parallel processor, which include both hot and cold spares.

  6. Binary Trees and Parallel Scheduling Algorithms.

    DTIC Science & Technology

    1980-09-01

    been pro- cessed for p. time units. If a job does not complete by its due time, it is tardy. In a nonpreemptive schedule, job i is scheduled to process...the preemptive schedule obtained by the algorithm of section 2.1.2 also minimizes 5Ti, this problem is easily solved in parallel. When lci is to e...August 1978, pp. 657-661. 14. Horn, W. A., "Some simple scheduling algorithms," Naval Res. Logist . Qur., Vol. 21, pp. 177-185, 1974. i5. Hforowitz, E

  7. Tough2{_}MP: A parallel version of TOUGH2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Keni; Wu, Yu-Shu; Ding, Chris

    2003-04-09

    TOUGH2{_}MP is a massively parallel version of TOUGH2. It was developed for running on distributed-memory parallel computers to simulate large simulation problems that may not be solved by the standard, single-CPU TOUGH2 code. The new code implements an efficient massively parallel scheme, while preserving the full capacity and flexibility of the original TOUGH2 code. The new software uses the METIS software package for grid partitioning and AZTEC software package for linear-equation solving. The standard message-passing interface is adopted for communication among processors. Numerical performance of the current version code has been tested on CRAY-T3E and IBM RS/6000 SP platforms. Inmore » addition, the parallel code has been successfully applied to real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport. In this paper, we will review the development of the TOUGH2{_}MP, and discuss the basic features, modules, and their applications.« less

  8. Aerodynamic simulation on massively parallel systems

    NASA Technical Reports Server (NTRS)

    Haeuser, Jochem; Simon, Horst D.

    1992-01-01

    This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative which is both cost effective and on the other hand can provide the necessary TeraFlops, needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary fitted grids and the fully unstructured tetrahedra grids. A fully structured grid best represents the flow physics, while the unstructured grid gives best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple datastream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two dimensional grid generation module of Grid, which is a general two dimensional and three dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good testcase for complex fluid dynamics codes, since the same datastructures are used. All systems provided good speedups, but

  9. Locating hardware faults in a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-04-13

    Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

  10. Particle In Cell Codes on Highly Parallel Architectures

    NASA Astrophysics Data System (ADS)

    Tableman, Adam

    2014-10-01

    We describe strategies and examples of Particle-In-Cell Codes running on Nvidia GPU and Intel Phi architectures. This includes basic implementations in skeletons codes and full-scale development versions (encompassing 1D, 2D, and 3D codes) in Osiris. Both the similarities and differences between Intel's and Nvidia's hardware will be examined. Work supported by grants NSF ACI 1339893, DOE DE SC 000849, DOE DE SC 0008316, DOE DE NA 0001833, and DOE DE FC02 04ER 54780.

  11. IcyTree: rapid browser-based visualization for phylogenetic trees and networks

    PubMed Central

    2017-01-01

    Abstract Summary: IcyTree is an easy-to-use application which can be used to visualize a wide variety of phylogenetic trees and networks. While numerous phylogenetic tree viewers exist already, IcyTree distinguishes itself by being a purely online tool, having a responsive user interface, supporting phylogenetic networks (ancestral recombination graphs in particular), and efficiently drawing trees that include information such as ancestral locations or trait values. IcyTree also provides intuitive panning and zooming utilities that make exploring large phylogenetic trees of many thousands of taxa feasible. Availability and Implementation: IcyTree is a web application and can be accessed directly at http://tgvaughan.github.com/icytree. Currently supported web browsers include Mozilla Firefox and Google Chrome. IcyTree is written entirely in client-side JavaScript (no plugin required) and, once loaded, does not require network access to run. IcyTree is free software, and the source code is made available at http://github.com/tgvaughan/icytree under version 3 of the GNU General Public License. Contact: tgvaughan@gmail.com PMID:28407035

  12. IcyTree: rapid browser-based visualization for phylogenetic trees and networks.

    PubMed

    Vaughan, Timothy G

    2017-08-01

    IcyTree is an easy-to-use application which can be used to visualize a wide variety of phylogenetic trees and networks. While numerous phylogenetic tree viewers exist already, IcyTree distinguishes itself by being a purely online tool, having a responsive user interface, supporting phylogenetic networks (ancestral recombination graphs in particular), and efficiently drawing trees that include information such as ancestral locations or trait values. IcyTree also provides intuitive panning and zooming utilities that make exploring large phylogenetic trees of many thousands of taxa feasible. IcyTree is a web application and can be accessed directly at http://tgvaughan.github.com/icytree . Currently supported web browsers include Mozilla Firefox and Google Chrome. IcyTree is written entirely in client-side JavaScript (no plugin required) and, once loaded, does not require network access to run. IcyTree is free software, and the source code is made available at http://github.com/tgvaughan/icytree under version 3 of the GNU General Public License. tgvaughan@gmail.com. © The Author(s) 2017. Published by Oxford University Press.

  13. Labeled trees and the efficient computation of derivations

    NASA Technical Reports Server (NTRS)

    Grossman, Robert; Larson, Richard G.

    1989-01-01

    The effective parallel symbolic computation of operators under composition is discussed. Examples include differential operators under composition and vector fields under the Lie bracket. Data structures consisting of formal linear combinations of rooted labeled trees are discussed. A multiplication on rooted labeled trees is defined, thereby making the set of these data structures into an associative algebra. An algebra homomorphism is defined from the original algebra of operators into this algebra of trees. An algebra homomorphism from the algebra of trees into the algebra of differential operators is then described. The cancellation which occurs when noncommuting operators are expressed in terms of commuting ones occurs naturally when the operators are represented using this data structure. This leads to an algorithm which, for operators which are derivations, speeds up the computation exponentially in the degree of the operator. It is shown that the algebra of trees leads naturally to a parallel version of the algorithm.

  14. Real-time SHVC software decoding with multi-threaded parallel processing

    NASA Astrophysics Data System (ADS)

    Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

    2014-09-01

    This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.

  15. Walking tree heuristics for biological string alignment, gene location, and phylogenies

    NASA Astrophysics Data System (ADS)

    Cull, P.; Holloway, J. L.; Cavener, J. D.

    1999-03-01

    Basic biological information is stored in strings of nucleic acids (DNA, RNA) or amino acids (proteins). Teasing out the meaning of these strings is a central problem of modern biology. Matching and aligning strings brings out their shared characteristics. Although string matching is well-understood in the edit-distance model, biological strings with transpositions and inversions violate this model's assumptions. We propose a family of heuristics called walking trees to align biologically reasonable strings. Both edit-distance and walking tree methods can locate specific genes within a large string when the genes' sequences are given. When we attempt to match whole strings, the walking tree matches most genes, while the edit-distance method fails. We also give examples in which the walking tree matches substrings even if they have been moved or inverted. The edit-distance method was not designed to handle these problems. We include an example in which the walking tree "discovered" a gene. Calculating scores for whole genome matches gives a method for approximating evolutionary distance. We show two evolutionary trees for the picornaviruses which were computed by the walking tree heuristic. Both of these trees show great similarity to previously constructed trees. The point of this demonstration is that WHOLE genomes can be matched and distances calculated. The first tree was created on a Sequent parallel computer and demonstrates that the walking tree heuristic can be efficiently parallelized. The second tree was created using a network of work stations and demonstrates that there is suffient parallelism in the phylogenetic tree calculation that the sequential walking tree can be used effectively on a network.

  16. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.

  17. Ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses.

    PubMed

    Fouquier, Jennifer; Rideout, Jai Ram; Bolyen, Evan; Chase, John; Shiffer, Arron; McDonald, Daniel; Knight, Rob; Caporaso, J Gregory; Kelley, Scott T

    2016-02-24

    -phylogenetic methods for larger effect sizes. The Silva/UNITE-based ghost tree presented here can be easily integrated into existing fungal analysis pipelines to enhance the resolution of fungal community differences and improve understanding of these communities in built environments. The ghost-tree software package can also be used to develop phylogenetic trees for other marker gene sets that afford different taxonomic resolution, or for bridging genome trees with amplicon trees. ghost-tree is pip-installable. All source code, documentation, and test code are available under the BSD license at https://github.com/JTFouquier/ghost-tree .

  18. TOMO3D: 3-D joint refraction and reflection traveltime tomography parallel code for active-source seismic data—synthetic test

    NASA Astrophysics Data System (ADS)

    Meléndez, A.; Korenaga, J.; Sallarès, V.; Miniussi, A.; Ranero, C. R.

    2015-10-01

    We present a new 3-D traveltime tomography code (TOMO3D) for the modelling of active-source seismic data that uses the arrival times of both refracted and reflected seismic phases to derive the velocity distribution and the geometry of reflecting boundaries in the subsurface. This code is based on its popular 2-D version TOMO2D from which it inherited the methods to solve the forward and inverse problems. The traveltime calculations are done using a hybrid ray-tracing technique combining the graph and bending methods. The LSQR algorithm is used to perform the iterative regularized inversion to improve the initial velocity and depth models. In order to cope with an increased computational demand due to the incorporation of the third dimension, the forward problem solver, which takes most of the run time (˜90 per cent in the test presented here), has been parallelized with a combination of multi-processing and message passing interface standards. This parallelization distributes the ray-tracing and traveltime calculations among available computational resources. The code's performance is illustrated with a realistic synthetic example, including a checkerboard anomaly and two reflectors, which simulates the geometry of a subduction zone. The code is designed to invert for a single reflector at a time. A data-driven layer-stripping strategy is proposed for cases involving multiple reflectors, and it is tested for the successive inversion of the two reflectors. Layers are bound by consecutive reflectors, and an initial velocity model for each inversion step incorporates the results from previous steps. This strategy poses simpler inversion problems at each step, allowing the recovery of strong velocity discontinuities that would otherwise be smoothened.

  19. Optimal parallel solution of sparse triangular systems

    NASA Technical Reports Server (NTRS)

    Alvarado, Fernando L.; Schreiber, Robert

    1990-01-01

    A method for the parallel solution of triangular sets of equations is described that is appropriate when there are many right-handed sides. By preprocessing, the method can reduce the number of parallel steps required to solve Lx = b compared to parallel forward or backsolve. Applications are to iterative solvers with triangular preconditioners, to structural analysis, or to power systems applications, where there may be many right-handed sides (not all available a priori). The inverse of L is represented as a product of sparse triangular factors. The problem is to find a factored representation of this inverse of L with the smallest number of factors (or partitions), subject to the requirement that no new nonzero elements be created in the formation of these inverse factors. A method from an earlier reference is shown to solve this problem. This method is improved upon by constructing a permutation of the rows and columns of L that preserves triangularity and allow for the best possible such partition. A number of practical examples and algorithmic details are presented. The parallelism attainable is illustrated by means of elimination trees and clique trees.

  20. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)

    2001-01-01

    This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs including the CAPTools Database which provides variable definition information across subroutines as well as array distribution information.

  1. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.

  2. Parallel computing techniques for rotorcraft aerodynamics

    NASA Astrophysics Data System (ADS)

    Ekici, Kivanc

    The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).

  3. An ecological and evolutionary perspective on the parallel invasion of two cross-compatible trees

    PubMed Central

    Besnard, Guillaume; Cuneo, Peter

    2016-01-01

    Invasive trees are generally seen as ecosystem-transforming plants that can have significant impacts on native vegetation, and often require management and control. Understanding their history and biology is essential to guide actions of land managers. Here, we present a summary of recent research into the ecology, phylogeography and management of invasive olives, which are now established outside of their native range as high ecological impact invasive trees. The parallel invasion of European and African olive in different climatic zones of Australia provides an interesting case study of invasion, characterized by early genetic admixture between domesticated and wild taxa. Today, the impact of the invasive olives on native vegetation and ecosystem function is of conservation concern, with European olive a declared weed in areas of South Australia, and African olive a declared weed in New South Wales and Pacific islands. Population genetics was used to trace the origins and invasion of both subspecies in Australia, indicating that both olive subspecies have hybridized early after introduction. Research also indicates that African olive populations can establish from a low number of founder individuals even after successive bottlenecks. Modelling based on distributional data from the native and invasive range identified a shift of the realized ecological niche in the Australian invasive range for both olive subspecies, which was particularly marked for African olive. As highly successful and long-lived invaders, olives offer further opportunities to understand the genetic basis of invasion, and we propose that future research examines the history of introduction and admixture, the genetic basis of adaptability and the role of biotic interactions during invasion. Advances on these questions will ultimately improve predictions on the future olive expansion and provide a solid basis for better management of invasive populations. PMID:27519914

  4. Xyce parallel electronic simulator : users' guide.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.

    2011-05-01

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a

  5. Modification of the SAS4A Safety Analysis Code for Integration with the ADAPT Discrete Dynamic Event Tree Framework.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jankovsky, Zachary Kyle; Denman, Matthew R.

    It is difficult to assess the consequences of a transient in a sodium-cooled fast reactor (SFR) using traditional probabilistic risk assessment (PRA) methods, as numerous safety-related sys- tems have passive characteristics. Often there is significant dependence on the value of con- tinuous stochastic parameters rather than binary success/failure determinations. One form of dynamic PRA uses a system simulator to represent the progression of a transient, tracking events through time in a discrete dynamic event tree (DDET). In order to function in a DDET environment, a simulator must have characteristics that make it amenable to changing physical parameters midway through themore » analysis. The SAS4A SFR system analysis code did not have these characteristics as received. This report describes the code modifications made to allow dynamic operation as well as the linking to a Sandia DDET driver code. A test case is briefly described to demonstrate the utility of the changes.« less

  6. Binary tree eigen solver in finite element analysis

    NASA Technical Reports Server (NTRS)

    Akl, F. A.; Janetzke, D. C.; Kiraly, L. J.

    1993-01-01

    This paper presents a transputer-based binary tree eigensolver for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on the method of recursive doubling, which parallel implementation of a number of associative operations on an arbitrary set having N elements is of the order of o(log2N), compared to (N-1) steps if implemented sequentially. The hardware used in the implementation of the binary tree consists of 32 transputers. The algorithm is written in OCCAM which is a high-level language developed with the transputers to address parallel programming constructs and to provide the communications between processors. The algorithm can be replicated to match the size of the binary tree transputer network. Parallel and sequential finite element analysis programs have been developed to solve for the set of the least-order eigenpairs using the modified subspace method. The speed-up obtained for a typical analysis problem indicates close agreement with the theoretical prediction given by the method of recursive doubling.

  7. User's Guide for ENSAERO_FE Parallel Finite Element Solver

    NASA Technical Reports Server (NTRS)

    Eldred, Lloyd B.; Guruswamy, Guru P.

    1999-01-01

    A high fidelity parallel static structural analysis capability is created and interfaced to the multidisciplinary analysis package ENSAERO-MPI of Ames Research Center. This new module replaces ENSAERO's lower fidelity simple finite element and modal modules. Full aircraft structures may be more accurately modeled using the new finite element capability. Parallel computation is performed by breaking the full structure into multiple substructures. This approach is conceptually similar to ENSAERO's multizonal fluid analysis capability. The new substructure code is used to solve the structural finite element equations for each substructure in parallel. NASTRANKOSMIC is utilized as a front end for this code. Its full library of elements can be used to create an accurate and realistic aircraft model. It is used to create the stiffness matrices for each substructure. The new parallel code then uses an iterative preconditioned conjugate gradient method to solve the global structural equations for the substructure boundary nodes.

  8. Implementation of parallel moment equations in NIMROD

    NASA Astrophysics Data System (ADS)

    Lee, Hankyu Q.; Held, Eric D.; Ji, Jeong-Young

    2017-10-01

    As collisionality is low (the Knudsen number is large) in many plasma applications, kinetic effects become important, particularly in parallel dynamics for magnetized plasmas. Fluid models can capture some kinetic effects when integral parallel closures are adopted. The adiabatic and linear approximations are used in solving general moment equations to obtain the integral closures. In this work, we present an effort to incorporate non-adiabatic (time-dependent) and nonlinear effects into parallel closures. Instead of analytically solving the approximate moment system, we implement exact parallel moment equations in the NIMROD fluid code. The moment code is expected to provide a natural convergence scheme by increasing the number of moments. Work in collaboration with the PSI Center and supported by the U.S. DOE under Grant Nos. DE-SC0014033, DE-SC0016256, and DE-FG02-04ER54746.

  9. Parallel processing approach to transform-based image coding

    NASA Astrophysics Data System (ADS)

    Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.

    1991-06-01

    This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links, each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, results ar presented with image statistics and data rates. Finally techniques for accelerating the system throughput are analyzed and results from the application of one such modification described.

  10. Parallel-vector computation for linear structural analysis and non-linear unconstrained optimization problems

    NASA Technical Reports Server (NTRS)

    Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.

    1991-01-01

    Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.

  11. Parallel Mitogenome Sequencing Alleviates Random Rooting Effect in Phylogeography.

    PubMed

    Hirase, Shotaro; Takeshima, Hirohiko; Nishida, Mutsumi; Iwasaki, Wataru

    2016-04-28

    Reliably rooted phylogenetic trees play irreplaceable roles in clarifying diversification in the patterns of species and populations. However, such trees are often unavailable in phylogeographic studies, particularly when the focus is on rapidly expanded populations that exhibit star-like trees. A fundamental bottleneck is known as the random rooting effect, where a distant outgroup tends to root an unrooted tree "randomly." We investigated whether parallel mitochondrial genome (mitogenome) sequencing alleviates this effect in phylogeography using a case study on the Sea of Japan lineage of the intertidal goby Chaenogobius annularis Eighty-three C. annularis individuals were collected and their mitogenomes were determined by high-throughput and low-cost parallel sequencing. Phylogenetic analysis of these mitogenome sequences was conducted to root the Sea of Japan lineage, which has a star-like phylogeny and had not been reliably rooted. The topologies of the bootstrap trees were investigated to determine whether the use of mitogenomes alleviated the random rooting effect. The mitogenome data successfully rooted the Sea of Japan lineage by alleviating the effect, which hindered phylogenetic analysis that used specific gene sequences. The reliable rooting of the lineage led to the discovery of a novel, northern lineage that expanded during an interglacial period with high bootstrap support. Furthermore, the finding of this lineage suggested the existence of additional glacial refugia and provided a new recent calibration point that revised the divergence time estimation between the Sea of Japan and Pacific Ocean lineages. This study illustrates the effectiveness of parallel mitogenome sequencing for solving the random rooting problem in phylogeographic studies. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  12. Parallel Coding of First- and Second-Order Stimulus Attributes by Midbrain Electrosensory Neurons

    PubMed Central

    McGillivray, Patrick; Vonderschen, Katrin; Fortune, Eric S.; Chacron, Maurice J.

    2015-01-01

    Natural stimuli often have time-varying first-order (i.e., mean) and second-order (i.e., variance) attributes that each carry critical information for perception and can vary independently over orders of magnitude. Experiments have shown that sensory systems continuously adapt their responses based on changes in each of these attributes. This adaptation creates ambiguity in the neural code as multiple stimuli may elicit the same neural response. While parallel processing of first- and second-order attributes by separate neural pathways is sufficient to remove this ambiguity, the existence of such pathways and the neural circuits that mediate their emergence have not been uncovered to date. We recorded the responses of midbrain electrosensory neurons in the weakly electric fish Apteronotus leptorhynchus to stimuli with first- and second-order attributes that varied independently in time. We found three distinct groups of midbrain neurons: the first group responded to both first- and second-order attributes, the second group responded selectively to first-order attributes, and the last group responded selectively to second-order attributes. In contrast, all afferent hindbrain neurons responded to both first- and second-order attributes. Using computational analyses, we show how inputs from a heterogeneous population of ON- and OFF-type afferent neurons are combined to give rise to response selectivity to either first- or second-order stimulus attributes in midbrain neurons. Our study thus uncovers, for the first time, generic and widely applicable mechanisms by which parallel processing of first- and second-order stimulus attributes emerges in the brain. PMID:22514313

  13. Parallel computing on Unix workstation arrays

    NASA Astrophysics Data System (ADS)

    Reale, F.; Bocchino, F.; Sciortino, S.

    1994-12-01

    We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by exucuting it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.

  14. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  15. Dust Dynamics in Protoplanetary Disks: Parallel Computing with PVM

    NASA Astrophysics Data System (ADS)

    de La Fuente Marcos, Carlos; Barge, Pierre; de La Fuente Marcos, Raúl

    2002-03-01

    We describe a parallel version of our high-order-accuracy particle-mesh code for the simulation of collisionless protoplanetary disks. We use this code to carry out a massively parallel, two-dimensional, time-dependent, numerical simulation, which includes dust particles, to study the potential role of large-scale, gaseous vortices in protoplanetary disks. This noncollisional problem is easy to parallelize on message-passing multicomputer architectures. We performed the simulations on a cache-coherent nonuniform memory access Origin 2000 machine, using both the parallel virtual machine (PVM) and message-passing interface (MPI) message-passing libraries. Our performance analysis suggests that, for our problem, PVM is about 25% faster than MPI. Using PVM and MPI made it possible to reduce CPU time and increase code performance. This allows for simulations with a large number of particles (N ~ 105-106) in reasonable CPU times. The performances of our implementation of the pa! rallel code on an Origin 2000 supercomputer are presented and discussed. They exhibit very good speedup behavior and low load unbalancing. Our results confirm that giant gaseous vortices can play a dominant role in giant planet formation.

  16. Low Density Parity Check Codes: Bandwidth Efficient Channel Coding

    NASA Technical Reports Server (NTRS)

    Fong, Wai; Lin, Shu; Maki, Gary; Yeh, Pen-Shu

    2003-01-01

    Low Density Parity Check (LDPC) Codes provide near-Shannon Capacity performance for NASA Missions. These codes have high coding rates R=0.82 and 0.875 with moderate code lengths, n=4096 and 8176. Their decoders have inherently parallel structures which allows for high-speed implementation. Two codes based on Euclidean Geometry (EG) were selected for flight ASIC implementation. These codes are cyclic and quasi-cyclic in nature and therefore have a simple encoder structure. This results in power and size benefits. These codes also have a large minimum distance as much as d,,, = 65 giving them powerful error correcting capabilities and error floors less than lo- BER. This paper will present development of the LDPC flight encoder and decoder, its applications and status.

  17. SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws

    NASA Technical Reports Server (NTRS)

    Cooke, Daniel; Rushton, Nelson

    2013-01-01

    With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for highend computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language that is, a programming language that is closer to a human s way of thinking than to a machine s. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequen tial/singlecore code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify- Produce (CSP) and Normalize-Trans - pose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less

  18. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.

  19. Synthesizing parallel imaging applications using the CAP (computer-aided parallelization) tool

    NASA Astrophysics Data System (ADS)

    Gennart, Benoit A.; Mazzariol, Marc; Messerli, Vincent; Hersch, Roger D.

    1997-12-01

    Imaging applications such as filtering, image transforms and compression/decompression require vast amounts of computing power when applied to large data sets. These applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Furthermore, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs may not be portable from one parallel architecture to the other, and performance often comes short of expectations. In order to facilitate the development of parallel applications, we propose the CAP computer-aided parallelization tool which enables application programmers to specify at a high-level of abstraction the flow of data between pipelined-parallel operations. In addition, the CAP tool supports the programmer in developing parallel imaging and storage operations. CAP enables combining efficiently parallel storage access routines and image processing sequential operations. This paper shows how processing and I/O intensive imaging applications must be implemented to take advantage of parallelism and pipelining between data access and processing. This paper's contribution is (1) to show how such implementations can be compactly specified in CAP, and (2) to demonstrate that CAP specified applications achieve the performance of custom parallel code. The paper analyzes theoretically the performance of CAP specified applications and demonstrates the accuracy of the theoretical analysis through experimental measurements.

  20. Xyce parallel electronic simulator users guide, version 6.1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less

  1. Xyce Parallel Electronic Simulator Users' Guide Version 6.8

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less

  2. Xyce parallel electronic simulator users guide, version 6.0.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less

  3. The Particle Accelerator Simulation Code PyORBIT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gorlov, Timofey V; Holmes, Jeffrey A; Cousineau, Sarah M

    2015-01-01

    The particle accelerator simulation code PyORBIT is presented. The structure, implementation, history, parallel and simulation capabilities, and future development of the code are discussed. The PyORBIT code is a new implementation and extension of algorithms of the original ORBIT code that was developed for the Spallation Neutron Source accelerator at the Oak Ridge National Laboratory. The PyORBIT code has a two level structure. The upper level uses the Python programming language to control the flow of intensive calculations performed by the lower level code implemented in the C++ language. The parallel capabilities are based on MPI communications. The PyORBIT ismore » an open source code accessible to the public through the Google Open Source Projects Hosting service.« less

  4. Transferring ecosystem simulation codes to supercomputers

    NASA Technical Reports Server (NTRS)

    Skiles, J. W.; Schulbach, C. H.

    1995-01-01

    Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectoring and 'in-line' capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.

  5. Interframe vector wavelet coding technique

    NASA Astrophysics Data System (ADS)

    Wus, John P.; Li, Weiping

    1997-01-01

    Wavelet coding is often used to divide an image into multi- resolution wavelet coefficients which are quantized and coded. By 'vectorizing' scalar wavelet coding and combining this with vector quantization (VQ), vector wavelet coding (VWC) can be implemented. Using a finite number of states, finite-state vector quantization (FSVQ) takes advantage of the similarity between frames by incorporating memory into the video coding system. Lattice VQ eliminates the potential mismatch that could occur using pre-trained VQ codebooks. It also eliminates the need for codebook storage in the VQ process, thereby creating a more robust coding system. Therefore, by using the VWC coding method in conjunction with the FSVQ system and lattice VQ, the formulation of a high quality very low bit rate coding systems is proposed. A coding system using a simple FSVQ system where the current state is determined by the previous channel symbol only is developed. To achieve a higher degree of compression, a tree-like FSVQ system is implemented. The groupings are done in this tree-like structure from the lower subbands to the higher subbands in order to exploit the nature of subband analysis in terms of the parent-child relationship. Class A and Class B video sequences from the MPEG-IV testing evaluations are used in the evaluation of this coding method.

  6. Why Do Phylogenomic Data Sets Yield Conflicting Trees? Data Type Influences the Avian Tree of Life more than Taxon Sampling.

    PubMed

    Reddy, Sushma; Kimball, Rebecca T; Pandey, Akanksha; Hosner, Peter A; Braun, Michael J; Hackett, Shannon J; Han, Kin-Lan; Harshman, John; Huddleston, Christopher J; Kingston, Sarah; Marks, Ben D; Miglia, Kathleen J; Moore, William S; Sheldon, Frederick H; Witt, Christopher C; Yuri, Tamaki; Braun, Edward L

    2017-09-01

    Phylogenomics, the use of large-scale data matrices in phylogenetic analyses, has been viewed as the ultimate solution to the problem of resolving difficult nodes in the tree of life. However, it has become clear that analyses of these large genomic data sets can also result in conflicting estimates of phylogeny. Here, we use the early divergences in Neoaves, the largest clade of extant birds, as a "model system" to understand the basis for incongruence among phylogenomic trees. We were motivated by the observation that trees from two recent avian phylogenomic studies exhibit conflicts. Those studies used different strategies: 1) collecting many characters [$\\sim$ 42 mega base pairs (Mbp) of sequence data] from 48 birds, sometimes including only one taxon for each major clade; and 2) collecting fewer characters ($\\sim$ 0.4 Mbp) from 198 birds, selected to subdivide long branches. However, the studies also used different data types: the taxon-poor data matrix comprised 68% non-coding sequences whereas coding exons dominated the taxon-rich data matrix. This difference raises the question of whether the primary reason for incongruence is the number of sites, the number of taxa, or the data type. To test among these alternative hypotheses we assembled a novel, large-scale data matrix comprising 90% non-coding sequences from 235 bird species. Although increased taxon sampling appeared to have a positive impact on phylogenetic analyses the most important variable was data type. Indeed, by analyzing different subsets of the taxa in our data matrix we found that increased taxon sampling actually resulted in increased congruence with the tree from the previous taxon-poor study (which had a majority of non-coding data) instead of the taxon-rich study (which largely used coding data). We suggest that the observed differences in the estimates of topology for these studies reflect data-type effects due to violations of the models used in phylogenetic analyses, some of which

  7. A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000 Using the NAS Benchmarks

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry

    1998-01-01

    Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand written NAB Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools: an interactive computer aided parallelization too] that generates message passing code, 2) the Portland Group's HPF compiler and 3) using compiler directives with the native FORTAN77 compiler on the SGI Origin2000.

  8. Improve load balancing and coding efficiency of tiles in high efficiency video coding by adaptive tile boundary

    NASA Astrophysics Data System (ADS)

    Chan, Chia-Hsin; Tu, Chun-Chuan; Tsai, Wen-Jiin

    2017-01-01

    High efficiency video coding (HEVC) not only improves the coding efficiency drastically compared to the well-known H.264/AVC but also introduces coding tools for parallel processing, one of which is tiles. Tile partitioning is allowed to be arbitrary in HEVC, but how to decide tile boundaries remains an open issue. An adaptive tile boundary (ATB) method is proposed to select a better tile partitioning to improve load balancing (ATB-LoadB) and coding efficiency (ATB-Gain) with a unified scheme. Experimental results show that, compared to ordinary uniform-space partitioning, the proposed ATB can save up to 17.65% of encoding times in parallel encoding scenarios and can reduce up to 0.8% of total bit rates for coding efficiency.

  9. Integrated Task and Data Parallel Programming

    NASA Technical Reports Server (NTRS)

    Grimshaw, A. S.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities During the fall I collaborated

  10. The fault-tree compiler

    NASA Technical Reports Server (NTRS)

    Martensen, Anna L.; Butler, Ricky W.

    1987-01-01

    The Fault Tree Compiler Program is a new reliability tool used to predict the top event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N gates. The high level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precise (within the limits of double precision floating point arithmetic) to the five digits in the answer. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Corporation VAX with the VMS operation system.

  11. Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less

  12. Parallelization of an Object-Oriented Unstructured Aeroacoustics Solver

    NASA Technical Reports Server (NTRS)

    Baggag, Abdelkader; Atkins, Harold; Oezturan, Can; Keyes, David

    1999-01-01

    A computational aeroacoustics code based on the discontinuous Galerkin method is ported to several parallel platforms using MPI. The discontinuous Galerkin method is a compact high-order method that retains its accuracy and robustness on non-smooth unstructured meshes. In its semi-discrete form, the discontinuous Galerkin method can be combined with explicit time marching methods making it well suited to time accurate computations. The compact nature of the discontinuous Galerkin method also makes it well suited for distributed memory parallel platforms. The original serial code was written using an object-oriented approach and was previously optimized for cache-based machines. The port to parallel platforms was achieved simply by treating partition boundaries as a type of boundary condition. Code modifications were minimal because boundary conditions were abstractions in the original program. Scalability results are presented for the SCI Origin, IBM SP2, and clusters of SGI and Sun workstations. Slightly superlinear speedup is achieved on a fixed-size problem on the Origin, due to cache effects.

  13. Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti.

    PubMed

    Copetti, Dario; Búrquez, Alberto; Bustamante, Enriquena; Charboneau, Joseph L M; Childs, Kevin L; Eguiarte, Luis E; Lee, Seunghee; Liu, Tiffany L; McMahon, Michelle M; Whiteman, Noah K; Wing, Rod A; Wojciechowski, Martin F; Sanderson, Michael J

    2017-11-07

    Few clades of plants have proven as difficult to classify as cacti. One explanation may be an unusually high level of convergent and parallel evolution (homoplasy). To evaluate support for this phylogenetic hypothesis at the molecular level, we sequenced the genomes of four cacti in the especially problematic tribe Pachycereeae, which contains most of the large columnar cacti of Mexico and adjacent areas, including the iconic saguaro cactus ( Carnegiea gigantea ) of the Sonoran Desert. We assembled a high-coverage draft genome for saguaro and lower coverage genomes for three other genera of tribe Pachycereeae ( Pachycereus , Lophocereus , and Stenocereus ) and a more distant outgroup cactus, Pereskia We used these to construct 4,436 orthologous gene alignments. Species tree inference consistently returned the same phylogeny, but gene tree discordance was high: 37% of gene trees having at least 90% bootstrap support conflicted with the species tree. Evidently, discordance is a product of long generation times and moderately large effective population sizes, leading to extensive incomplete lineage sorting (ILS). In the best supported gene trees, 58% of apparent homoplasy at amino sites in the species tree is due to gene tree-species tree discordance rather than parallel substitutions in the gene trees themselves, a phenomenon termed "hemiplasy." The high rate of genomic hemiplasy may contribute to apparent parallelisms in phenotypic traits, which could confound understanding of species relationships and character evolution in cacti. Published under the PNAS license.

  14. Extensive gene tree discordance and hemiplasy shaped the genomes of North American columnar cacti

    PubMed Central

    Búrquez, Alberto; Bustamante, Enriquena; Charboneau, Joseph L. M.; Childs, Kevin L.; Eguiarte, Luis E.; Lee, Seunghee; Liu, Tiffany L.; McMahon, Michelle M.; Whiteman, Noah K.; Wing, Rod A.; Wojciechowski, Martin F.; Sanderson, Michael J.

    2017-01-01

    Few clades of plants have proven as difficult to classify as cacti. One explanation may be an unusually high level of convergent and parallel evolution (homoplasy). To evaluate support for this phylogenetic hypothesis at the molecular level, we sequenced the genomes of four cacti in the especially problematic tribe Pachycereeae, which contains most of the large columnar cacti of Mexico and adjacent areas, including the iconic saguaro cactus (Carnegiea gigantea) of the Sonoran Desert. We assembled a high-coverage draft genome for saguaro and lower coverage genomes for three other genera of tribe Pachycereeae (Pachycereus, Lophocereus, and Stenocereus) and a more distant outgroup cactus, Pereskia. We used these to construct 4,436 orthologous gene alignments. Species tree inference consistently returned the same phylogeny, but gene tree discordance was high: 37% of gene trees having at least 90% bootstrap support conflicted with the species tree. Evidently, discordance is a product of long generation times and moderately large effective population sizes, leading to extensive incomplete lineage sorting (ILS). In the best supported gene trees, 58% of apparent homoplasy at amino sites in the species tree is due to gene tree-species tree discordance rather than parallel substitutions in the gene trees themselves, a phenomenon termed “hemiplasy.” The high rate of genomic hemiplasy may contribute to apparent parallelisms in phenotypic traits, which could confound understanding of species relationships and character evolution in cacti. PMID:29078296

  15. A high-speed linear algebra library with automatic parallelism

    NASA Technical Reports Server (NTRS)

    Boucher, Michael L.

    1994-01-01

    Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.

  16. Interfacing Computer Aided Parallelization and Performance Analysis

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)

    2003-01-01

    When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.

  17. Xyce parallel electronic simulator users' guide, Version 6.0.1.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less

  18. Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azmy, Y.Y.; Barnett, D.A.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates were updated to function in a UNI-COS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  19. Vascular system modeling in parallel environment - distributed and shared memory approaches

    PubMed Central

    Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

    2011-01-01

    The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891

  20. Seeing the forest for the trees: Networked workstations as a parallel processing computer

    NASA Technical Reports Server (NTRS)

    Breen, J. O.; Meleedy, D. M.

    1992-01-01

    Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby, performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system can not rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.

  1. Development and Application of a Parallel LCAO Cluster Method

    NASA Astrophysics Data System (ADS)

    Patton, David C.

    1997-08-01

    CPU intensive steps in the SCF electronic structure calculations of clusters and molecules with a first-principles LCAO method have been fully parallelized via a message passing paradigm. Identification of the parts of the code that are composed of many independent compute-intensive steps is discussed in detail as they are the most readily parallelized. Most of the parallelization involves spatially decomposing numerical operations on a mesh. One exception is the solution of Poisson's equation which relies on distribution of the charge density and multipole methods. The method we use to parallelize this part of the calculation is quite novel and is covered in detail. We present a general method for dynamically load-balancing a parallel calculation and discuss how we use this method in our code. The results of benchmark calculations of the IR and Raman spectra of PAH molecules such as anthracene (C_14H_10) and tetracene (C_18H_12) are presented. These benchmark calculations were performed on an IBM SP2 and a SUN Ultra HPC server with both MPI and PVM. Scalability and speedup for these calculations is analyzed to determine the efficiency of the code. In addition, performance and usage issues for MPI and PVM are presented.

  2. Soft-output decoding algorithms in iterative decoding of turbo codes

    NASA Technical Reports Server (NTRS)

    Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.

    1996-01-01

    In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.

  3. The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete; hide

    1998-01-01

    Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.

  4. Parallel auto-correlative statistics with VTK.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pebay, Philippe Pierre; Bennett, Janine Camille

    2013-08-01

    This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the autocorrelative statistics engine.

  5. A Massively Parallel Code for Polarization Calculations

    NASA Astrophysics Data System (ADS)

    Akiyama, Shizuka; Höflich, Peter

    2001-03-01

    We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.

  6. OFF, Open source Finite volume Fluid dynamics code: A free, high-order solver based on parallel, modular, object-oriented Fortran API

    NASA Astrophysics Data System (ADS)

    Zaghi, S.

    2014-07-01

    OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows: Programming LanguageOFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency; Parallel Frameworks Supported the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms; Usability, Maintenance and Enhancement in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git

  7. SKIRT: Hybrid parallelization of radiative transfer simulations

    NASA Astrophysics Data System (ADS)

    Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

    2017-07-01

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.

  8. Efficient Delaunay Tessellation through K-D Tree Decomposition

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morozov, Dmitriy; Peterka, Tom

    Delaunay tessellations are fundamental data structures in computational geometry. They are important in data analysis, where they can represent the geometry of a point set or approximate its density. The algorithms for computing these tessellations at scale perform poorly when the input data is unbalanced. We investigate the use of k-d trees to evenly distribute points among processes and compare two strategies for picking split points between domain regions. Because resulting point distributions no longer satisfy the assumptions of existing parallel Delaunay algorithms, we develop a new parallel algorithm that adapts to its input and prove its correctness. We evaluatemore » the new algorithm using two late-stage cosmology datasets. The new running times are up to 50 times faster using k-d tree compared with regular grid decomposition. Moreover, in the unbalanced data sets, decomposing the domain into a k-d tree is up to five times faster than decomposing it into a regular grid.« less

  9. Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding.

    PubMed

    Palkowski, Marek; Bielecki, Wlodzimierz

    2018-01-15

    RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithms is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov's recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov's RNA folding. Such techniques are within the iteration space slicing framework - the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem at generating parallel tiled code is defining a proper tile size and tile dimension which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). With this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance. For a given search space, the presented approach allows us to choose the best tile size and tile dimension in

  10. Nyx: Adaptive mesh, massively-parallel, cosmological simulation code

    NASA Astrophysics Data System (ADS)

    Almgren, Ann; Beckner, Vince; Friesen, Brian; Lukic, Zarija; Zhang, Weiqun

    2017-12-01

    Nyx code solves equations of compressible hydrodynamics on an adaptive grid hierarchy coupled with an N-body treatment of dark matter. The gas dynamics in Nyx use a finite volume methodology on an adaptive set of 3-D Eulerian grids; dark matter is represented as discrete particles moving under the influence of gravity. Particles are evolved via a particle-mesh method, using Cloud-in-Cell deposition/interpolation scheme. Both baryonic and dark matter contribute to the gravitational field. In addition, Nyx includes physics for accurately modeling the intergalactic medium; in optically thin limits and assuming ionization equilibrium, the code calculates heating and cooling processes of the primordial-composition gas in an ionizing ultraviolet background radiation field.

  11. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less

  12. Identification of microRNAs and long non-coding RNAs involved in fatty acid biosynthesis in tree peony seeds.

    PubMed

    Yin, Dan-Dan; Li, Shan-Shan; Shu, Qing-Yan; Gu, Zhao-Yu; Wu, Qian; Feng, Cheng-Yong; Xu, Wen-Zhong; Wang, Liang-Sheng

    2018-08-05

    MicroRNAs (miRNAs) and long noncoding RNAs (lncRNAs) act as important molecular regulators in a wide range of biological processes during plant development and seed formation, including oil production. Tree peony seeds contain >90% unsaturated fatty acids (UFAs) and high proportions of α-linolenic acid (ALA, > 40%). To dissect the non-coding RNAs (ncRNAs) pathway involved in fatty acids synthesis in tree peony seeds, we construct six small RNA libraries and six transcriptome libraries from developing seeds of two cultivars (J and S) containing different content of fatty acid compositions. After deep sequencing the RNA libraries, the ncRNA expression profiles of tree peony seeds in two cultivars were systematically and comparatively analyzed. A total of 318 known and 153 new miRNAs and 22,430 lncRNAs were identified, among which 106 conserved and 9 novel miRNAs and 2785 lncRNAs were differentially expressed between the two cultivars. In addition, potential target genes of the microRNA and lncRNAs were also predicted and annotated. Among them, 9 miRNAs and 39 lncRNAs were predicted to target lipid related genes. Results showed that all of miR414, miR156b, miR2673b, miR7826, novel-m0027-5p, TR24651|c0_g1, TR24544|c0_g15, and TR27305|c0_g1 were up-regulated and expressed at a higher level in high-ALA cultivar J when compared to low-ALA cultivar S, suggesting that these ncRNAs and target genes are possibly involved in different fatty acid synthesis and lipid metabolism through post-transcriptional regulation. These results provide a better understanding of the roles of ncRNAs during fatty acid biosynthesis and metabolism in tree peony seeds. Copyright © 2018 Elsevier B.V. All rights reserved.

  13. TORUS: Radiation transport and hydrodynamics code

    NASA Astrophysics Data System (ADS)

    Harries, Tim

    2014-04-01

    TORUS is a flexible radiation transfer and radiation-hydrodynamics code. The code has a basic infrastructure that includes the AMR mesh scheme that is used by several physics modules including atomic line transfer in a moving medium, molecular line transfer, photoionization, radiation hydrodynamics and radiative equilibrium. TORUS is useful for a variety of problems, including magnetospheric accretion onto T Tauri stars, spiral nebulae around Wolf-Rayet stars, discs around Herbig AeBe stars, structured winds of O supergiants and Raman-scattered line formation in symbiotic binaries, and dust emission and molecular line formation in star forming clusters. The code is written in Fortran 2003 and is compiled using a standard Gnu makefile. The code is parallelized using both MPI and OMP, and can use these parallel sections either separately or in a hybrid mode.

  14. Graphical Representation of Parallel Algorithmic Processes

    DTIC Science & Technology

    1990-12-01

    interface with the AAARF main process . The source code for the AAARF class-common library is in the common subdi- rectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor

  15. GRAPE-6A: A Single-Card GRAPE-6 for Parallel PC-GRAPE Cluster Systems

    NASA Astrophysics Data System (ADS)

    Fukushige, Toshiyuki; Makino, Junichiro; Kawai, Atsushi

    2005-12-01

    In this paper, we describe the design and performance of GRAPE-6A, a special-purpose computer for gravitational many-body simulations. It was designed to be used with a PC cluster, in which each node has one GRAPE-6A. Such a configuration is particularly cost-effective in running parallel tree algorithms. Though the use of parallel tree algorithms was possible with the original GRAPE-6 hardware, it was not very cost-effective since a single GRAPE-6 board was still too fast and too expensive. Therefore, we designed GRAPE-6A as a single PCI card to minimize the reproduction cost and to optimize the computing speed. The peak performance is 130 Gflops for one GRAPE-6A board and 3.1 Tflops for our 24 node cluster. We describe the implementation of the tree, TreePM and individual timestep algorithms on both a single GRAPE-6A system and GRAPE-6A cluster. Using the tree algorithm on our 16-node GRAPE-6A system, we can complete a collisionless simulation with 100 million particles (8000 steps) within 10 days.

  16. An integrated runtime and compile-time approach for parallelizing structured and block structured applications

    NASA Technical Reports Server (NTRS)

    Agrawal, Gagan; Sussman, Alan; Saltz, Joel

    1993-01-01

    Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). A combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion was described. A runtime library which can be used to port these applications on distributed memory machines was designed and implemented. The library is currently implemented on several different systems. To further ease the task of application programmers, methods were developed for integrating this runtime library with compilers for HPK-like parallel programming languages. How this runtime library was integrated with the Fortran 90D compiler being developed at Syracuse University is discussed. Experimental results to demonstrate the efficacy of our approach are presented. A multiblock Navier-Stokes solver template and a multigrid code were experimented with. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20 percent of the code parallelized by manually inserting calls to the runtime library.

  17. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.

  18. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  19. Retargeting of existing FORTRAN program and development of parallel compilers

    NASA Technical Reports Server (NTRS)

    Agrawal, Dharma P.

    1988-01-01

    The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: flexible granularity model, which allows a compromise between two extreme granularity models; communication model, which is capable of precisely describing the interprocessor communication timings and patterns; loop type detection strategy, which identifies different types of loops; critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and loop allocation strategy, which realizes optimum overlapped operations between computation and communication of the system. Using these models, several sample routines of the AIR3D package are examined and tested. It may be noted that automatically generated codes are highly parallelized to provide the maximized degree of parallelism, obtaining the speedup up to a 28 to 32-processor system. A comparison of parallel codes for both the existing and proposed communication model, is performed and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient codes than existing techniques. Work is progressing well in completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.

  20. High Performance Input/Output for Parallel Computer Systems

    NASA Technical Reports Server (NTRS)

    Ligon, W. B.

    1996-01-01

    The goal of our project is to study the I/O characteristics of parallel applications used in Earth Science data processing systems such as Regional Data Centers (RDCs) or EOSDIS. Our approach is to study the runtime behavior of typical programs and the effect of key parameters of the I/O subsystem both under simulation and with direct experimentation on parallel systems. Our three year activity has focused on two items: developing a test bed that facilitates experimentation with parallel I/O, and studying representative programs from the Earth science data processing application domain. The Parallel Virtual File System (PVFS) has been developed for use on a number of platforms including the Tiger Parallel Architecture Workbench (TPAW) simulator, The Intel Paragon, a cluster of DEC Alpha workstations, and the Beowulf system (at CESDIS). PVFS provides considerable flexibility in configuring I/O in a UNIX- like environment. Access to key performance parameters facilitates experimentation. We have studied several key applications fiom levels 1,2 and 3 of the typical RDC processing scenario including instrument calibration and navigation, image classification, and numerical modeling codes. We have also considered large-scale scientific database codes used to organize image data.

  1. Parallel software for lattice N = 4 supersymmetric Yang-Mills theory

    NASA Astrophysics Data System (ADS)

    Schaich, David; DeGrand, Thomas

    2015-05-01

    We present new parallel software, SUSY LATTICE, for lattice studies of four-dimensional N = 4 supersymmetric Yang-Mills theory with gauge group SU(N). The lattice action is constructed to exactly preserve a single supersymmetry charge at non-zero lattice spacing, up to additional potential terms included to stabilize numerical simulations. The software evolved from the MILC code for lattice QCD, and retains a similar large-scale framework despite the different target theory. Many routines are adapted from an existing serial code (Catterall and Joseph, 2012), which SUSY LATTICE supersedes. This paper provides an overview of the new parallel software, summarizing the lattice system, describing the applications that are currently provided and explaining their basic workflow for non-experts in lattice gauge theory. We discuss the parallel performance of the code, and highlight some notable aspects of the documentation for those interested in contributing to its future development.

  2. Parallelization of the FLAPW method

    NASA Astrophysics Data System (ADS)

    Canning, A.; Mannstadt, W.; Freeman, A. J.

    2000-08-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.

  3. Parallelization of sequential Gaussian, indicator and direct simulation algorithms

    NASA Astrophysics Data System (ADS)

    Nunes, Ruben; Almeida, José A.

    2010-08-01

    Improving the performance and robustness of algorithms on new high-performance parallel computing architectures is a key issue in efficiently performing 2D and 3D studies with large amount of data. In geostatistics, sequential simulation algorithms are good candidates for parallelization. When compared with other computational applications in geosciences (such as fluid flow simulators), sequential simulation software is not extremely computationally intensive, but parallelization can make it more efficient and creates alternatives for its integration in inverse modelling approaches. This paper describes the implementation and benchmarking of a parallel version of the three classic sequential simulation algorithms: direct sequential simulation (DSS), sequential indicator simulation (SIS) and sequential Gaussian simulation (SGS). For this purpose, the source used was GSLIB, but the entire code was extensively modified to take into account the parallelization approach and was also rewritten in the C programming language. The paper also explains in detail the parallelization strategy and the main modifications. Regarding the integration of secondary information, the DSS algorithm is able to perform simple kriging with local means, kriging with an external drift and collocated cokriging with both local and global correlations. SIS includes a local correction of probabilities. Finally, a brief comparison is presented of simulation results using one, two and four processors. All performance tests were carried out on 2D soil data samples. The source code is completely open source and easy to read. It should be noted that the code is only fully compatible with Microsoft Visual C and should be adapted for other systems/compilers.

  4. The Complexity of Parallel Algorithms,

    DTIC Science & Technology

    1985-11-01

    programns have been written for se(luiential coiipn ters. Many p~eop~le want coimp ~ilers dihal. will c(nimpile t he, code for parallel machines, to avoid...between two vertices. We also rely on parallel algorithms for maintaining data structures and manipulating graphs. We do not go into the details of these...Jpatlis and maintain connected coimp ~onents. The routine is: - 35 .- ExtendPath(r, Q, V) begin P +-0; s 4- while there is a path in V - P from s to a vertex

  5. Parallel DSMC Solution of Three-Dimensional Flow Over a Finite Flat Plate

    NASA Technical Reports Server (NTRS)

    Nance, Robert P.; Wilmoth, Richard G.; Moon, Bongki; Hassan, H. A.; Saltz, Joel

    1994-01-01

    This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.

  6. Parallel programming of industrial applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heroux, M; Koniges, A; Simon, H

    1998-07-21

    In the introductory material, we overview the typical MPP environment for real application computing and the special tools available such as parallel debuggers and performance analyzers. Next, we draw from a series of real applications codes and discuss the specific challenges and problems that are encountered in parallelizing these individual applications. The application areas drawn from include biomedical sciences, materials processing and design, plasma and fluid dynamics, and others. We show how it was possible to get a particular application to run efficiently and what steps were necessary. Finally we end with a summary of the lessons learned from thesemore » applications and predictions for the future of industrial parallel computing. This tutorial is based on material from a forthcoming book entitled: "Industrial Strength Parallel Computing" to be published by Morgan Kaufmann Publishers (ISBN l-55860-54).« less

  7. Parallel-In-Time For Moving Meshes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Falgout, R. D.; Manteuffel, T. A.; Southworth, B.

    2016-02-04

    With steadily growing computational resources available, scientists must develop e ective ways to utilize the increased resources. High performance, highly parallel software has be- come a standard. However until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial di erential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing se- quential codes with only minor modi cations. In this work, a rezoning-type moving mesh is appliedmore » to a di usion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.« less

  8. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  9. Integrated Task And Data Parallel Programming: Language Design

    NASA Technical Reports Server (NTRS)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated

  10. Neural representation of objects in space: a dual coding account.

    PubMed Central

    Humphreys, G W

    1998-01-01

    I present evidence on the nature of object coding in the brain and discuss the implications of this coding for models of visual selective attention. Neuropsychological studies of task-based constraints on: (i) visual neglect; and (ii) reading and counting, reveal the existence of parallel forms of spatial representation for objects: within-object representations, where elements are coded as parts of objects, and between-object representations, where elements are coded as independent objects. Aside from these spatial codes for objects, however, the coding of visual space is limited. We are extremely poor at remembering small spatial displacements across eye movements, indicating (at best) impoverished coding of spatial position per se. Also, effects of element separation on spatial extinction can be eliminated by filling the space with an occluding object, indicating that spatial effects on visual selection are moderated by object coding. Overall, there are separate limits on visual processing reflecting: (i) the competition to code parts within objects; (ii) the small number of independent objects that can be coded in parallel; and (iii) task-based selection of whether within- or between-object codes determine behaviour. Between-object coding may be linked to the dorsal visual system while parallel coding of parts within objects takes place in the ventral system, although there may additionally be some dorsal involvement either when attention must be shifted within objects or when explicit spatial coding of parts is necessary for object identification. PMID:9770227

  11. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  12. Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sarje, Abhinav; Jacobsen, Douglas W.; Williams, Samuel W.

    The incorporation of increasing core counts in modern processors used to build state-of-the-art supercomputers is driving application development towards exploitation of thread parallelism, in addition to distributed memory parallelism, with the goal of delivering efficient high-performance codes. In this work we describe the exploitation of threading and our experiences with it with respect to a real-world ocean modeling application code, MPAS-Ocean. We present detailed performance analysis and comparisons of various approaches and configurations for threading on the Cray XC series supercomputers.

  13. Error Control Coding Techniques for Space and Satellite Communications

    NASA Technical Reports Server (NTRS)

    Costello, Daniel J., Jr.; Takeshita, Oscar Y.; Cabral, Hermano A.

    1998-01-01

    It is well known that the BER performance of a parallel concatenated turbo-code improves roughly as 1/N, where N is the information block length. However, it has been observed by Benedetto and Montorsi that for most parallel concatenated turbo-codes, the FER performance does not improve monotonically with N. In this report, we study the FER of turbo-codes, and the effects of their concatenation with an outer code. Two methods of concatenation are investigated: across several frames and within each frame. Some asymmetric codes are shown to have excellent FER performance with an information block length of 16384. We also show that the proposed outer coding schemes can improve the BER performance as well by eliminating pathological frames generated by the iterative MAP decoding process.

  14. Parallel programming with Easy Java Simulations

    NASA Astrophysics Data System (ADS)

    Esquembre, F.; Christian, W.; Belloni, M.

    2018-01-01

    Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a java-based programming environment to treat problems in the usual undergraduate curriculum. We use the easy java simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.

  15. Parallel Implementation of the Discontinuous Galerkin Method

    NASA Technical Reports Server (NTRS)

    Baggag, Abdalkader; Atkins, Harold; Keyes, David

    1999-01-01

    This paper describes a parallel implementation of the discontinuous Galerkin method. Discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in all object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects.

  16. Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

    Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patternsmore » (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.« less

  17. Global tree network for computing structures enabling global processing operations

    DOEpatents

    Blumrich; Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.

    2010-01-19

    A system and method for enabling high-speed, low-latency global tree network communications among processing nodes interconnected according to a tree network structure. The global tree network enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations performed include one or more of: broadcast operations downstream from a root node to leaf nodes of a virtual tree, reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node. The global tree network is configurable to provide global barrier and interrupt functionality in asynchronous or synchronized manner, and, is physically and logically partitionable.

  18. Scalable Parallel Algorithms for Multidimensional Digital Signal Processing

    DTIC Science & Technology

    1991-12-31

    Proceedings, San Diego CL., August 1989, pp. 132-146. 53 [13] A. L. Gorin, L. Auslander, and A. Silberger . Balanced computation of 2D trans- forms on a tree...Speech, Signal Processing. ASSP-34, Oct. 1986,pp. 1301-1309. [24] A. Norton and A. Silberger . Parallelization and performance analysis of the Cooley-Tukey

  19. Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Logan, Terry G.

    1994-01-01

    The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.

  20. Xyce Parallel Electronic Simulator Users Guide Version 6.2.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R.; Mei, Ting; Russo, Thomas V.

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2014 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are

  1. Xyce Parallel Electronic Simulator Users Guide Version 6.4

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R.; Mei, Ting; Russo, Thomas V.

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. Trademarks The information herein is subject to change without notice. Copyright c 2002-2015 Sandia Corporation. All rights reserved. Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Portions of the Xyce TM code are: Copyright c 2002, The Regents of the University of California. Produced at the Lawrence Livermore National Laboratory. Written by Alan Hindmarsh, Allan Taylor, Radu Serban. UCRL-CODE-2002-59 All rights reserved. Orcad, Orcad Capture, PSpice and Probe are

  2. Xyce Parallel Electronic Simulator : users' guide, version 2.0.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoekstra, Robert John; Waters, Lon J.; Rankin, Eric Lamont

    2004-06-01

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator capable of simulating electrical circuits at a variety of abstraction levels. Primarily, Xyce has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability the current state-of-the-art in the following areas: {sm_bullet} Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. {sm_bullet} Improved performance for allmore » numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. {sm_bullet} Device models which are specifically tailored to meet Sandia's needs, including many radiation-aware devices. {sm_bullet} A client-server or multi-tiered operating model wherein the numerical kernel can operate independently of the graphical user interface (GUI). {sm_bullet} Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing of computing platforms. These include serial, shared-memory and distributed-memory parallel implementation - which allows it to run efficiently on the widest possible number parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. One feature required by designers is the ability to add device models, many specific to the needs of Sandia, to the code. To this end, the device package in the

  3. Fault trees for decision making in systems analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lambert, Howard E.

    1975-10-09

    The application of fault tree analysis (FTA) to system safety and reliability is presented within the framework of system safety analysis. The concepts and techniques involved in manual and automated fault tree construction are described and their differences noted. The theory of mathematical reliability pertinent to FTA is presented with emphasis on engineering applications. An outline of the quantitative reliability techniques of the Reactor Safety Study is given. Concepts of probabilistic importance are presented within the fault tree framework and applied to the areas of system design, diagnosis and simulation. The computer code IMPORTANCE ranks basic events and cut setsmore » according to a sensitivity analysis. A useful feature of the IMPORTANCE code is that it can accept relative failure data as input. The output of the IMPORTANCE code can assist an analyst in finding weaknesses in system design and operation, suggest the most optimal course of system upgrade, and determine the optimal location of sensors within a system. A general simulation model of system failure in terms of fault tree logic is described. The model is intended for efficient diagnosis of the causes of system failure in the event of a system breakdown. It can also be used to assist an operator in making decisions under a time constraint regarding the future course of operations. The model is well suited for computer implementation. New results incorporated in the simulation model include an algorithm to generate repair checklists on the basis of fault tree logic and a one-step-ahead optimization procedure that minimizes the expected time to diagnose system failure.« less

  4. Optimisation of a parallel ocean general circulation model

    NASA Astrophysics Data System (ADS)

    Beare, M. I.; Stevens, D. P.

    1997-10-01

    This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.

  5. Visual analysis of inter-process communication for large-scale parallel computing.

    PubMed

    Muelder, Chris; Gygi, Francois; Ma, Kwan-Liu

    2009-01-01

    In serial computation, program profiling is often helpful for optimization of key sections of code. When moving to parallel computation, not only does the code execution need to be considered but also communication between the different processes which can induce delays that are detrimental to performance. As the number of processes increases, so does the impact of the communication delays on performance. For large-scale parallel applications, it is critical to understand how the communication impacts performance in order to make the code more efficient. There are several tools available for visualizing program execution and communications on parallel systems. These tools generally provide either views which statistically summarize the entire program execution or process-centric views. However, process-centric visualizations do not scale well as the number of processes gets very large. In particular, the most common representation of parallel processes is a Gantt char t with a row for each process. As the number of processes increases, these charts can become difficult to work with and can even exceed screen resolution. We propose a new visualization approach that affords more scalability and then demonstrate it on systems running with up to 16,384 processes.

  6. An interactive programme for weighted Steiner trees

    NASA Astrophysics Data System (ADS)

    Zanchetta do Nascimento, Marcelo; Ramos Batista, Valério; Raffa Coimbra, Wendhel

    2015-01-01

    We introduce a fully written programmed code with a supervised method for generating weighted Steiner trees. Our choice of the programming language, and the use of well- known theorems from Geometry and Complex Analysis, allowed this method to be implemented with only 764 lines of effective source code. This eases the understanding and the handling of this beta version for future developments.

  7. Nuclide Depletion Capabilities in the Shift Monte Carlo Code

    DOE PAGES

    Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...

    2017-12-21

    A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.

  8. Performance of a plasma fluid code on the Intel parallel computers

    NASA Technical Reports Server (NTRS)

    Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  9. Parallel performance of TORT on the CRAY J90: Model and measurement

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnett, A.; Azmy, Y.Y.

    1997-10-01

    A limitation on the parallel performance of TORT on the CRAY J90 is the amount of extra work introduced by the multitasking algorithm itself. The extra work beyond that of the serial version of the code, called overhead, arises from the synchronization of the parallel tasks and the accumulation of results by the master task. The goal of recent updates to TORT was to reduce the time consumed by these activities. To help understand which components of the multitasking algorithm contribute significantly to the overhead, a parallel performance model was constructed and compared to measurements of actual timings of themore » code.« less

  10. Hamming and Accumulator Codes Concatenated with MPSK or QAM

    NASA Technical Reports Server (NTRS)

    Divsalar, Dariush; Dolinar, Samuel

    2009-01-01

    In a proposed coding-and-modulation scheme, a high-rate binary data stream would be processed as follows: 1. The input bit stream would be demultiplexed into multiple bit streams. 2. The multiple bit streams would be processed simultaneously into a high-rate outer Hamming code that would comprise multiple short constituent Hamming codes a distinct constituent Hamming code for each stream. 3. The streams would be interleaved. The interleaver would have a block structure that would facilitate parallelization for high-speed decoding. 4. The interleaved streams would be further processed simultaneously into an inner two-state, rate-1 accumulator code that would comprise multiple constituent accumulator codes - a distinct accumulator code for each stream. 5. The resulting bit streams would be mapped into symbols to be transmitted by use of a higher-order modulation - for example, M-ary phase-shift keying (MPSK) or quadrature amplitude modulation (QAM). The novelty of the scheme lies in the concatenation of the multiple-constituent Hamming and accumulator codes and the corresponding parallel architectures of the encoder and decoder circuitry (see figure) needed to process the multiple bit streams simultaneously. As in the cases of other parallel-processing schemes, one advantage of this scheme is that the overall data rate could be much greater than the data rate of each encoder and decoder stream and, hence, the encoder and decoder could handle data at an overall rate beyond the capability of the individual encoder and decoder circuits.

  11. Hierarchial parallel computer architecture defined by computational multidisciplinary mechanics

    NASA Technical Reports Server (NTRS)

    Padovan, Joe; Gute, Doug; Johnson, Keith

    1989-01-01

    The goal is to develop an architecture for parallel processors enabling optimal handling of multi-disciplinary computation of fluid-solid simulations employing finite element and difference schemes. The goals, philosphical and modeling directions, static and dynamic poly trees, example problems, interpolative reduction, the impact on solvers are shown in viewgraph form.

  12. Constructions for finite-state codes

    NASA Technical Reports Server (NTRS)

    Pollara, F.; Mceliece, R. J.; Abdel-Ghaffar, K.

    1987-01-01

    A class of codes called finite-state (FS) codes is defined and investigated. These codes, which generalize both block and convolutional codes, are defined by their encoders, which are finite-state machines with parallel inputs and outputs. A family of upper bounds on the free distance of a given FS code is derived from known upper bounds on the minimum distance of block codes. A general construction for FS codes is then given, based on the idea of partitioning a given linear block into cosets of one of its subcodes, and it is shown that in many cases the FS codes constructed in this way have a d sub free which is as large as possible. These codes are found without the need for lengthy computer searches, and have potential applications for future deep-space coding systems. The issue of catastropic error propagation (CEP) for FS codes is also investigated.

  13. Massively parallel multicanonical simulations

    NASA Astrophysics Data System (ADS)

    Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard

    2018-03-01

    Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.

  14. Parallelization of Lower-Upper Symmetric Gauss-Seidel Method for Chemically Reacting Flow

    NASA Technical Reports Server (NTRS)

    Yoon, Seokkwan; Jost, Gabriele; Chang, Sherry

    2005-01-01

    Development of technologies for exploration of the solar system has revived an interest in computational simulation of chemically reacting flows since planetary probe vehicles exhibit non-equilibrium phenomena during the atmospheric entry of a planet or a moon as well as the reentry to the Earth. Stability in combustion is essential for new propulsion systems. Numerical solution of real-gas flows often increases computational work by an order-of-magnitude compared to perfect gas flow partly because of the increased complexity of equations to solve. Recently, as part of Project Columbia, NASA has integrated a cluster of interconnected SGI Altix systems to provide a ten-fold increase in current supercomputing capacity that includes an SGI Origin system. Both the new and existing machines are based on cache coherent non-uniform memory access architecture. Lower-Upper Symmetric Gauss-Seidel (LU-SGS) relaxation method has been implemented into both perfect and real gas flow codes including Real-Gas Aerodynamic Simulator (RGAS). However, the vectorized RGAS code runs inefficiently on cache-based shared-memory machines such as SGI system. Parallelization of a Gauss-Seidel method is nontrivial due to its sequential nature. The LU-SGS method has been vectorized on an oblique plane in INS3D-LU code that has been one of the base codes for NAS Parallel benchmarks. The oblique plane has been called a hyperplane by computer scientists. It is straightforward to parallelize a Gauss-Seidel method by partitioning the hyperplanes once they are formed. Another way of parallelization is to schedule processors like a pipeline using software. Both hyperplane and pipeline methods have been implemented using openMP directives. The present paper reports the performance of the parallelized RGAS code on SGI Origin and Altix systems.

  15. Parallel workflow manager for non-parallel bioinformatic applications to solve large-scale biological problems on a supercomputer.

    PubMed

    Suplatov, Dmitry; Popova, Nina; Zhumatiy, Sergey; Voevodin, Vladimir; Švedas, Vytas

    2016-04-01

    Rapid expansion of online resources providing access to genomic, structural, and functional information associated with biological macromolecules opens an opportunity to gain a deeper understanding of the mechanisms of biological processes due to systematic analysis of large datasets. This, however, requires novel strategies to optimally utilize computer processing power. Some methods in bioinformatics and molecular modeling require extensive computational resources. Other algorithms have fast implementations which take at most several hours to analyze a common input on a modern desktop station, however, due to multiple invocations for a large number of subtasks the full task requires a significant computing power. Therefore, an efficient computational solution to large-scale biological problems requires both a wise parallel implementation of resource-hungry methods as well as a smart workflow to manage multiple invocations of relatively fast algorithms. In this work, a new computer software mpiWrapper has been developed to accommodate non-parallel implementations of scientific algorithms within the parallel supercomputing environment. The Message Passing Interface has been implemented to exchange information between nodes. Two specialized threads - one for task management and communication, and another for subtask execution - are invoked on each processing unit to avoid deadlock while using blocking calls to MPI. The mpiWrapper can be used to launch all conventional Linux applications without the need to modify their original source codes and supports resubmission of subtasks on node failure. We show that this approach can be used to process huge amounts of biological data efficiently by running non-parallel programs in parallel mode on a supercomputer. The C++ source code and documentation are available from http://biokinet.belozersky.msu.ru/mpiWrapper .

  16. GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes

    NASA Astrophysics Data System (ADS)

    Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal

    2013-11-01

    We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.

  17. GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal

    We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparingmore » theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.« less

  18. 3D streamers simulation in a pin to plane configuration using massively parallel computing

    NASA Astrophysics Data System (ADS)

    Plewa, J.-M.; Eichwald, O.; Ducasse, O.; Dessante, P.; Jacobs, C.; Renon, N.; Yousfi, M.

    2018-03-01

    This paper concerns the 3D simulation of corona discharge using high performance computing (HPC) managed with the message passing interface (MPI) library. In the field of finite volume methods applied on non-adaptive mesh grids and in the case of a specific 3D dynamic benchmark test devoted to streamer studies, the great efficiency of the iterative R&B SOR and BiCGSTAB methods versus the direct MUMPS method was clearly demonstrated in solving the Poisson equation using HPC resources. The optimization of the parallelization and the resulting scalability was undertaken as a function of the HPC architecture for a number of mesh cells ranging from 8 to 512 million and a number of cores ranging from 20 to 1600. The R&B SOR method remains at least about four times faster than the BiCGSTAB method and requires significantly less memory for all tested situations. The R&B SOR method was then implemented in a 3D MPI parallelized code that solves the classical first order model of an atmospheric pressure corona discharge in air. The 3D code capabilities were tested by following the development of one, two and four coplanar streamers generated by initial plasma spots for 6 ns. The preliminary results obtained allowed us to follow in detail the formation of the tree structure of a corona discharge and the effects of the mutual interactions between the streamers in terms of streamer velocity, trajectory and diameter. The computing time for 64 million of mesh cells distributed over 1000 cores using the MPI procedures is about 30 min ns-1, regardless of the number of streamers.

  19. Multiple description distributed image coding with side information for mobile wireless transmission

    NASA Astrophysics Data System (ADS)

    Wu, Min; Song, Daewon; Chen, Chang Wen

    2005-03-01

    Multiple description coding (MDC) is a source coding technique that involves coding the source information into multiple descriptions, and then transmitting them over different channels in packet network or error-prone wireless environment to achieve graceful degradation if parts of descriptions are lost at the receiver. In this paper, we proposed a multiple description distributed wavelet zero tree image coding system for mobile wireless transmission. We provide two innovations to achieve an excellent error resilient capability. First, when MDC is applied to wavelet subband based image coding, it is possible to introduce correlation between the descriptions in each subband. We consider using such a correlation as well as potentially error corrupted description as side information in the decoding to formulate the MDC decoding as a Wyner Ziv decoding problem. If only part of descriptions is lost, however, their correlation information is still available, the proposed Wyner Ziv decoder can recover the description by using the correlation information and the error corrupted description as side information. Secondly, in each description, single bitstream wavelet zero tree coding is very vulnerable to the channel errors. The first bit error may cause the decoder to discard all subsequent bits whether or not the subsequent bits are correctly received. Therefore, we integrate the multiple description scalar quantization (MDSQ) with the multiple wavelet tree image coding method to reduce error propagation. We first group wavelet coefficients into multiple trees according to parent-child relationship and then code them separately by SPIHT algorithm to form multiple bitstreams. Such decomposition is able to reduce error propagation and therefore improve the error correcting capability of Wyner Ziv decoder. Experimental results show that the proposed scheme not only exhibits an excellent error resilient performance but also demonstrates graceful degradation over the packet

  20. Methods of parallel computation applied on granular simulations

    NASA Astrophysics Data System (ADS)

    Martins, Gustavo H. B.; Atman, Allbens P. F.

    2017-06-01

    Every year, parallel computing has becoming cheaper and more accessible. As consequence, applications were spreading over all research areas. Granular materials is a promising area for parallel computing. To prove this statement we study the impact of parallel computing in simulations of the BNE (Brazil Nut Effect). This property is due the remarkable arising of an intruder confined to a granular media when vertically shaken against gravity. By means of DEM (Discrete Element Methods) simulations, we study the code performance testing different methods to improve clock time. A comparison between serial and parallel algorithms, using OpenMP® is also shown. The best improvement was obtained by optimizing the function that find contacts using Verlet's cells.

  1. Computational mechanics analysis tools for parallel-vector supercomputers

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.; Nguyen, Duc T.; Baddourah, Majdi; Qin, Jiangning

    1993-01-01

    Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available.

  2. Parallel and Portable Monte Carlo Particle Transport

    NASA Astrophysics Data System (ADS)

    Lee, S. R.; Cummings, J. C.; Nolen, S. D.; Keen, N. D.

    1997-08-01

    We have developed a multi-group, Monte Carlo neutron transport code in C++ using object-oriented methods and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and α eigenvalues of the neutron transport equation on a rectilinear computational mesh. It is portable to and runs in parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities are discussed, along with physics and performance results for several test problems on a variety of hardware, including all three Accelerated Strategic Computing Initiative (ASCI) platforms. Current parallel performance indicates the ability to compute α-eigenvalues in seconds or minutes rather than days or weeks. Current and future work on the implementation of a general transport physics framework (TPF) is also described. This TPF employs modern C++ programming techniques to provide simplified user interfaces, generic STL-style programming, and compile-time performance optimization. Physics capabilities of the TPF will be extended to include continuous energy treatments, implicit Monte Carlo algorithms, and a variety of convergence acceleration techniques such as importance combing.

  3. Creating ensembles of oblique decision trees with evolutionary algorithms and sampling

    DOEpatents

    Cantu-Paz, Erick [Oakland, CA; Kamath, Chandrika [Tracy, CA

    2006-06-13

    A decision tree system that is part of a parallel object-oriented pattern recognition system, which in turn is part of an object oriented data mining system. A decision tree process includes the step of reading the data. If necessary, the data is sorted. A potential split of the data is evaluated according to some criterion. An initial split of the data is determined. The final split of the data is determined using evolutionary algorithms and statistical sampling techniques. The data is split. Multiple decision trees are combined in ensembles.

  4. Parallelization of the TRIGRS model for rainfall-induced landslides using the message passing interface

    USGS Publications Warehouse

    Alvioli, M.; Baum, R.L.

    2016-01-01

    We describe a parallel implementation of TRIGRS, the Transient Rainfall Infiltration and Grid-Based Regional Slope-Stability Model for the timing and distribution of rainfall-induced shallow landslides. We have parallelized the four time-demanding execution modes of TRIGRS, namely both the saturated and unsaturated model with finite and infinite soil depth options, within the Message Passing Interface framework. In addition to new features of the code, we outline details of the parallel implementation and show the performance gain with respect to the serial code. Results are obtained both on commercial hardware and on a high-performance multi-node machine, showing the different limits of applicability of the new code. We also discuss the implications for the application of the model on large-scale areas and as a tool for real-time landslide hazard monitoring.

  5. Toward an automated parallel computing environment for geosciences

    NASA Astrophysics Data System (ADS)

    Zhang, Huai; Liu, Mian; Shi, Yaolin; Yuen, David A.; Yan, Zhenzhen; Liang, Guoping

    2007-08-01

    Software for geodynamic modeling has not kept up with the fast growing computing hardware and network resources. In the past decade supercomputing power has become available to most researchers in the form of affordable Beowulf clusters and other parallel computer platforms. However, to take full advantage of such computing power requires developing parallel algorithms and associated software, a task that is often too daunting for geoscience modelers whose main expertise is in geosciences. We introduce here an automated parallel computing environment built on open-source algorithms and libraries. Users interact with this computing environment by specifying the partial differential equations, solvers, and model-specific properties using an English-like modeling language in the input files. The system then automatically generates the finite element codes that can be run on distributed or shared memory parallel machines. This system is dynamic and flexible, allowing users to address different problems in geosciences. It is capable of providing web-based services, enabling users to generate source codes online. This unique feature will facilitate high-performance computing to be integrated with distributed data grids in the emerging cyber-infrastructures for geosciences. In this paper we discuss the principles of this automated modeling environment and provide examples to demonstrate its versatility.

  6. Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN

    PubMed Central

    Hammond, G E; Lichtner, P C; Mills, R T

    2014-01-01

    [1] To better inform the subsurface scientist on the expected performance of parallel simulators, this work investigates performance of the reactive multiphase flow and multicomponent biogeochemical transport code PFLOTRAN as it is applied to several realistic modeling scenarios run on the Jaguar supercomputer. After a brief introduction to the code's parallel layout and code design, PFLOTRAN's parallel performance (measured through strong and weak scalability analyses) is evaluated in the context of conceptual model layout, software and algorithmic design, and known hardware limitations. PFLOTRAN scales well (with regard to strong scaling) for three realistic problem scenarios: (1) in situ leaching of copper from a mineral ore deposit within a 5-spot flow regime, (2) transient flow and solute transport within a regional doublet, and (3) a real-world problem involving uranium surface complexation within a heterogeneous and extremely dynamic variably saturated flow field. Weak scalability is discussed in detail for the regional doublet problem, and several difficulties with its interpretation are noted. PMID:25506097

  7. Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN.

    PubMed

    Hammond, G E; Lichtner, P C; Mills, R T

    2014-01-01

    [1] To better inform the subsurface scientist on the expected performance of parallel simulators, this work investigates performance of the reactive multiphase flow and multicomponent biogeochemical transport code PFLOTRAN as it is applied to several realistic modeling scenarios run on the Jaguar supercomputer. After a brief introduction to the code's parallel layout and code design, PFLOTRAN's parallel performance (measured through strong and weak scalability analyses) is evaluated in the context of conceptual model layout, software and algorithmic design, and known hardware limitations. PFLOTRAN scales well (with regard to strong scaling) for three realistic problem scenarios: (1) in situ leaching of copper from a mineral ore deposit within a 5-spot flow regime, (2) transient flow and solute transport within a regional doublet, and (3) a real-world problem involving uranium surface complexation within a heterogeneous and extremely dynamic variably saturated flow field. Weak scalability is discussed in detail for the regional doublet problem, and several difficulties with its interpretation are noted.

  8. Parallel evolutionary computation in bioinformatics applications.

    PubMed

    Pinho, Jorge; Sobral, João Luis; Rocha, Miguel

    2013-05-01

    A large number of optimization problems within the field of Bioinformatics require methods able to handle its inherent complexity (e.g. NP-hard problems) and also demand increased computational efforts. In this context, the use of parallel architectures is a necessity. In this work, we propose ParJECoLi, a Java based library that offers a large set of metaheuristic methods (such as Evolutionary Algorithms) and also addresses the issue of its efficient execution on a wide range of parallel architectures. The proposed approach focuses on the easiness of use, making the adaptation to distinct parallel environments (multicore, cluster, grid) transparent to the user. Indeed, this work shows how the development of the optimization library can proceed independently of its adaptation for several architectures, making use of Aspect-Oriented Programming. The pluggable nature of parallelism related modules allows the user to easily configure its environment, adding parallelism modules to the base source code when needed. The performance of the platform is validated with two case studies within biological model optimization. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  9. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2001-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  10. Relative Debugging of Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2002-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify, the program execution with out changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  11. Tree-root control of shallow landslides

    NASA Astrophysics Data System (ADS)

    Cohen, Denis; Schwarz, Massimiliano

    2017-08-01

    Tree roots have long been recognized to increase slope stability by reinforcing the strength of soils. Slope stability models usually include the effects of roots by adding an apparent cohesion to the soil to simulate root strength. No model includes the combined effects of root distribution heterogeneity, stress-strain behavior of root reinforcement, or root strength in compression. Recent field observations, however, indicate that shallow landslide triggering mechanisms are characterized by differential deformation that indicates localized activation of zones in tension, compression, and shear in the soil. Here we describe a new model for slope stability that specifically considers these effects. The model is a strain-step discrete element model that reproduces the self-organized redistribution of forces on a slope during rainfall-triggered shallow landslides. We use a conceptual sigmoidal-shaped hillslope with a clearing in its center to explore the effects of tree size, spacing, weak zones, maximum root-size diameter, and different root strength configurations. Simulation results indicate that tree roots can stabilize slopes that would otherwise fail without them and, in general, higher root density with higher root reinforcement results in a more stable slope. The variation in root stiffness with diameter can, in some cases, invert this relationship. Root tension provides more resistance to failure than root compression but roots with both tension and compression offer the best resistance to failure. Lateral (slope-parallel) tension can be important in cases when the magnitude of this force is comparable to the slope-perpendicular tensile force. In this case, lateral forces can bring to failure tree-covered areas with high root reinforcement. Slope failure occurs when downslope soil compression reaches the soil maximum strength. When this occurs depends on the amount of root tension upslope in both the slope-perpendicular and slope-parallel directions. Roots

  12. Parallel Subspace Subcodes of Reed-Solomon Codes for Magnetic Recording Channels

    ERIC Educational Resources Information Center

    Wang, Han

    2010-01-01

    Read channel architectures based on a single low-density parity-check (LDPC) code are being considered for the next generation of hard disk drives. However, LDPC-only solutions suffer from the error floor problem, which may compromise reliability, if not handled properly. Concatenated architectures using an LDPC code plus a Reed-Solomon (RS) code…

  13. Parallel multiscale simulations of a brain aneurysm

    PubMed Central

    Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em

    2012-01-01

    Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver εκ αr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers ( εκ αr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future

  14. Parallel multiscale simulations of a brain aneurysm.

    PubMed

    Grinberg, Leopold; Fedosov, Dmitry A; Karniadakis, George Em

    2013-07-01

    Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver εκ αr . The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers ( εκ αr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future

  15. Parallel multiscale simulations of a brain aneurysm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em, E-mail: george_karniadakis@brown.edu

    2013-07-01

    Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm.more » The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed

  16. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.

  17. Runtime Detection of C-Style Errors in UPC Code

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pirkelbauer, P; Liao, C; Panas, T

    2011-09-29

    Unified Parallel C (UPC) extends the C programming language (ISO C 99) with explicit parallel programming support for the partitioned global address space (PGAS), which provides a global memory space with localized partitions to each thread. Like its ancestor C, UPC is a low-level language that emphasizes code efficiency over safety. The absence of dynamic (and static) safety checks allows programmer oversights and software flaws that can be hard to spot. In this paper, we present an extension of a dynamic analysis tool, ROSE-Code Instrumentation and Runtime Monitor (ROSECIRM), for UPC to help programmers find C-style errors involving the globalmore » address space. Built on top of the ROSE source-to-source compiler infrastructure, the tool instruments source files with code that monitors operations and keeps track of changes to the system state. The resulting code is linked to a runtime monitor that observes the program execution and finds software defects. We describe the extensions to ROSE-CIRM that were necessary to support UPC. We discuss complications that arise from parallel code and our solutions. We test ROSE-CIRM against a runtime error detection test suite, and present performance results obtained from running error-free codes. ROSE-CIRM is released as part of the ROSE compiler under a BSD-style open source license.« less

  18. Full Wave Parallel Code for Modeling RF Fields in Hot Plasmas

    NASA Astrophysics Data System (ADS)

    Spencer, Joseph; Svidzinski, Vladimir; Evstatiev, Evstati; Galkin, Sergei; Kim, Jin-Soo

    2015-11-01

    FAR-TECH, Inc. is developing a suite of full wave RF codes in hot plasmas. It is based on a formulation in configuration space with grid adaptation capability. The conductivity kernel (which includes a nonlocal dielectric response) is calculated by integrating the linearized Vlasov equation along unperturbed test particle orbits. For Tokamak applications a 2-D version of the code is being developed. Progress of this work will be reported. This suite of codes has the following advantages over existing spectral codes: 1) It utilizes the localized nature of plasma dielectric response to the RF field and calculates this response numerically without approximations. 2) It uses an adaptive grid to better resolve resonances in plasma and antenna structures. 3) It uses an efficient sparse matrix solver to solve the formulated linear equations. The linear wave equation is formulated using two approaches: for cold plasmas the local cold plasma dielectric tensor is used (resolving resonances by particle collisions), while for hot plasmas the conductivity kernel is calculated. Work is supported by the U.S. DOE SBIR program.

  19. Tutorial: Parallel Computing of Simulation Models for Risk Analysis.

    PubMed

    Reilly, Allison C; Staid, Andrea; Gao, Michael; Guikema, Seth D

    2016-10-01

    Simulation models are widely used in risk analysis to study the effects of uncertainties on outcomes of interest in complex problems. Often, these models are computationally complex and time consuming to run. This latter point may be at odds with time-sensitive evaluations or may limit the number of parameters that are considered. In this article, we give an introductory tutorial focused on parallelizing simulation code to better leverage modern computing hardware, enabling risk analysts to better utilize simulation-based methods for quantifying uncertainty in practice. This article is aimed primarily at risk analysts who use simulation methods but do not yet utilize parallelization to decrease the computational burden of these models. The discussion is focused on conceptual aspects of embarrassingly parallel computer code and software considerations. Two complementary examples are shown using the languages MATLAB and R. A brief discussion of hardware considerations is located in the Appendix. © 2016 Society for Risk Analysis.

  20. Pacific Northwest ecoclass codes for seral and potential natural communities.

    Treesearch

    Frederick C. Hall

    1998-01-01

    Lists codes for identification of potential natural communities (plant association, habitat types), their seral status, and vegetation structure in and around the Pacific Northwest. Codes are a six-digit alphanumeric system using the first letter of tree species, life-form, seral status, and structure so that most codes can be directly interpreted. Seven appendices...

  1. A Parallel Numerical Algorithm To Solve Linear Systems Of Equations Emerging From 3D Radiative Transfer

    NASA Astrophysics Data System (ADS)

    Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.

    2016-10-01

    Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.

  2. Use Computer-Aided Tools to Parallelize Large CFD Applications

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Yan, J.

    2000-01-01

    Porting applications to high performance parallel computers is always a challenging task. It is time consuming and costly. With rapid progressing in hardware architectures and increasing complexity of real applications in recent years, the problem becomes even more sever. Today, scalability and high performance are mostly involving handwritten parallel programs using message-passing libraries (e.g. MPI). However, this process is very difficult and often error-prone. The recent reemergence of shared memory parallel (SMP) architectures, such as the cache coherent Non-Uniform Memory Access (ccNUMA) architecture used in the SGI Origin 2000, show good prospects for scaling beyond hundreds of processors. Programming on an SMP is simplified by working in a globally accessible address space. The user can supply compiler directives, such as OpenMP, to parallelize the code. As an industry standard for portable implementation of parallel programs for SMPs, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. It promises an incremental path for parallel conversion of existing software, as well as scalability and performance for a complete rewrite or an entirely new development. Perhaps the main disadvantage of programming with directives is that inserted directives may not necessarily enhance performance. In the worst cases, it can create erroneous results. While vendors have provided tools to perform error-checking and profiling, automation in directive insertion is very limited and often failed on large programs, primarily due to the lack of a thorough enough data dependence analysis. To overcome the deficiency, we have developed a toolkit, CAPO, to automatically insert OpenMP directives in Fortran programs and apply certain degrees of optimization. CAPO is aimed at taking advantage of detailed inter-procedural dependence analysis provided by CAPTools, developed by the University of

  3. PRATHAM: Parallel Thermal Hydraulics Simulations using Advanced Mesoscopic Methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joshi, Abhijit S; Jain, Prashant K; Mudrich, Jaime A

    2012-01-01

    At the Oak Ridge National Laboratory, efforts are under way to develop a 3D, parallel LBM code called PRATHAM (PaRAllel Thermal Hydraulic simulations using Advanced Mesoscopic Methods) to demonstrate the accuracy and scalability of LBM for turbulent flow simulations in nuclear applications. The code has been developed using FORTRAN-90, and parallelized using the message passing interface MPI library. Silo library is used to compact and write the data files, and VisIt visualization software is used to post-process the simulation data in parallel. Both the single relaxation time (SRT) and multi relaxation time (MRT) LBM schemes have been implemented in PRATHAM.more » To capture turbulence without prohibitively increasing the grid resolution requirements, an LES approach [5] is adopted allowing large scale eddies to be numerically resolved while modeling the smaller (subgrid) eddies. In this work, a Smagorinsky model has been used, which modifies the fluid viscosity by an additional eddy viscosity depending on the magnitude of the rate-of-strain tensor. In LBM, this is achieved by locally varying the relaxation time of the fluid.« less

  4. Customizing FP-growth algorithm to parallel mining with Charm++ library

    NASA Astrophysics Data System (ADS)

    Puścian, Marek

    2017-08-01

    This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies Master Slave scheme to frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers basing on their workload. Proposed enhancements have been successfully implemented using Charm++ library. This paper discusses results of the performance of parallelized FP-growth algorithm against different datasets. The approach has been illustrated with many experiments and measurements performed using multiprocessor and multithreaded computer.

  5. Constructing Neuronal Network Models in Massively Parallel Environments.

    PubMed

    Ippen, Tammo; Eppler, Jochen M; Plesser, Hans E; Diesmann, Markus

    2017-01-01

    Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers.

  6. Constructing Neuronal Network Models in Massively Parallel Environments

    PubMed Central

    Ippen, Tammo; Eppler, Jochen M.; Plesser, Hans E.; Diesmann, Markus

    2017-01-01

    Recent advances in the development of data structures to represent spiking neuron network models enable us to exploit the complete memory of petascale computers for a single brain-scale network simulation. In this work, we investigate how well we can exploit the computing power of such supercomputers for the creation of neuronal networks. Using an established benchmark, we divide the runtime of simulation code into the phase of network construction and the phase during which the dynamical state is advanced in time. We find that on multi-core compute nodes network creation scales well with process-parallel code but exhibits a prohibitively large memory consumption. Thread-parallel network creation, in contrast, exhibits speedup only up to a small number of threads but has little overhead in terms of memory. We further observe that the algorithms creating instances of model neurons and their connections scale well for networks of ten thousand neurons, but do not show the same speedup for networks of millions of neurons. Our work uncovers that the lack of scaling of thread-parallel network creation is due to inadequate memory allocation strategies and demonstrates that thread-optimized memory allocators recover excellent scaling. An analysis of the loop order used for network construction reveals that more complex tests on the locality of operations significantly improve scaling and reduce runtime by allowing construction algorithms to step through large networks more efficiently than in existing code. The combination of these techniques increases performance by an order of magnitude and harnesses the increasingly parallel compute power of the compute nodes in high-performance clusters and supercomputers. PMID:28559808

  7. Computational mechanics analysis tools for parallel-vector supercomputers

    NASA Technical Reports Server (NTRS)

    Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

    1993-01-01

    Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.

  8. Parallel design of JPEG-LS encoder on graphics processing units

    NASA Astrophysics Data System (ADS)

    Duan, Hao; Fang, Yong; Huang, Bormin

    2012-01-01

    With recent technical advances in graphic processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on a NVIDIA GPU using the computer unified device architecture (CUDA) programming technology. We use the block parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance with 26.3x speedup over its original CPU code.

  9. Increasing processor utilization during parallel computation rundown

    NASA Technical Reports Server (NTRS)

    Jones, W. H.

    1986-01-01

    Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.

  10. Application of a Scalable, Parallel, Unstructured-Grid-Based Navier-Stokes Solver

    NASA Technical Reports Server (NTRS)

    Parikh, Paresh

    2001-01-01

    A parallel version of an unstructured-grid based Navier-Stokes solver, USM3Dns, previously developed for efficient operation on a variety of parallel computers, has been enhanced to incorporate upgrades made to the serial version. The resultant parallel code has been extensively tested on a variety of problems of aerospace interest and on two sets of parallel computers to understand and document its characteristics. An innovative grid renumbering construct and use of non-blocking communication are shown to produce superlinear computing performance. Preliminary results from parallelization of a recently introduced "porous surface" boundary condition are also presented.

  11. MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program

    NASA Astrophysics Data System (ADS)

    Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.

    2018-02-01

    We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.

  12. Coarse-grained component concurrency in Earth system modeling: parallelizing atmospheric radiative transfer in the GFDL AM3 model using the Flexible Modeling System coupling framework

    NASA Astrophysics Data System (ADS)

    Balaji, V.; Benson, Rusty; Wyman, Bruce; Held, Isaac

    2016-10-01

    Climate models represent a large variety of processes on a variety of timescales and space scales, a canonical example of multi-physics multi-scale modeling. Current hardware trends, such as Graphical Processing Units (GPUs) and Many Integrated Core (MIC) chips, are based on, at best, marginal increases in clock speed, coupled with vast increases in concurrency, particularly at the fine grain. Multi-physics codes face particular challenges in achieving fine-grained concurrency, as different physics and dynamics components have different computational profiles, and universal solutions are hard to come by. We propose here one approach for multi-physics codes. These codes are typically structured as components interacting via software frameworks. The component structure of a typical Earth system model consists of a hierarchical and recursive tree of components, each representing a different climate process or dynamical system. This recursive structure generally encompasses a modest level of concurrency at the highest level (e.g., atmosphere and ocean on different processor sets) with serial organization underneath. We propose to extend concurrency much further by running more and more lower- and higher-level components in parallel with each other. Each component can further be parallelized on the fine grain, potentially offering a major increase in the scalability of Earth system models. We present here first results from this approach, called coarse-grained component concurrency, or CCC. Within the Geophysical Fluid Dynamics Laboratory (GFDL) Flexible Modeling System (FMS), the atmospheric radiative transfer component has been configured to run in parallel with a composite component consisting of every other atmospheric component, including the atmospheric dynamics and all other atmospheric physics components. We will explore the algorithmic challenges involved in such an approach, and present results from such simulations. Plans to achieve even greater levels of

  13. Multiple Independent File Parallel I/O with HDF5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, M. C.

    2016-07-13

    The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(10 6) parallel tasks.

  14. Reconstruction for time-domain in vivo EPR 3D multigradient oximetric imaging--a parallel processing perspective.

    PubMed

    Dharmaraj, Christopher D; Thadikonda, Kishan; Fletcher, Anthony R; Doan, Phuc N; Devasahayam, Nallathamby; Matsumoto, Shingo; Johnson, Calvin A; Cook, John A; Mitchell, James B; Subramanian, Sankaran; Krishna, Murali C

    2009-01-01

    Three-dimensional Oximetric Electron Paramagnetic Resonance Imaging using the Single Point Imaging modality generates unpaired spin density and oxygen images that can readily distinguish between normal and tumor tissues in small animals. It is also possible with fast imaging to track the changes in tissue oxygenation in response to the oxygen content in the breathing air. However, this involves dealing with gigabytes of data for each 3D oximetric imaging experiment involving digital band pass filtering and background noise subtraction, followed by 3D Fourier reconstruction. This process is rather slow in a conventional uniprocessor system. This paper presents a parallelization framework using OpenMP runtime support and parallel MATLAB to execute such computationally intensive programs. The Intel compiler is used to develop a parallel C++ code based on OpenMP. The code is executed on four Dual-Core AMD Opteron shared memory processors, to reduce the computational burden of the filtration task significantly. The results show that the parallel code for filtration has achieved a speed up factor of 46.66 as against the equivalent serial MATLAB code. In addition, a parallel MATLAB code has been developed to perform 3D Fourier reconstruction. Speedup factors of 4.57 and 4.25 have been achieved during the reconstruction process and oximetry computation, for a data set with 23 x 23 x 23 gradient steps. The execution time has been computed for both the serial and parallel implementations using different dimensions of the data and presented for comparison. The reported system has been designed to be easily accessible even from low-cost personal computers through local internet (NIHnet). The experimental results demonstrate that the parallel computing provides a source of high computational power to obtain biophysical parameters from 3D EPR oximetric imaging, almost in real-time.

  15. Exploiting Symmetry on Parallel Architectures.

    NASA Astrophysics Data System (ADS)

    Stiller, Lewis Benjamin

    1995-01-01

    This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry -exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered it a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.

  16. Methodes iteratives paralleles: Applications en neutronique et en mecanique des fluides

    NASA Astrophysics Data System (ADS)

    Qaddouri, Abdessamad

    Dans cette these, le calcul parallele est applique successivement a la neutronique et a la mecanique des fluides. Dans chacune de ces deux applications, des methodes iteratives sont utilisees pour resoudre le systeme d'equations algebriques resultant de la discretisation des equations du probleme physique. Dans le probleme de neutronique, le calcul des matrices des probabilites de collision (PC) ainsi qu'un schema iteratif multigroupe utilisant une methode inverse de puissance sont parallelises. Dans le probleme de mecanique des fluides, un code d'elements finis utilisant un algorithme iteratif du type GMRES preconditionne est parallelise. Cette these est presentee sous forme de six articles suivis d'une conclusion. Les cinq premiers articles traitent des applications en neutronique, articles qui representent l'evolution de notre travail dans ce domaine. Cette evolution passe par un calcul parallele des matrices des PC et un algorithme multigroupe parallele teste sur un probleme unidimensionnel (article 1), puis par deux algorithmes paralleles l'un mutiregion l'autre multigroupe, testes sur des problemes bidimensionnels (articles 2--3). Ces deux premieres etapes sont suivies par l'application de deux techniques d'acceleration, le rebalancement neutronique et la minimisation du residu aux deux algorithmes paralleles (article 4). Finalement, on a mis en oeuvre l'algorithme multigroupe et le calcul parallele des matrices des PC sur un code de production DRAGON ou les tests sont plus realistes et peuvent etre tridimensionnels (article 5). Le sixieme article (article 6), consacre a l'application a la mecanique des fluides, traite la parallelisation d'un code d'elements finis FES ou le partitionneur de graphe METIS et la librairie PSPARSLIB sont utilises.

  17. ANTLR Tree Grammar Generator and Extensions

    NASA Technical Reports Server (NTRS)

    Craymer, Loring

    2005-01-01

    A computer program implements two extensions of ANTLR (Another Tool for Language Recognition), which is a set of software tools for translating source codes between different computing languages. ANTLR supports predicated- LL(k) lexer and parser grammars, a notation for annotating parser grammars to direct tree construction, and predicated tree grammars. [ LL(k) signifies left-right, leftmost derivation with k tokens of look-ahead, referring to certain characteristics of a grammar.] One of the extensions is a syntax for tree transformations. The other extension is the generation of tree grammars from annotated parser or input tree grammars. These extensions can simplify the process of generating source-to-source language translators and they make possible an approach, called "polyphase parsing," to translation between computing languages. The typical approach to translator development is to identify high-level semantic constructs such as "expressions," "declarations," and "definitions" as fundamental building blocks in the grammar specification used for language recognition. The polyphase approach is to lump ambiguous syntactic constructs during parsing and then disambiguate the alternatives in subsequent tree transformation passes. Polyphase parsing is believed to be useful for generating efficient recognizers for C++ and other languages that, like C++, have significant ambiguities.

  18. An object-oriented approach to nested data parallelism

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Chatterjee, Siddhartha

    1994-01-01

    This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, then there is the possibility for 'nested data parallelism.' Few current programming languages support nested data parallelism however. In an object-oriented framework, a collection is a single object. Its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.

  19. A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

    1992-01-01

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.

  20. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  1. Line-drawing algorithms for parallel machines

    NASA Technical Reports Server (NTRS)

    Pang, Alex T.

    1990-01-01

    The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.

  2. MHD code using multi graphical processing units: SMAUG+

    NASA Astrophysics Data System (ADS)

    Gyenge, N.; Griffiths, M. K.; Erdélyi, R.

    2018-01-01

    This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems, using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths et al., 2015). Here, we demonstrate the validity of the SMAUG + code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations are distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: 1000 × 1000, 2044 × 2044, 4000 × 4000 and 8000 × 8000 . We also tested the code with the Brio-Wu shock tube simulations with model size of 800 employing up to 10 GPUs. Based on the test results, we observed speed ups and slow downs, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size could be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.

  3. Parallel computation of multigroup reactivity coefficient using iterative method

    NASA Astrophysics Data System (ADS)

    Susmikanti, Mike; Dewayatna, Winter

    2013-09-01

    One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.

  4. The BLAZE language: A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.

  5. The BLAZE language - A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  6. The dynamics of strangling among forest trees.

    PubMed

    Okamoto, Kenichi W

    2015-11-07

    Strangler trees germinate and grow on other trees, eventually enveloping and potentially even girdling their hosts. This allows them to mitigate fitness costs otherwise incurred by germinating and competing with other trees on the forest floor, as well as minimize risks associated with host tree-fall. If stranglers can themselves host other strangler trees, they may not even seem to need non-stranglers to persist. Yet despite their high fitness potential, strangler trees neither dominate the communities in which they occur nor is the strategy particularly common outside of figs (genus Ficus). Here we analyze how dynamic interactions between strangling and non-strangling trees can shape the adaptive landscape for strangling mutants and mutant trees that have lost the ability to strangle. We find a threshold which strangler germination rates must exceed for selection to favor the evolution of strangling, regardless of how effectively hemiepiphytic stranglers may subsequently replace their hosts. This condition describes the magnitude of the phenotypic displacement in the ability to germinate on other trees necessary for invasion by a mutant tree that could potentially strangle its host following establishment as an epiphyte. We show how the relative abilities of strangling and non-strangling trees to occupy empty sites can govern whether strangling is an evolutionarily stable strategy, and obtain the conditions for strangler coexistence with non-stranglers. We then elucidate when the evolution of strangling can disrupt stable coexistence between commensal epiphytic ancestors and their non-strangling host trees. This allows us to highlight parallels between the invasion fitness of strangler trees arising from commensalist ancestors, and cases where strangling can arise in concert with the evolution of hemiepiphytism among free-standing ancestors. Finally, we discuss how our results can inform the evolutionary ecology of antagonistic interactions more generally

  7. PFLOTRAN: Reactive Flow & Transport Code for Use on Laptops to Leadership-Class Supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hammond, Glenn E.; Lichtner, Peter C.; Lu, Chuan

    PFLOTRAN, a next-generation reactive flow and transport code for modeling subsurface processes, has been designed from the ground up to run efficiently on machines ranging from leadership-class supercomputers to laptops. Based on an object-oriented design, the code is easily extensible to incorporate additional processes. It can interface seamlessly with Fortran 9X, C and C++ codes. Domain decomposition parallelism is employed, with the PETSc parallel framework used to manage parallel solvers, data structures and communication. Features of the code include a modular input file, implementation of high-performance I/O using parallel HDF5, ability to perform multiple realization simulations with multiple processors permore » realization in a seamless manner, and multiple modes for multiphase flow and multicomponent geochemical transport. Chemical reactions currently implemented in the code include homogeneous aqueous complexing reactions and heterogeneous mineral precipitation/dissolution, ion exchange, surface complexation and a multirate kinetic sorption model. PFLOTRAN has demonstrated petascale performance using 2{sup 17} processor cores with over 2 billion degrees of freedom. Accomplishments achieved to date include applications to the Hanford 300 Area and modeling CO{sub 2} sequestration in deep geologic formations.« less

  8. Using CLIPS in the domain of knowledge-based massively parallel programming

    NASA Technical Reports Server (NTRS)

    Dvorak, Jiri J.

    1994-01-01

    The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency in respect of parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering are discussed.

  9. Xyce Parallel Electronic Simulator Users' Guide Version 6.7.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one tomore » develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright c 2002-2017 Sandia Corporation. All rights reserved. Trademarks Xyce TM Electronic Simulator and Xyce TM are trademarks of Sandia Corporation. Orcad, Orcad Capture, PSpice and Probe are registered trademarks of Cadence Design Systems, Inc. Microsoft, Windows and Windows 7 are registered trademarks of Microsoft Corporation. Medici, DaVinci and Taurus are registered trademarks of Synopsys Corporation. Amtec and TecPlot are trademarks

  10. The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project

    NASA Technical Reports Server (NTRS)

    Woo, Alex C.; Hill, Kueichien C.

    1996-01-01

    The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Program Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.

  11. The Fault Tree Compiler (FTC): Program and mathematics

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Martensen, Anna L.

    1989-01-01

    The Fault Tree Compiler Program is a new reliability tool used to predict the top-event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, AND m OF n gates. The high-level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precisely (within the limits of double precision floating point arithmetic) within a user specified number of digits accuracy. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Equipment Corporation (DEC) VAX computer with the VMS operation system.

  12. Hierarchical Parallelization of Gene Differential Association Analysis

    PubMed Central

    2011-01-01

    Background Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. Conclusions The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels. PMID:21936916

  13. Hierarchical parallelization of gene differential association analysis.

    PubMed

    Needham, Mark; Hu, Rui; Dwarkadas, Sandhya; Qiu, Xing

    2011-09-21

    Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm. The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.

  14. The Refinement-Tree Partition for Parallel Solution of Partial Differential Equations.

    PubMed

    Mitchell, William F

    1998-01-01

    Dynamic load balancing is considered in the context of adaptive multilevel methods for partial differential equations on distributed memory multiprocessors. An approach that periodically repartitions the grid is taken. The important properties of a partitioning algorithm are presented and discussed in this context. A partitioning algorithm based on the refinement tree of the adaptive grid is presented and analyzed in terms of these properties. Theoretical and numerical results are given.

  15. Turbo Trellis Coded Modulation With Iterative Decoding for Mobile Satellite Communications

    NASA Technical Reports Server (NTRS)

    Divsalar, D.; Pollara, F.

    1997-01-01

    In this paper, analytical bounds on the performance of parallel concatenation of two codes, known as turbo codes, and serial concatenation of two codes over fading channels are obtained. Based on this analysis, design criteria for the selection of component trellis codes for MPSK modulation, and a suitable bit-by-bit iterative decoding structure are proposed. Examples are given for throughput of 2 bits/sec/Hz with 8PSK modulation. The parallel concatenation example uses two rate 4/5 8-state convolutional codes with two interleavers. The convolutional codes' outputs are then mapped to two 8PSK modulations. The serial concatenated code example uses an 8-state outer code with rate 4/5 and a 4-state inner trellis code with 5 inputs and 2 x 8PSK outputs per trellis branch. Based on the above mentioned design criteria for fading channels, a method to obtain he structure of the trellis code with maximum diversity is proposed. Simulation results are given for AWGN and an independent Rayleigh fading channel with perfect Channel State Information (CSI).

  16. LDRD final report on massively-parallel linear programming : the parPCx system.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar

    2005-02-01

    This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runsmore » on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Interger and Combinatorial Optimizer

  17. A Multiple Sphere T-Matrix Fortran Code for Use on Parallel Computer Clusters

    NASA Technical Reports Server (NTRS)

    Mackowski, D. W.; Mishchenko, M. I.

    2011-01-01

    A general-purpose Fortran-90 code for calculation of the electromagnetic scattering and absorption properties of multiple sphere clusters is described. The code can calculate the efficiency factors and scattering matrix elements of the cluster for either fixed or random orientation with respect to the incident beam and for plane wave or localized- approximation Gaussian incident fields. In addition, the code can calculate maps of the electric field both interior and exterior to the spheres.The code is written with message passing interface instructions to enable the use on distributed memory compute clusters, and for such platforms the code can make feasible the calculation of absorption, scattering, and general EM characteristics of systems containing several thousand spheres.

  18. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    NASA Astrophysics Data System (ADS)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL

  19. Exploiting Vector and Multicore Parallelsim for Recursive, Data- and Task-Parallel Programs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal

    Modern hardware contains parallel execution resources that are well-suited for data-parallelism-vector units-and task parallelism-multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task- blocks on vector units or multicores. We show that thesemore » schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel pro- grams into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.« less

  20. Scaling Optimization of the SIESTA MHD Code

    NASA Astrophysics Data System (ADS)

    Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan

    2013-10-01

    SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.

  1. Highly parallel sparse Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.

  2. Multispectral code excited linear prediction coding and its application in magnetic resonance images.

    PubMed

    Hu, J H; Wang, Y; Cahill, P T

    1997-01-01

    This paper reports a multispectral code excited linear prediction (MCELP) method for the compression of multispectral images. Different linear prediction models and adaptation schemes have been compared. The method that uses a forward adaptive autoregressive (AR) model has been proven to achieve a good compromise between performance, complexity, and robustness. This approach is referred to as the MFCELP method. Given a set of multispectral images, the linear predictive coefficients are updated over nonoverlapping three-dimensional (3-D) macroblocks. Each macroblock is further divided into several 3-D micro-blocks, and the best excitation signal for each microblock is determined through an analysis-by-synthesis procedure. The MFCELP method has been applied to multispectral magnetic resonance (MR) images. To satisfy the high quality requirement for medical images, the error between the original image set and the synthesized one is further specified using a vector quantizer. This method has been applied to images from 26 clinical MR neuro studies (20 slices/study, three spectral bands/slice, 256x256 pixels/band, 12 b/pixel). The MFCELP method provides a significant visual improvement over the discrete cosine transform (DCT) based Joint Photographers Expert Group (JPEG) method, the wavelet transform based embedded zero-tree wavelet (EZW) coding method, and the vector tree (VT) coding method, as well as the multispectral segmented autoregressive moving average (MSARMA) method we developed previously.

  3. Parallel computation of transverse wakes in linear colliders

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhan, Xiaowei; Ko, Kwok

    1996-11-01

    SLAC has proposed the detuned structure (DS) as one possible design to control the emittance growth of long bunch trains due to transverse wakefields in the Next Linear Collider (NLC). The DS consists of 206 cells with tapering from cell to cell of the order of few microns to provide Gaussian detuning of the dipole modes. The decoherence of these modes leads to two orders of magnitude reduction in wakefield experienced by the trailing bunch. To model such a large heterogeneous structure realistically is impractical with finite-difference codes using structured grids. The authors have calculated the wakefield in the DSmore » on a parallel computer with a finite-element code using an unstructured grid. The parallel implementation issues are presented along with simulation results that include contributions from higher dipole bands and wall dissipation.« less

  4. The Refinement-Tree Partition for Parallel Solution of Partial Differential Equations

    PubMed Central

    Mitchell, William F.

    1998-01-01

    Dynamic load balancing is considered in the context of adaptive multilevel methods for partial differential equations on distributed memory multiprocessors. An approach that periodically repartitions the grid is taken. The important properties of a partitioning algorithm are presented and discussed in this context. A partitioning algorithm based on the refinement tree of the adaptive grid is presented and analyzed in terms of these properties. Theoretical and numerical results are given. PMID:28009355

  5. A parallel adaptive mesh refinement algorithm

    NASA Technical Reports Server (NTRS)

    Quirk, James J.; Hanebutte, Ulf R.

    1993-01-01

    Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.

  6. MHD Code Optimizations and Jets in Dense Gaseous Halos

    NASA Astrophysics Data System (ADS)

    Gaibler, Volker; Vigelius, Matthias; Krause, Martin; Camenzind, Max

    We have further optimized and extended the 3D-MHD-code NIRVANA. The magnetized part runs in parallel, reaching 19 Gflops per SX-6 node, and has a passively advected particle population. In addition, the code is MPI-parallel now - on top of the shared memory parallelization. On a 512^3 grid, we reach 561 Gflops with 32 nodes on the SX-8. Also, we have successfully used FLASH on the Opteron cluster. Scientific results are preliminary so far. We report one computation of highly resolved cocoon turbulence. While we find some similarities to earlier 2D work by us and others, we note a strange reluctancy of cold material to enter the low density cocoon, which has to be investigated further.

  7. A Structured Light Sensor System for Tree Inventory

    NASA Technical Reports Server (NTRS)

    Chien, Chiun-Hong; Zemek, Michael C.

    2000-01-01

    Tree Inventory is referred to measurement and estimation of marketable wood volume in a piece of land or forest for purposes such as investment or for loan applications. Exist techniques rely on trained surveyor conducting measurements manually using simple optical or mechanical devices, and hence are time consuming subjective and error prone. The advance of computer vision techniques makes it possible to conduct automatic measurements that are more efficient, objective and reliable. This paper describes 3D measurements of tree diameters using a uniquely designed ensemble of two line laser emitters rigidly mounted on a video camera. The proposed laser camera system relies on a fixed distance between two parallel laser planes and projections of laser lines to calculate tree diameters. Performance of the laser camera system is further enhanced by fusion of information induced from structured lighting and that contained in video images. Comparison will be made between the laser camera sensor system and a stereo vision system previously developed for measurements of tree diameters.

  8. Code generator for implementing dual tree complex wavelet transform on reconfigurable architectures for mobile applications

    PubMed Central

    Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H. Fatih; Goren, Sezer

    2016-01-01

    The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed. PMID:27733925

  9. Code generator for implementing dual tree complex wavelet transform on reconfigurable architectures for mobile applications.

    PubMed

    Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H Fatih; Goren, Sezer; Aydin, Nizamettin

    2016-09-01

    The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed.

  10. Portable parallel portfolio optimization in the Aurora Financial Management System

    NASA Astrophysics Data System (ADS)

    Laure, Erwin; Moritsch, Hans

    2001-07-01

    Financial planning problems are formulated as large scale, stochastic, multiperiod, tree structured optimization problems. An efficient technique for solving this kind of problems is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we elected the programming language Java for our implementation and used a high level Java based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset liability management, the Aurora Financial Management System.

  11. Parallel tiled Nussinov RNA folding loop nest generated using both dependence graph transitive closure and loop skewing.

    PubMed

    Palkowski, Marek; Bielecki, Wlodzimierz

    2017-06-02

    RNA secondary structure prediction is a compute intensive task that lies at the core of several search algorithms in bioinformatics. Fortunately, the RNA folding approaches, such as the Nussinov base pair maximization, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. Polyhedral compilation techniques have proven to be a powerful tool for optimization of dense array codes. However, classical affine loop nest transformations used with these techniques do not optimize effectively codes of dynamic programming of RNA structure predictions. The purpose of this paper is to present a novel approach allowing for generation of a parallel tiled Nussinov RNA loop nest exposing significantly higher performance than that of known related code. This effect is achieved due to improving code locality and calculation parallelization. In order to improve code locality, we apply our previously published technique of automatic loop nest tiling to all the three loops of the Nussinov loop nest. This approach first forms original rectangular 3D tiles and then corrects them to establish their validity by means of applying the transitive closure of a dependence graph. To produce parallel code, we apply the loop skewing technique to a tiled Nussinov loop nest. The technique is implemented as a part of the publicly available polyhedral source-to-source TRACO compiler. Generated code was run on modern Intel multi-core processors and coprocessors. We present the speed-up factor of generated Nussinov RNA parallel code and demonstrate that it is considerably faster than related codes in which only the two outer loops of the Nussinov loop nest are tiled.

  12. Optimization of Particle-in-Cell Codes on RISC Processors

    NASA Technical Reports Server (NTRS)

    Decyk, Viktor K.; Karmesin, Steve Roy; Boer, Aeint de; Liewer, Paulette C.

    1996-01-01

    General strategies are developed to optimize particle-cell-codes written in Fortran for RISC processors which are commonly used on massively parallel computers. These strategies include data reorganization to improve cache utilization and code reorganization to improve efficiency of arithmetic pipelines.

  13. GIZMO: Multi-method magneto-hydrodynamics+gravity code

    NASA Astrophysics Data System (ADS)

    Hopkins, Philip F.

    2014-10-01

    GIZMO is a flexible, multi-method magneto-hydrodynamics+gravity code that solves the hydrodynamic equations using a variety of different methods. It introduces new Lagrangian Godunov-type methods that allow solving the fluid equations with a moving particle distribution that is automatically adaptive in resolution and avoids the advection errors, angular momentum conservation errors, and excessive diffusion problems that seriously limit the applicability of “adaptive mesh” (AMR) codes, while simultaneously avoiding the low-order errors inherent to simpler methods like smoothed-particle hydrodynamics (SPH). GIZMO also allows the use of SPH either in “traditional” form or “modern” (more accurate) forms, or use of a mesh. Self-gravity is solved quickly with a BH-Tree (optionally a hybrid PM-Tree for periodic boundaries) and on-the-fly adaptive gravitational softenings. The code is descended from P-GADGET, itself descended from GADGET-2 (ascl:0003.001), and many of the naming conventions remain (for the sake of compatibility with the large library of GADGET work and analysis software).

  14. Parallel Computation of the Regional Ocean Modeling System (ROMS)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang, P; Song, Y T; Chao, Y

    2005-04-05

    The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds ofmore » processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.« less

  15. Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++

    NASA Technical Reports Server (NTRS)

    Krishnan, Sanjeev; Bhandarkar, Milind; Kale, Laxmikant V.

    1996-01-01

    This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object-orientation provided by the Charm++ parallel object-oriented language, to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally we conclude with an evaluation of the methodology used.

  16. Rooted tRNAomes and evolution of the genetic code

    PubMed Central

    Pak, Daewoo; Du, Nan; Kim, Yunsoo; Sun, Yanni

    2018-01-01

    ABSTRACT We advocate for a tRNA- rather than an mRNA-centric model for evolution of the genetic code. The mechanism for evolution of cloverleaf tRNA provides a root sequence for radiation of tRNAs and suggests a simplified understanding of code evolution. To analyze code sectoring, rooted tRNAomes were compared for several archaeal and one bacterial species. Rooting of tRNAome trees reveals conserved structures, indicating how the code was shaped during evolution and suggesting a model for evolution of a LUCA tRNAome tree. We propose the polyglycine hypothesis that the initial product of the genetic code may have been short chain polyglycine to stabilize protocells. In order to describe how anticodons were allotted in evolution, the sectoring-degeneracy hypothesis is proposed. Based on sectoring, a simple stepwise model is developed, in which the code sectors from a 1→4→8→∼16 letter code. At initial stages of code evolution, we posit strong positive selection for wobble base ambiguity, supporting convergence to 4-codon sectors and ∼16 letters. In a later stage, ∼5–6 letters, including stops, were added through innovating at the anticodon wobble position. In archaea and bacteria, tRNA wobble adenine is negatively selected, shrinking the maximum size of the primordial genetic code to 48 anticodons. Because 64 codons are recognized in mRNA, tRNA-mRNA coevolution requires tRNA wobble position ambiguity leading to degeneracy of the code. PMID:29372672

  17. High-performance computational fluid dynamics: a custom-code approach

    NASA Astrophysics Data System (ADS)

    Fannon, James; Loiseau, Jean-Christophe; Valluri, Prashant; Bethune, Iain; Náraigh, Lennon Ó.

    2016-07-01

    We introduce a modified and simplified version of the pre-existing fully parallelized three-dimensional Navier-Stokes flow solver known as TPLS. We demonstrate how the simplified version can be used as a pedagogical tool for the study of computational fluid dynamics (CFDs) and parallel computing. TPLS is at its heart a two-phase flow solver, and uses calls to a range of external libraries to accelerate its performance. However, in the present context we narrow the focus of the study to basic hydrodynamics and parallel computing techniques, and the code is therefore simplified and modified to simulate pressure-driven single-phase flow in a channel, using only relatively simple Fortran 90 code with MPI parallelization, but no calls to any other external libraries. The modified code is analysed in order to both validate its accuracy and investigate its scalability up to 1000 CPU cores. Simulations are performed for several benchmark cases in pressure-driven channel flow, including a turbulent simulation, wherein the turbulence is incorporated via the large-eddy simulation technique. The work may be of use to advanced undergraduate and graduate students as an introductory study in CFDs, while also providing insight for those interested in more general aspects of high-performance computing.

  18. DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dr. Dale M. Snider

    2011-02-28

    This report gives the result from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (General-Purpose Graphics Processing Unit or Graphics Processing Unit). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with increased number of particles giving greater than 12x speedup. Phase-1 work provided a path for reformatting data structure modifications to give good parallel performance while keeping a friendlymore » environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the ground work for the complete parallelization of Barracuda in Phase-2, with the caveat that implemented computer practices for parallel programming done in Phase-1 gives immediate speedup in the current Barracuda serial running code. The Phase-1 tasks were completed successfully laying the frame work for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1 and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant

  19. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liao, C; Quinlan, D J; Willcock, J J

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructuremore » which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-base computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.« less

  20. Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Jost, Gabriele

    2004-01-01

    In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms and discuss OpenMP implementation issues which effect the performance of multi-level parallel applications.

  1. Using bark char codes to predict post-fire cambium mortality

    Treesearch

    Sharon M. Hood; Danny R. Cluck; Sheri L. Smith; Kevin C. Ryan

    2008-01-01

    Cambium injury is an important factor in post-fire tree survival. Measurements that quantify the degree of bark charring on tree stems after fire are often used as surrogates for direct cambium injury because they are relatively easy to assign and are non-destructive. However, bark char codes based on these measurements have been inadequately tested to determine how...

  2. HERCULES: A Pattern Driven Code Transformation System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kartsaklis, Christos; Hernandez, Oscar R; Hsu, Chung-Hsing

    2012-01-01

    New parallel computers are emerging, but developing efficient scientific code for them remains difficult. A scientist must manage not only the science-domain complexity but also the performance-optimization complexity. HERCULES is a code transformation system designed to help the scientist to separate the two concerns, which improves code maintenance, and facilitates performance optimization. The system combines three technologies, code patterns, transformation scripts and compiler plugins, to provide the scientist with an environment to quickly implement code transformations that suit his needs. Unlike existing code optimization tools, HERCULES is unique in its focus on user-level accessibility. In this paper we discuss themore » design, implementation and an initial evaluation of HERCULES.« less

  3. Progressive video coding for noisy channels

    NASA Astrophysics Data System (ADS)

    Kim, Beong-Jo; Xiong, Zixiang; Pearlman, William A.

    1998-10-01

    We extend the work of Sherwood and Zeger to progressive video coding for noisy channels. By utilizing a 3D extension of the set partitioning in hierarchical trees (SPIHT) algorithm, we cascade the resulting 3D SPIHT video coder with a rate-compatible punctured convolutional channel coder for transmission of video over a binary symmetric channel. Progressive coding is achieved by increasing the target rate of the 3D embedded SPIHT video coder as the channel condition improves. The performance of our proposed coding system is acceptable at low transmission rate and bad channel conditions. Its low complexity makes it suitable for emerging applications such as video over wireless channels.

  4. tropiTree: An NGS-Based EST-SSR Resource for 24 Tropical Tree Species

    PubMed Central

    Russell, Joanne R.; Hedley, Peter E.; Cardle, Linda; Dancey, Siobhan; Morris, Jenny; Booth, Allan; Odee, David; Mwaura, Lucy; Omondi, William; Angaine, Peter; Machua, Joseph; Muchugi, Alice; Milne, Iain; Kindt, Roeland; Jamnadass, Ramni; Dawson, Ian K.

    2014-01-01

    The development of genetic tools for non-model organisms has been hampered by cost, but advances in next-generation sequencing (NGS) have created new opportunities. In ecological research, this raises the prospect for developing molecular markers to simultaneously study important genetic processes such as gene flow in multiple non-model plant species within complex natural and anthropogenic landscapes. Here, we report the use of bar-coded multiplexed paired-end Illumina NGS for the de novo development of expressed sequence tag-derived simple sequence repeat (EST-SSR) markers at low cost for a range of 24 tree species. Each chosen tree species is important in complex tropical agroforestry systems where little is currently known about many genetic processes. An average of more than 5,000 EST-SSRs was identified for each of the 24 sequenced species, whereas prior to analysis 20 of the species had fewer than 100 nucleotide sequence citations. To make results available to potential users in a suitable format, we have developed an open-access, interactive online database, tropiTree (http://bioinf.hutton.ac.uk/tropiTree), which has a range of visualisation and search facilities, and which is a model for the efficient presentation and application of NGS data. PMID:25025376

  5. On the relationship between parallel computation and graph embedding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gupta, A.K.

    1989-01-01

    The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less

  6. Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

    NASA Technical Reports Server (NTRS)

    Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

    2012-01-01

    In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.

  7. Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units

    NASA Astrophysics Data System (ADS)

    Corral, Roque; Gisbert, Fernando; Pueblas, Jesus

    2017-02-01

    The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.

  8. An object-oriented approach for parallel self adaptive mesh refinement on block structured grids

    NASA Technical Reports Server (NTRS)

    Lemke, Max; Witsch, Kristian; Quinlan, Daniel

    1993-01-01

    Self-adaptive mesh refinement dynamically matches the computational demands of a solver for partial differential equations to the activity in the application's domain. In this paper we present two C++ class libraries, P++ and AMR++, which significantly simplify the development of sophisticated adaptive mesh refinement codes on (massively) parallel distributed memory architectures. The development is based on our previous research in this area. The C++ class libraries provide abstractions to separate the issues of developing parallel adaptive mesh refinement applications into those of parallelism, abstracted by P++, and adaptive mesh refinement, abstracted by AMR++. P++ is a parallel array class library to permit efficient development of architecture independent codes for structured grid applications, and AMR++ provides support for self-adaptive mesh refinement on block-structured grids of rectangular non-overlapping blocks. Using these libraries, the application programmers' work is greatly simplified to primarily specifying the serial single grid application and obtaining the parallel and self-adaptive mesh refinement code with minimal effort. Initial results for simple singular perturbation problems solved by self-adaptive multilevel techniques (FAC, AFAC), being implemented on the basis of prototypes of the P++/AMR++ environment, are presented. Singular perturbation problems frequently arise in large applications, e.g. in the area of computational fluid dynamics. They usually have solutions with layers which require adaptive mesh refinement and fast basic solvers in order to be resolved efficiently.

  9. A tool for simulating parallel branch-and-bound methods

    NASA Astrophysics Data System (ADS)

    Golubeva, Yana; Orlov, Yury; Posypkin, Mikhail

    2016-01-01

    The Branch-and-Bound method is known as one of the most powerful but very resource consuming global optimization methods. Parallel and distributed computing can efficiently cope with this issue. The major difficulty in parallel B&B method is the need for dynamic load redistribution. Therefore design and study of load balancing algorithms is a separate and very important research topic. This paper presents a tool for simulating parallel Branchand-Bound method. The simulator allows one to run load balancing algorithms with various numbers of processors, sizes of the search tree, the characteristics of the supercomputer's interconnect thereby fostering deep study of load distribution strategies. The process of resolution of the optimization problem by B&B method is replaced by a stochastic branching process. Data exchanges are modeled using the concept of logical time. The user friendly graphical interface to the simulator provides efficient visualization and convenient performance analysis.

  10. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGES

    O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; ...

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. Wemore » have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.« less

  11. A new conformal absorbing boundary condition for finite element meshes and parallelization of FEMATS

    NASA Technical Reports Server (NTRS)

    Chatterjee, A.; Volakis, J. L.; Nguyen, J.; Nurnberger, M.; Ross, D.

    1993-01-01

    Some of the progress toward the development and parallelization of an improved version of the finite element code FEMATS is described. This is a finite element code for computing the scattering by arbitrarily shaped three dimensional surfaces composite scatterers. The following tasks were worked on during the report period: (1) new absorbing boundary conditions (ABC's) for truncating the finite element mesh; (2) mixed mesh termination schemes; (3) hierarchical elements and multigridding; (4) parallelization; and (5) various modeling enhancements (antenna feeds, anisotropy, and higher order GIBC).

  12. Modulation and coding for satellite and space communications

    NASA Technical Reports Server (NTRS)

    Yuen, Joseph H.; Simon, Marvin K.; Pollara, Fabrizio; Divsalar, Dariush; Miller, Warner H.; Morakis, James C.; Ryan, Carl R.

    1990-01-01

    Several modulation and coding advances supported by NASA are summarized. To support long-constraint-length convolutional code, a VLSI maximum-likelihood decoder, utilizing parallel processing techniques, which is being developed to decode convolutional codes of constraint length 15 and a code rate as low as 1/6 is discussed. A VLSI high-speed 8-b Reed-Solomon decoder which is being developed for advanced tracking and data relay satellite (ATDRS) applications is discussed. A 300-Mb/s modem with continuous phase modulation (CPM) and codings which is being developed for ATDRS is discussed. Trellis-coded modulation (TCM) techniques are discussed for satellite-based mobile communication applications.

  13. MIST: An Open Source Environmental Modelling Programming Language Incorporating Easy to Use Data Parallelism.

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2014-05-01

    Model Integration System (MIST) is open-source environmental modelling programming language that directly incorporates data parallelism. The language is designed to enable straightforward programming structures, such as nested loops and conditional statements to be directly translated into sequences of whole-array (or more generally whole data-structure) operations. MIST thus enables the programmer to use well-understood constructs, directly relating to the mathematical structure of the model, without having to explicitly vectorize code or worry about details of parallelization. A range of common modelling operations are supported by dedicated language structures operating on cell neighbourhoods rather than individual cells (e.g.: the 3x3 local neighbourhood needed to implement an averaging image filter can be simply accessed from within a simple loop traversing all image pixels). This facility hides details of inter-process communication behind more mathematically relevant descriptions of model dynamics. The MIST automatic vectorization/parallelization process serves both to distribute work among available nodes and separately to control storage requirements for intermediate expressions - enabling operations on very large domains for which memory availability may be an issue. MIST is designed to facilitate efficient interpreter based implementations. A prototype open source interpreter is available, coded in standard FORTRAN 95, with tools to rapidly integrate existing FORTRAN 77 or 95 code libraries. The language is formally specified and thus not limited to FORTRAN implementation or to an interpreter-based approach. A MIST to FORTRAN compiler is under development and volunteers are sought to create an ANSI-C implementation. Parallel processing is currently implemented using OpenMP. However, parallelization code is fully modularised and could be replaced with implementations using other libraries. GPU implementation is potentially possible.

  14. Research in Computational Aeroscience Applications Implemented on Advanced Parallel Computing Systems

    NASA Technical Reports Server (NTRS)

    Wigton, Larry

    1996-01-01

    Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.

  15. Autumn Algorithm-Computation of Hybridization Networks for Realistic Phylogenetic Trees.

    PubMed

    Huson, Daniel H; Linz, Simone

    2018-01-01

    A minimum hybridization network is a rooted phylogenetic network that displays two given rooted phylogenetic trees using a minimum number of reticulations. Previous mathematical work on their calculation has usually assumed the input trees to be bifurcating, correctly rooted, or that they both contain the same taxa. These assumptions do not hold in biological studies and "realistic" trees have multifurcations, are difficult to root, and rarely contain the same taxa. We present a new algorithm for computing minimum hybridization networks for a given pair of "realistic" rooted phylogenetic trees. We also describe how the algorithm might be used to improve the rooting of the input trees. We introduce the concept of "autumn trees", a nice framework for the formulation of algorithms based on the mathematics of "maximum acyclic agreement forests". While the main computational problem is hard, the run-time depends mainly on how different the given input trees are. In biological studies, where the trees are reasonably similar, our parallel implementation performs well in practice. The algorithm is available in our open source program Dendroscope 3, providing a platform for biologists to explore rooted phylogenetic networks. We demonstrate the utility of the algorithm using several previously studied data sets.

  16. GANDALF - Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids

    NASA Astrophysics Data System (ADS)

    Hubber, D. A.; Rosotti, G. P.; Booth, R. A.

    2018-01-01

    GANDALF is a new hydrodynamics and N-body dynamics code designed for investigating planet formation, star formation and star cluster problems. GANDALF is written in C++, parallelized with both OPENMP and MPI and contains a PYTHON library for analysis and visualization. The code has been written with a fully object-oriented approach to easily allow user-defined implementations of physics modules or other algorithms. The code currently contains implementations of smoothed particle hydrodynamics, meshless finite-volume and collisional N-body schemes, but can easily be adapted to include additional particle schemes. We present in this paper the details of its implementation, results from the test suite, serial and parallel performance results and discuss the planned future development. The code is freely available as an open source project on the code-hosting website github at https://github.com/gandalfcode/gandalf and is available under the GPLv2 license.

  17. Utilization of Patch/Triangular Target Description Data in BRL Parallel Ray Vulnerability Assessment Codes

    DTIC Science & Technology

    1979-09-01

    KEY WORDS (Continue on revmrem elde It necmmemry and Identity by block number) Target Descriptions GIFT Code C0MGE0M Descriptions FASTGEN Code...which accepts the COMGEOM target description and 1 2 produces the shotline data is the GIFT ’ code. The GIFT code evolved 3 4 from and has...the COMGEOM/ GIFT methodology, while the Navy and Air Force use the PATCH/SHOTGEN-FASTGEN methodology. Lawrence W. Bain, Mathew J. Heisinger

  18. Reconstruction of coded aperture images

    NASA Technical Reports Server (NTRS)

    Bielefeld, Michael J.; Yin, Lo I.

    1987-01-01

    Balanced correlation method and the Maximum Entropy Method (MEM) were implemented to reconstruct a laboratory X-ray source as imaged by a Uniformly Redundant Array (URA) system. Although the MEM method has advantages over the balanced correlation method, it is computationally time consuming because of the iterative nature of its solution. Massively Parallel Processing, with its parallel array structure is ideally suited for such computations. These preliminary results indicate that it is possible to use the MEM method in future coded-aperture experiments with the help of the MPP.

  19. Parallel Numerical Simulations of Water Reservoirs

    NASA Astrophysics Data System (ADS)

    Torres, Pedro; Mangiavacchi, Norberto

    2010-11-01

    The study of the water flow and scalar transport in water reservoirs is important for the determination of the water quality during the initial stages of the reservoir filling and during the life of the reservoir. For this scope, a parallel 2D finite element code for solving the incompressible Navier-Stokes equations coupled with scalar transport was implemented using the message-passing programming model, in order to perform simulations of hidropower water reservoirs in a computer cluster environment. The spatial discretization is based on the MINI element that satisfies the Babuska-Brezzi (BB) condition, which provides sufficient conditions for a stable mixed formulation. All the distributed data structures needed in the different stages of the code, such as preprocessing, solving and post processing, were implemented using the PETSc library. The resulting linear systems for the velocity and the pressure fields were solved using the projection method, implemented by an approximate block LU factorization. In order to increase the parallel performance in the solution of the linear systems, we employ the static condensation method for solving the intermediate velocity at vertex and centroid nodes separately. We compare performance results of the static condensation method with the approach of solving the complete system. In our tests the static condensation method shows better performance for large problems, at the cost of an increased memory usage. Performance results for other intensive parts of the code in a computer cluster are also presented.

  20. Improving parallel I/O autotuning with performance modeling

    DOE PAGES

    Behzad, Babak; Byna, Surendra; Wild, Stefan M.; ...

    2014-01-01

    Various layers of the parallel I/O subsystem offer tunable parameters for improving I/O performance on large-scale computers. However, searching through a large parameter space is challenging. We are working towards an autotuning framework for determining the parallel I/O parameters that can achieve good I/O performance for different data write patterns. In this paper, we characterize parallel I/O and discuss the development of predictive models for use in effectively reducing the parameter space. Furthermore, applying our technique on tuning an I/O kernel derived from a large-scale simulation code shows that the search time can be reduced from 12 hours to 2more » hours, while achieving 54X I/O performance speedup.« less

  1. Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems

    DTIC Science & Technology

    1993-01-01

    The particles represent the so-called " Dark Matter " eral times the the full message latency for every remote which is believed to dominate the mass...Katz, L. Herquist, and D. H. Weinberg. Galax- R. Acad. Sci. Paris, 306(I):739-742, 1988. ies and gas in a cold dark matter universe. Ap. J., [111 A...57(3):566-569, 1983. dark matter halos. Ap. J., 378:496, 1991. [30] G. Pedrizzetti. Insight into singular vortex flows. [17] C. C. Dyer and P. S. S

  2. Design of neurophysiologically motivated structures of time-pulse coded neurons

    NASA Astrophysics Data System (ADS)

    Krasilenko, Vladimir G.; Nikolsky, Alexander I.; Lazarev, Alexander A.; Lobodzinska, Raisa F.

    2009-04-01

    The common methodology of biologically motivated concept of building of processing sensors systems with parallel input and picture operands processing and time-pulse coding are described in paper. Advantages of such coding for creation of parallel programmed 2D-array structures for the next generation digital computers which require untraditional numerical systems for processing of analog, digital, hybrid and neuro-fuzzy operands are shown. The optoelectronic time-pulse coded intelligent neural elements (OETPCINE) simulation results and implementation results of a wide set of neuro-fuzzy logic operations are considered. The simulation results confirm engineering advantages, intellectuality, circuit flexibility of OETPCINE for creation of advanced 2D-structures. The developed equivalentor-nonequivalentor neural element has power consumption of 10mW and processing time about 10...100us.

  3. Parallel software support for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1987-01-01

    The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.

  4. A practical approach to portability and performance problems on massively parallel supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Beazley, D.M.; Lomdahl, P.S.

    1994-12-08

    We present an overview of the tactics we have used to achieve a high-level of performance while improving portability for a large-scale molecular dynamics code SPaSM. SPaSM was originally implemented in ANSI C with message passing for the Connection Machine 5 (CM-5). In 1993, SPaSM was selected as one of the winners in the IEEE Gordon Bell Prize competition for sustaining 50 Gflops on the 1024 node CM-5 at Los Alamos National Laboratory. Achieving this performance on the CM-5 required rewriting critical sections of code in CDPEAC assembler language. In addition, the code made extensive use of CM-5 parallel I/Omore » and the CMMD message passing library. Given this highly specialized implementation, we describe how we have ported the code to the Cray T3D and high performance workstations. In addition we will describe how it has been possible to do this using a single version of source code that runs on all three platforms without sacrificing any performance. Sound too good to be true? We hope to demonstrate that one can realize both code performance and portability without relying on the latest and greatest prepackaged tool or parallelizing compiler.« less

  5. Some Problems and Solutions in Transferring Ecosystem Simulation Codes to Supercomputers

    NASA Technical Reports Server (NTRS)

    Skiles, J. W.; Schulbach, C. H.

    1994-01-01

    Many computer codes for the simulation of ecological systems have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Recent recognition of ecosystem science as a High Performance Computing and Communications Program Grand Challenge area emphasizes supercomputers (both parallel and distributed systems) as the next set of tools for ecological simulation. Transferring ecosystem simulation codes to such systems is not a matter of simply compiling and executing existing code on the supercomputer since there are significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers. To more appropriately match the application to the architecture (necessary to achieve reasonable performance), the parallelism (if it exists) of the original application must be exploited. We discuss our work in transferring a general grassland simulation model (developed on a VAX in the FORTRAN computer programming language) to a Cray Y-MP. We show the Cray shared-memory vector-architecture, and discuss our rationale for selecting the Cray. We describe porting the model to the Cray and executing and verifying a baseline version, and we discuss the changes we made to exploit the parallelism in the application and to improve code execution. As a result, the Cray executed the model 30 times faster than the VAX 11/785 and 10 times faster than a Sun 4 workstation. We achieved an additional speed-up of approximately 30 percent over the original Cray run by using the compiler's vectorizing capabilities and the machine's ability to put subroutines and functions "in-line" in the code. With the modifications, the code still runs at only about 5% of the Cray's peak speed because it makes ineffective use of the vector processing capabilities of the Cray. We conclude with a discussion and future plans.

  6. Computing NLTE Opacities -- Node Level Parallel Calculation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holladay, Daniel

    Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.

  7. Generating performance portable geoscientific simulation code with Firedrake (Invited)

    NASA Astrophysics Data System (ADS)

    Ham, D. A.; Bercea, G.; Cotter, C. J.; Kelly, P. H.; Loriant, N.; Luporini, F.; McRae, A. T.; Mitchell, L.; Rathgeber, F.

    2013-12-01

    This presentation will demonstrate how a change in simulation programming paradigm can be exploited to deliver sophisticated simulation capability which is far easier to programme than are conventional models, is capable of exploiting different emerging parallel hardware, and is tailored to the specific needs of geoscientific simulation. Geoscientific simulation represents a grand challenge computational task: many of the largest computers in the world are tasked with this field, and the requirements of resolution and complexity of scientists in this field are far from being sated. However, single thread performance has stalled, even sometimes decreased, over the last decade, and has been replaced by ever more parallel systems: both as conventional multicore CPUs and in the emerging world of accelerators. At the same time, the needs of scientists to couple ever-more complex dynamics and parametrisations into their models makes the model development task vastly more complex. The conventional approach of writing code in low level languages such as Fortran or C/C++ and then hand-coding parallelism for different platforms by adding library calls and directives forces the intermingling of the numerical code with its implementation. This results in an almost impossible set of skill requirements for developers, who must simultaneously be domain science experts, numericists, software engineers and parallelisation specialists. Even more critically, it requires code to be essentially rewritten for each emerging hardware platform. Since new platforms are emerging constantly, and since code owners do not usually control the procurement of the supercomputers on which they must run, this represents an unsustainable development load. The Firedrake system, conversely, offers the developer the opportunity to write PDE discretisations in the high-level mathematical language UFL from the FEniCS project (http://fenicsproject.org). Non-PDE model components, such as parametrisations

  8. NAS Parallel Benchmark. Results 11-96: Performance Comparison of HPF and MPI Based NAS Parallel Benchmarks. 1.0

    NASA Technical Reports Server (NTRS)

    Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HALF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR) Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/ programming model (HPF and MPI (message passing interface)) combinations will be compared, based on latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF based NPB results will be compared with MPI based NPB results to provide perspective on performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition we would also present NPB (Version 1.0) performance results for

  9. Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

    NASA Astrophysics Data System (ADS)

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.

    2012-06-01

    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  10. Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project.

    PubMed

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L

    2012-06-13

    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  11. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  12. Multipath search coding of stationary signals with applications to speech

    NASA Astrophysics Data System (ADS)

    Fehn, H. G.; Noll, P.

    1982-04-01

    This paper deals with the application of multipath search coding (MSC) concepts to the coding of stationary memoryless and correlated sources, and of speech signals, at a rate of one bit per sample. Use is made of three MSC classes: (1) codebook coding, or vector quantization, (2) tree coding, and (3) trellis coding. This paper explains the performances of these coders and compares them both with those of conventional coders and with rate-distortion bounds. The potentials of MSC coding strategies are demonstrated by illustrations. The paper reports also on results of MSC coding of speech, where both the strategy of adaptive quantization and of adaptive prediction were included in coder design.

  13. Classification and Compression of Multi-Resolution Vectors: A Tree Structured Vector Quantizer Approach

    DTIC Science & Technology

    2002-01-01

    their expression profile and for classification of cells into tumerous and non- tumerous classes. Then we will present a parallel tree method for... cancerous cells. We will use the same dataset and use tree structured classifiers with multi-resolution analysis for classifying cancerous from non- cancerous ...cells. We have the expressions of 4096 genes from 98 different cell types. Of these 98, 72 are cancerous while 26 are non- cancerous . We are interested

  14. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    NASA Technical Reports Server (NTRS)

    Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2017-01-01

    In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.

  15. Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

    NASA Technical Reports Server (NTRS)

    Ajmani, Kumud; Taylor, Arthur C., III

    1994-01-01

    This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.

  16. Thread concept for automatic task parallelization in image analysis

    NASA Astrophysics Data System (ADS)

    Lueckenhaus, Maximilian; Eckstein, Wolfgang

    1998-09-01

    Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when changing the hardware. Therefore it is highly desirable to do the parallelization automatically. For this we have developed a special kind of thread concept for image analysis tasks. Threads derivated from one subtask may share objects and run in the same context but may process different threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as basis of an automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads for generating and processing parallel programs by taking into account the available hardware. The tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable to speed up image processing by an automatic parallelization of image analysis tasks.

  17. Branson: A Mini-App for Studying Parallel IMC, Version 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Long, Alex

    This code solves the gray thermal radiative transfer (TRT) equations in parallel using simple opacities and Cartesian meshes. Although Branson solves the TRT equations it is not designed to model radiation transport: Branson contains simple physics and does not have a multigroup treatment, nor can it use physical material data. The opacities have are simple polynomials in temperature there is a limited ability to specify complex geometries and sources. Branson was designed only to capture the computational demands of production IMC codes, especially in large parallel runs. It was also intended to foster collaboration with vendors, universities and other DOEmore » partners. Branson is similar in character to the neutron transport proxy-app Quicksilver from LLNL, which was recently open-sourced.« less

  18. Parallel implementation of an adaptive and parameter-free N-body integrator

    NASA Astrophysics Data System (ADS)

    Pruett, C. David; Ingham, William H.; Herman, Ralph D.

    2011-05-01

    Previously, Pruett et al. (2003) [3] described an N-body integrator of arbitrarily high order M with an asymptotic operation count of O(MN). The algorithm's structure lends itself readily to data parallelization, which we document and demonstrate here in the integration of point-mass systems subject to Newtonian gravitation. High order is shown to benefit parallel efficiency. The resulting N-body integrator is robust, parameter-free, highly accurate, and adaptive in both time-step and order. Moreover, it exhibits linear speedup on distributed parallel processors, provided that each processor is assigned at least a handful of bodies. Program summaryProgram title: PNB.f90 Catalogue identifier: AEIK_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIK_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3052 No. of bytes in distributed program, including test data, etc.: 68 600 Distribution format: tar.gz Programming language: Fortran 90 and OpenMPI Computer: All shared or distributed memory parallel processors Operating system: Unix/Linux Has the code been vectorized or parallelized?: The code has been parallelized but has not been explicitly vectorized. RAM: Dependent upon N Classification: 4.3, 4.12, 6.5 Nature of problem: High accuracy numerical evaluation of trajectories of N point masses each subject to Newtonian gravitation. Solution method: Parallel and adaptive extrapolation in time via power series of arbitrary degree. Running time: 5.1 s for the demo program supplied with the package.

  19. Plasma Physics Calculations on a Parallel Macintosh Cluster

    NASA Astrophysics Data System (ADS)

    Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter

    2000-03-01

    We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.

  20. Plasma Physics Calculations on a Parallel Macintosh Cluster

    NASA Astrophysics Data System (ADS)

    Decyk, Viktor K.; Dauger, Dean E.; Kokelaar, Pieter R.

    We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 Mflops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.

  1. High-performance Chinese multiclass traffic sign detection via coarse-to-fine cascade and parallel support vector machine detectors

    NASA Astrophysics Data System (ADS)

    Chang, Faliang; Liu, Chunsheng

    2017-09-01

    The high variability of sign colors and shapes in uncontrolled environments has made the detection of traffic signs a challenging problem in computer vision. We propose a traffic sign detection (TSD) method based on coarse-to-fine cascade and parallel support vector machine (SVM) detectors to detect Chinese warning and danger traffic signs. First, a region of interest (ROI) extraction method is proposed to extract ROIs using color contrast features in local regions. The ROI extraction can reduce scanning regions and save detection time. For multiclass TSD, we propose a structure that combines a coarse-to-fine cascaded tree with a parallel structure of histogram of oriented gradients (HOG) + SVM detectors. The cascaded tree is designed to detect different types of traffic signs in a coarse-to-fine process. The parallel HOG + SVM detectors are designed to do fine detection of different types of traffic signs. The experiments demonstrate the proposed TSD method can rapidly detect multiclass traffic signs with different colors and shapes in high accuracy.

  2. TU-AB-BRC-12: Optimized Parallel MonteCarlo Dose Calculations for Secondary MU Checks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    French, S; Nazareth, D; Bellor, M

    Purpose: Secondary MU checks are an important tool used during a physics review of a treatment plan. Commercial software packages offer varying degrees of theoretical dose calculation accuracy, depending on the modality involved. Dose calculations of VMAT plans are especially prone to error due to the large approximations involved. Monte Carlo (MC) methods are not commonly used due to their long run times. We investigated two methods to increase the computational efficiency of MC dose simulations with the BEAMnrc code. Distributed computing resources, along with optimized code compilation, will allow for accurate and efficient VMAT dose calculations. Methods: The BEAMnrcmore » package was installed on a high performance computing cluster accessible to our clinic. MATLAB and PYTHON scripts were developed to convert a clinical VMAT DICOM plan into BEAMnrc input files. The BEAMnrc installation was optimized by running the VMAT simulations through profiling tools which indicated the behavior of the constituent routines in the code, e.g. the bremsstrahlung splitting routine, and the specified random number generator. This information aided in determining the most efficient compiling parallel configuration for the specific CPU’s available on our cluster, resulting in the fastest VMAT simulation times. Our method was evaluated with calculations involving 10{sup 8} – 10{sup 9} particle histories which are sufficient to verify patient dose using VMAT. Results: Parallelization allowed the calculation of patient dose on the order of 10 – 15 hours with 100 parallel jobs. Due to the compiler optimization process, further speed increases of 23% were achieved when compared with the open-source compiler BEAMnrc packages. Conclusion: Analysis of the BEAMnrc code allowed us to optimize the compiler configuration for VMAT dose calculations. In future work, the optimized MC code, in conjunction with the parallel processing capabilities of BEAMnrc, will be applied to provide

  3. [Analysis of the molecular characteristics and cloning of full-length coding sequence of interleukin-2 in tree shrews].

    PubMed

    Huang, Xiao-Yan; Li, Ming-Li; Xu, Juan; Gao, Yue-Dong; Wang, Wen-Guang; Yin, An-Guo; Li, Xiao-Fei; Sun, Xiao-Mei; Xia, Xue-Shan; Dai, Jie-Jie

    2013-04-01

    While the tree shrew (Tupaia belangeri chinensis) is an excellent animal model for studying the mechanisms of human diseases, but few studies examine interleukin-2 (IL-2), an important immune factor in disease model evaluation. In this study, a 465 bp of the full-length IL-2 cDNA encoding sequence was cloned from the RNA of tree shrew spleen lymphocytes, which were then cultivated and stimulated with ConA (concanavalin). Clustal W 2.0 was used to compare and analyze the sequence and molecular characteristics, and establish the similarity of the overall structure of IL-2 between tree shrews and other mammals. The homology of the IL-2 nucleotide sequence between tree shrews and humans was 93%, and the amino acid homology was 80%. The phylogenetic tree results, derived through the Neighbour-Joining method using MEGA5.0, indicated a close genetic relationship between tree shrews, Homo sapiens, and Macaca mulatta. The three-dimensional structure analysis showed that the surface charges in most regions of tree shrew IL-2 were similar to between tree shrews and humans; however, the N-glycosylation sites and local structures were different, which may affect antibody binding. These results provide a fundamental basis for the future study of IL-2 monoclonal antibody in tree shrews, thereby improving their utility as a model.

  4. Parallel 3D Multi-Stage Simulation of a Turbofan Engine

    NASA Technical Reports Server (NTRS)

    Turner, Mark G.; Topp, David A.

    1998-01-01

    A 3D multistage simulation of each component of a modern GE Turbofan engine has been made. An axisymmetric view of this engine is presented in the document. This includes a fan, booster rig, high pressure compressor rig, high pressure turbine rig and a low pressure turbine rig. In the near future, all components will be run in a single calculation for a solution of 49 blade rows. The simulation exploits the use of parallel computations by using two levels of parallelism. Each blade row is run in parallel and each blade row grid is decomposed into several domains and run in parallel. 20 processors are used for the 4 blade row analysis. The average passage approach developed by John Adamczyk at NASA Lewis Research Center has been further developed and parallelized. This is APNASA Version A. It is a Navier-Stokes solver using a 4-stage explicit Runge-Kutta time marching scheme with variable time steps and residual smoothing for convergence acceleration. It has an implicit K-E turbulence model which uses an ADI solver to factor the matrix. Between 50 and 100 explicit time steps are solved before a blade row body force is calculated and exchanged with the other blade rows. This outer iteration has been coined a "flip." Efforts have been made to make the solver linearly scaleable with the number of blade rows. Enough flips are run (between 50 and 200) so the solution in the entire machine is not changing. The K-E equations are generally solved every other explicit time step. One of the key requirements in the development of the parallel code was to make the parallel solution exactly (bit for bit) match the serial solution. This has helped isolate many small parallel bugs and guarantee the parallelization was done correctly. The domain decomposition is done only in the axial direction since the number of points axially is much larger than the other two directions. This code uses MPI for message passing. The parallel speed up of the solver portion (no 1/0 or body force

  5. Extending HPF for advanced data parallel applications

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

    1994-01-01

    The stated goal of High Performance Fortran (HPF) was to 'address the problems of writing data parallel programs where the distribution of data affects performance'. After examining the current version of the language we are led to the conclusion that HPF has not fully achieved this goal. While the basic distribution functions offered by the language - regular block, cyclic, and block cyclic distributions - can support regular numerical algorithms, advanced applications such as particle-in-cell codes or unstructured mesh solvers cannot be expressed adequately. We believe that this is a major weakness of HPF, significantly reducing its chances of becoming accepted in the numeric community. The paper discusses the data distribution and alignment issues in detail, points out some flaws in the basic language, and outlines possible future paths of development. Furthermore, we briefly deal with the issue of task parallelism and its integration with the data parallel paradigm of HPF.

  6. A Robust and Scalable Software Library for Parallel Adaptive Refinement on Unstructured Meshes

    NASA Technical Reports Server (NTRS)

    Lou, John Z.; Norton, Charles D.; Cwik, Thomas A.

    1999-01-01

    The design and implementation of Pyramid, a software library for performing parallel adaptive mesh refinement (PAMR) on unstructured meshes, is described. This software library can be easily used in a variety of unstructured parallel computational applications, including parallel finite element, parallel finite volume, and parallel visualization applications using triangular or tetrahedral meshes. The library contains a suite of well-designed and efficiently implemented modules that perform operations in a typical PAMR process. Among these are mesh quality control during successive parallel adaptive refinement (typically guided by a local-error estimator), parallel load-balancing, and parallel mesh partitioning using the ParMeTiS partitioner. The Pyramid library is implemented in Fortran 90 with an interface to the Message-Passing Interface (MPI) library, supporting code efficiency, modularity, and portability. An EM waveguide filter application, adaptively refined using the Pyramid library, is illustrated.

  7. A parallel solver for huge dense linear systems

    NASA Astrophysics Data System (ADS)

    Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

    2011-11-01

    HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system

  8. Efficient parallel simulation of CO2 geologic sequestration insaline aquifers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Keni; Doughty, Christine; Wu, Yu-Shu

    2007-01-01

    An efficient parallel simulator for large-scale, long-termCO2 geologic sequestration in saline aquifers has been developed. Theparallel simulator is a three-dimensional, fully implicit model thatsolves large, sparse linear systems arising from discretization of thepartial differential equations for mass and energy balance in porous andfractured media. The simulator is based on the ECO2N module of the TOUGH2code and inherits all the process capabilities of the single-CPU TOUGH2code, including a comprehensive description of the thermodynamics andthermophysical properties of H2O-NaCl- CO2 mixtures, modeling singleand/or two-phase isothermal or non-isothermal flow processes, two-phasemixtures, fluid phases appearing or disappearing, as well as saltprecipitation or dissolution. The newmore » parallel simulator uses MPI forparallel implementation, the METIS software package for simulation domainpartitioning, and the iterative parallel linear solver package Aztec forsolving linear equations by multiple processors. In addition, theparallel simulator has been implemented with an efficient communicationscheme. Test examples show that a linear or super-linear speedup can beobtained on Linux clusters as well as on supercomputers. Because of thesignificant improvement in both simulation time and memory requirement,the new simulator provides a powerful tool for tackling larger scale andmore complex problems than can be solved by single-CPU codes. Ahigh-resolution simulation example is presented that models buoyantconvection, induced by a small increase in brine density caused bydissolution of CO2.« less

  9. Phonological coding during reading

    PubMed Central

    Leinenger, Mallorie

    2014-01-01

    The exact role that phonological coding (the recoding of written, orthographic information into a sound based code) plays during silent reading has been extensively studied for more than a century. Despite the large body of research surrounding the topic, varying theories as to the time course and function of this recoding still exist. The present review synthesizes this body of research, addressing the topics of time course and function in tandem. The varying theories surrounding the function of phonological coding (e.g., that phonological codes aid lexical access, that phonological codes aid comprehension and bolster short-term memory, or that phonological codes are largely epiphenomenal in skilled readers) are first outlined, and the time courses that each maps onto (e.g., that phonological codes come online early (pre-lexical) or that phonological codes come online late (post-lexical)) are discussed. Next the research relevant to each of these proposed functions is reviewed, discussing the varying methodologies that have been used to investigate phonological coding (e.g., response time methods, reading while eyetracking or recording EEG and MEG, concurrent articulation) and highlighting the advantages and limitations of each with respect to the study of phonological coding. In response to the view that phonological coding is largely epiphenomenal in skilled readers, research on the use of phonological codes in prelingually, profoundly deaf readers is reviewed. Finally, implications for current models of word identification (activation-verification model (Van Order, 1987), dual-route model (e.g., Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001), parallel distributed processing model (Seidenberg & McClelland, 1989)) are discussed. PMID:25150679

  10. Seedling establishment in a masting desert shrub parallels the pattern for forest trees

    Treesearch

    Susan E. Meyer; Burton K. Pendleton

    2015-01-01

    The masting phenomenon along with its accompanying suite of seedling adaptive traits has been well studied in forest trees but has rarely been examined in desert shrubs. Blackbrush (Coleogyne ramosissima) is a regionally dominant North American desert shrub whose seeds are produced in mast events and scatter-hoarded by rodents. We followed the fate of seedlings in...

  11. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

    PubMed Central

    2010-01-01

    Background Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power offered by likelihood-based approaches to large data sets. Results This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty. A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence. Conclusions Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is freely available as source code, binaries, and a web service. PMID:21034504

  12. NAS Parallel Benchmarks. 2.4

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    We describe a new problem size, called Class D, for the NAS Parallel Benchmarks (NPB), whose MPI source code implementation is being released as NPB 2.4. A brief rationale is given for how the new class is derived. We also describe the modifications made to the MPI (Message Passing Interface) implementation to allow the new class to be run on systems with 32-bit integers, and with moderate amounts of memory. Finally, we give the verification values for the new problem size.

  13. Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR

    NASA Astrophysics Data System (ADS)

    Carroll-Nellenback, Jonathan J.; Shroyer, Brandon; Frank, Adam; Ding, Chen

    2013-03-01

    Current adaptive mesh refinement (AMR) simulations require algorithms that are highly parallelized and manage memory efficiently. As compute engines grow larger, AMR simulations will require algorithms that achieve new levels of efficient parallelization and memory management. We have attempted to employ new techniques to achieve both of these goals. Patch or grid based AMR often employs ghost cells to decouple the hyperbolic advances of each grid on a given refinement level. This decoupling allows each grid to be advanced independently. In AstroBEAR we utilize this independence by threading the grid advances on each level with preference going to the finer level grids. This allows for global load balancing instead of level by level load balancing and allows for greater parallelization across both physical space and AMR level. Threading of level advances can also improve performance by interleaving communication with computation, especially in deep simulations with many levels of refinement. While we see improvements of up to 30% on deep simulations run on a few cores, the speedup is typically more modest (5-20%) for larger scale simulations. To improve memory management we have employed a distributed tree algorithm that requires processors to only store and communicate local sections of the AMR tree structure with neighboring processors. Using this distributed approach we are able to get reasonable scaling efficiency (>80%) out to 12288 cores and up to 8 levels of AMR - independent of the use of threading.

  14. Concurrent error detecting codes for arithmetic processors

    NASA Technical Reports Server (NTRS)

    Lim, R. S.

    1979-01-01

    A method of concurrent error detection for arithmetic processors is described. Low-cost residue codes with check-length l and checkbase m = 2 to the l power - 1 are described for checking arithmetic operations of addition, subtraction, multiplication, division complement, shift, and rotate. Of the three number representations, the signed-magnitude representation is preferred for residue checking. Two methods of residue generation are described: the standard method of using modulo m adders and the method of using a self-testing residue tree. A simple single-bit parity-check code is described for checking the logical operations of XOR, OR, and AND, and also the arithmetic operations of complement, shift, and rotate. For checking complement, shift, and rotate, the single-bit parity-check code is simpler to implement than the residue codes.

  15. PARALLEL PERTURBATION MODEL FOR CYCLE TO CYCLE VARIABILITY PPM4CCV

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ameen, Muhsin Mohammed; Som, Sibendu

    This code consists of a Fortran 90 implementation of the parallel perturbation model to compute cyclic variability in spark ignition (SI) engines. Cycle-to-cycle variability (CCV) is known to be detrimental to SI engine operation resulting in partial burn and knock, and result in an overall reduction in the reliability of the engine. Numerical prediction of cycle-to-cycle variability (CCV) in SI engines is extremely challenging for two key reasons: (i) high-fidelity methods such as large eddy simulation (LES) are required to accurately capture the in-cylinder turbulent flow field, and (ii) CCV is experienced over long timescales and hence the simulations needmore » to be performed for hundreds of consecutive cycles. In the new technique, the strategy is to perform multiple parallel simulations, each of which encompasses 2-3 cycles, by effectively perturbing the simulation parameters such as the initial and boundary conditions. The PPM4CCV code is a pre-processing code and can be coupled with any engine CFD code. PPM4CCV was coupled with Converge CFD code and a 10-time speedup was demonstrated over the conventional multi-cycle LES in predicting the CCV for a motored engine. Recently, the model is also being applied to fired engines including port fuel injected (PFI) and direct injection spark ignition engines and the preliminary results are very encouraging.« less

  16. Series and parallel arc-fault circuit interrupter tests.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Johnson, Jay Dean; Fresquez, Armando J.; Gudgel, Bob

    2013-07-01

    While the 2011 National Electrical Codeª (NEC) only requires series arc-fault protection, some arc-fault circuit interrupter (AFCI) manufacturers are designing products to detect and mitigate both series and parallel arc-faults. Sandia National Laboratories (SNL) has extensively investigated the electrical differences of series and parallel arc-faults and has offered possible classification and mitigation solutions. As part of this effort, Sandia National Laboratories has collaborated with MidNite Solar to create and test a 24-string combiner box with an AFCI which detects, differentiates, and de-energizes series and parallel arc-faults. In the case of the MidNite AFCI prototype, series arc-faults are mitigated by openingmore » the PV strings, whereas parallel arc-faults are mitigated by shorting the array. A range of different experimental series and parallel arc-fault tests with the MidNite combiner box were performed at the Distributed Energy Technologies Laboratory (DETL) at SNL in Albuquerque, NM. In all the tests, the prototype de-energized the arc-faults in the time period required by the arc-fault circuit interrupt testing standard, UL 1699B. The experimental tests confirm series and parallel arc-faults can be successfully mitigated with a combiner box-integrated solution.« less

  17. An Expert Assistant for Computer Aided Parallelization

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Chun, Robert; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit

    2004-01-01

    The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified and suggestions for improvements are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate the successful use in full scale scientific applications.

  18. Multi-level Hierarchical Poly Tree computer architectures

    NASA Technical Reports Server (NTRS)

    Padovan, Joe; Gute, Doug

    1990-01-01

    Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.

  19. Parallelization of a blind deconvolution algorithm

    NASA Astrophysics Data System (ADS)

    Matson, Charles L.; Borelli, Kathy J.

    2006-09-01

    Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.

  20. Basal jawed vertebrate phylogeny inferred from multiple nuclear DNA-coded genes

    PubMed Central

    Kikugawa, Kanae; Katoh, Kazutaka; Kuraku, Shigehiro; Sakurai, Hiroshi; Ishida, Osamu; Iwabe, Naoyuki; Miyata, Takashi

    2004-01-01

    Background Phylogenetic analyses of jawed vertebrates based on mitochondrial sequences often result in confusing inferences which are obviously inconsistent with generally accepted trees. In particular, in a hypothesis by Rasmussen and Arnason based on mitochondrial trees, cartilaginous fishes have a terminal position in a paraphyletic cluster of bony fishes. No previous analysis based on nuclear DNA-coded genes could significantly reject the mitochondrial trees of jawed vertebrates. Results We have cloned and sequenced seven nuclear DNA-coded genes from 13 vertebrate species. These sequences, together with sequences available from databases including 13 jawed vertebrates from eight major groups (cartilaginous fishes, bichir, chondrosteans, gar, bowfin, teleost fishes, lungfishes and tetrapods) and an outgroup (a cyclostome and a lancelet), have been subjected to phylogenetic analyses based on the maximum likelihood method. Conclusion Cartilaginous fishes have been inferred to be basal to other jawed vertebrates, which is consistent with the generally accepted view. The minimum log-likelihood difference between the maximum likelihood tree and trees not supporting the basal position of cartilaginous fishes is 18.3 ± 13.1. The hypothesis by Rasmussen and Arnason has been significantly rejected with the minimum log-likelihood difference of 123 ± 23.3. Our tree has also shown that living holosteans, comprising bowfin and gar, form a monophyletic group which is the sister group to teleost fishes. This is consistent with a formerly prevalent view of vertebrate classification, although inconsistent with both of the current morphology-based and mitochondrial sequence-based trees. Furthermore, the bichir has been shown to be the basal ray-finned fish. Tetrapods and lungfish have formed a monophyletic cluster in the tree inferred from the concatenated alignment, being consistent with the currently prevalent view. It also remains possible that tetrapods are more closely

  1. Status report on the 'Merging' of the Electron-Cloud Code POSINST with the 3-D Accelerator PIC CODE WARP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vay, J.-L.; Furman, M.A.; Azevedo, A.W.

    2004-04-19

    We have integrated the electron-cloud code POSINST [1] with WARP [2]--a 3-D parallel Particle-In-Cell accelerator code developed for Heavy Ion Inertial Fusion--so that the two can interoperate. Both codes are run in the same process, communicate through a Python interpreter (already used in WARP), and share certain key arrays (so far, particle positions and velocities). Currently, POSINST provides primary and secondary sources of electrons, beam bunch kicks, a particle mover, and diagnostics. WARP provides the field solvers and diagnostics. Secondary emission routines are provided by the Tech-X package CMEE.

  2. PhyloExplorer: a web server to validate, explore and query phylogenetic trees

    PubMed Central

    Ranwez, Vincent; Clairon, Nicolas; Delsuc, Frédéric; Pourali, Saeed; Auberval, Nicolas; Diser, Sorel; Berry, Vincent

    2009-01-01

    Background Many important problems in evolutionary biology require molecular phylogenies to be reconstructed. Phylogenetic trees must then be manipulated for subsequent inclusion in publications or analyses such as supertree inference and tree comparisons. However, no tool is currently available to facilitate the management of tree collections providing, for instance: standardisation of taxon names among trees with respect to a reference taxonomy; selection of relevant subsets of trees or sub-trees according to a taxonomic query; or simply computation of descriptive statistics on the collection. Moreover, although several databases of phylogenetic trees exist, there is currently no easy way to find trees that are both relevant and complementary to a given collection of trees. Results We propose a tool to facilitate assessment and management of phylogenetic tree collections. Given an input collection of rooted trees, PhyloExplorer provides facilities for obtaining statistics describing the collection, correcting invalid taxon names, extracting taxonomically relevant parts of the collection using a dedicated query language, and identifying related trees in the TreeBASE database. Conclusion PhyloExplorer is a simple and interactive website implemented through underlying Python libraries and MySQL databases. It is available at: and the source code can be downloaded from: . PMID:19450253

  3. Massively parallel data processing for quantitative total flow imaging with optical coherence microscopy and tomography

    NASA Astrophysics Data System (ADS)

    Sylwestrzak, Marcin; Szlag, Daniel; Marchand, Paul J.; Kumar, Ashwin S.; Lasser, Theo

    2017-08-01

    We present an application of massively parallel processing of quantitative flow measurements data acquired using spectral optical coherence microscopy (SOCM). The need for massive signal processing of these particular datasets has been a major hurdle for many applications based on SOCM. In view of this difficulty, we implemented and adapted quantitative total flow estimation algorithms on graphics processing units (GPU) and achieved a 150 fold reduction in processing time when compared to a former CPU implementation. As SOCM constitutes the microscopy counterpart to spectral optical coherence tomography (SOCT), the developed processing procedure can be applied to both imaging modalities. We present the developed DLL library integrated in MATLAB (with an example) and have included the source code for adaptations and future improvements. Catalogue identifier: AFBT_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFBT_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU GPLv3 No. of lines in distributed program, including test data, etc.: 913552 No. of bytes in distributed program, including test data, etc.: 270876249 Distribution format: tar.gz Programming language: CUDA/C, MATLAB. Computer: Intel x64 CPU, GPU supporting CUDA technology. Operating system: 64-bit Windows 7 Professional. Has the code been vectorized or parallelized?: Yes, CPU code has been vectorized in MATLAB, CUDA code has been parallelized. RAM: Dependent on users parameters, typically between several gigabytes and several tens of gigabytes Classification: 6.5, 18. Nature of problem: Speed up of data processing in optical coherence microscopy Solution method: Utilization of GPU for massively parallel data processing Additional comments: Compiled DLL library with source code and documentation, example of utilization (MATLAB script with raw data) Running time: 1,8 s for one B-scan (150 × faster in comparison to the CPU

  4. Parallelization of interpolation, solar radiation and water flow simulation modules in GRASS GIS using OpenMP

    NASA Astrophysics Data System (ADS)

    Hofierka, Jaroslav; Lacko, Michal; Zubal, Stanislav

    2017-10-01

    In this paper, we describe the parallelization of three complex and computationally intensive modules of GRASS GIS using the OpenMP application programming interface for multi-core computers. These include the v.surf.rst module for spatial interpolation, the r.sun module for solar radiation modeling and the r.sim.water module for water flow simulation. We briefly describe the functionality of the modules and parallelization approaches used in the modules. Our approach includes the analysis of the module's functionality, identification of source code segments suitable for parallelization and proper application of OpenMP parallelization code to create efficient threads processing the subtasks. We document the efficiency of the solutions using the airborne laser scanning data representing land surface in the test area and derived high-resolution digital terrain model grids. We discuss the performance speed-up and parallelization efficiency depending on the number of processor threads. The study showed a substantial increase in computation speeds on a standard multi-core computer while maintaining the accuracy of results in comparison to the output from original modules. The presented parallelization approach showed the simplicity and efficiency of the parallelization of open-source GRASS GIS modules using OpenMP, leading to an increased performance of this geospatial software on standard multi-core computers.

  5. VAC: Versatile Advection Code

    NASA Astrophysics Data System (ADS)

    Tóth, Gábor; Keppens, Rony

    2012-07-01

    The Versatile Advection Code (VAC) is a freely available general hydrodynamic and magnetohydrodynamic simulation software that works in 1, 2 or 3 dimensions on Cartesian and logically Cartesian grids. VAC runs on any Unix/Linux system with a Fortran 90 (or 77) compiler and Perl interpreter. VAC can run on parallel machines using either the Message Passing Interface (MPI) library or a High Performance Fortran (HPF) compiler.

  6. Phonological coding during reading.

    PubMed

    Leinenger, Mallorie

    2014-11-01

    The exact role that phonological coding (the recoding of written, orthographic information into a sound based code) plays during silent reading has been extensively studied for more than a century. Despite the large body of research surrounding the topic, varying theories as to the time course and function of this recoding still exist. The present review synthesizes this body of research, addressing the topics of time course and function in tandem. The varying theories surrounding the function of phonological coding (e.g., that phonological codes aid lexical access, that phonological codes aid comprehension and bolster short-term memory, or that phonological codes are largely epiphenomenal in skilled readers) are first outlined, and the time courses that each maps onto (e.g., that phonological codes come online early [prelexical] or that phonological codes come online late [postlexical]) are discussed. Next the research relevant to each of these proposed functions is reviewed, discussing the varying methodologies that have been used to investigate phonological coding (e.g., response time methods, reading while eye-tracking or recording EEG and MEG, concurrent articulation) and highlighting the advantages and limitations of each with respect to the study of phonological coding. In response to the view that phonological coding is largely epiphenomenal in skilled readers, research on the use of phonological codes in prelingually, profoundly deaf readers is reviewed. Finally, implications for current models of word identification (activation-verification model, Van Orden, 1987; dual-route model, e.g., M. Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; parallel distributed processing model, Seidenberg & McClelland, 1989) are discussed. (PsycINFO Database Record (c) 2014 APA, all rights reserved).

  7. Applications of Parallel Process HiMAP for Large Scale Multidisciplinary Problems

    NASA Technical Reports Server (NTRS)

    Guruswamy, Guru P.; Potsdam, Mark; Rodriguez, David; Kwak, Dochay (Technical Monitor)

    2000-01-01

    HiMAP is a three level parallel middleware that can be interfaced to a large scale global design environment for code independent, multidisciplinary analysis using high fidelity equations. Aerospace technology needs are rapidly changing. Computational tools compatible with the requirements of national programs such as space transportation are needed. Conventional computation tools are inadequate for modern aerospace design needs. Advanced, modular computational tools are needed, such as those that incorporate the technology of massively parallel processors (MPP).

  8. GASPRNG: GPU accelerated scalable parallel random number generator library

    NASA Astrophysics Data System (ADS)

    Gao, Shuang; Peterson, Gregory D.

    2013-04-01

    Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or

  9. F100(3) parallel compressor computer code and user's manual

    NASA Technical Reports Server (NTRS)

    Mazzawy, R. S.; Fulkerson, D. A.; Haddad, D. E.; Clark, T. A.

    1978-01-01

    The Pratt & Whitney Aircraft multiple segment parallel compressor model has been modified to include the influence of variable compressor vane geometry on the sensitivity to circumferential flow distortion. Further, performance characteristics of the F100 (3) compression system have been incorporated into the model on a blade row basis. In this modified form, the distortion's circumferential location is referenced relative to the variable vane controlling sensors of the F100 (3) engine so that the proper solution can be obtained regardless of distortion orientation. This feature is particularly important for the analysis of inlet temperature distortion. Compatibility with fixed geometry compressor applications has been maintained in the model.

  10. Simulated Tree Growth across the Northern Hemisphere and the Seasonality of Climate Signals Encoded within Tree-ring Widths

    NASA Astrophysics Data System (ADS)

    Li, X.; St George, S.

    2013-12-01

    Both dendrochronological theory and regional and global networks of tree-ring width measurements indicate that trees can respond to climate variations quite differently from one location to another. To explain these geographical differences at hemispheric scale, we used a process-based model of tree-ring formation (the Vaganov-Shashkin model) to simulate tree growth at over 6000 locations across the Northern Hemisphere. We compared the seasonality and strength of climate signals in the simulated tree-ring records against parallel analysis conducted on a hemispheric network of real tree-ring observations, tested the ability of the model to reproduce behaviors that emerge from large networks of tree-ring widths and used the model outputs to explain why the network exhibits these behaviors. The simulated tree-ring records are consistent with observations with respect to the seasonality and relative strength of the encoded climate signals, and time-related changes in these climate signals can be predicted using the modeled relative growth rate due to temperature or soil moisture. The positive imprint of winter (DJF) precipitation is strongest in simulations from the American Southwest and northern Mexico as well as selected locations in the Mediterranean and central Asia. Summer (JJA) precipitation has higher positive correlations with simulations in the mid-latitudes, but some high-latitude coastal sites exhibit a negative association. The influence of summer temperature is mainly positive at high-latitude or high-altitude sites and negative in the mid-latitudes. The absolute magnitude of climate correlations are generally higher in simulations than in observations, but the pattern and geographical differences remain the same, demonstrating that the model has skill in reproducing tree-ring growth response to climate variability in the Northern Hemisphere. Because the model uses only temperature, precipitation and latitude as input and is not adjusted for species or

  11. A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG

    NASA Astrophysics Data System (ADS)

    Griffiths, M. K.; Fedun, V.; Erdélyi, R.

    2015-03-01

    Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.

  12. Open-Source Development of the Petascale Reactive Flow and Transport Code PFLOTRAN

    NASA Astrophysics Data System (ADS)

    Hammond, G. E.; Andre, B.; Bisht, G.; Johnson, T.; Karra, S.; Lichtner, P. C.; Mills, R. T.

    2013-12-01

    Open-source software development has become increasingly popular in recent years. Open-source encourages collaborative and transparent software development and promotes unlimited free redistribution of source code to the public. Open-source development is good for science as it reveals implementation details that are critical to scientific reproducibility, but generally excluded from journal publications. In addition, research funds that would have been spent on licensing fees can be redirected to code development that benefits more scientists. In 2006, the developers of PFLOTRAN open-sourced their code under the U.S. Department of Energy SciDAC-II program. Since that time, the code has gained popularity among code developers and users from around the world seeking to employ PFLOTRAN to simulate thermal, hydraulic, mechanical and biogeochemical processes in the Earth's surface/subsurface environment. PFLOTRAN is a massively-parallel subsurface reactive multiphase flow and transport simulator designed from the ground up to run efficiently on computing platforms ranging from the laptop to leadership-class supercomputers, all from a single code base. The code employs domain decomposition for parallelism and is founded upon the well-established and open-source parallel PETSc and HDF5 frameworks. PFLOTRAN leverages modern Fortran (i.e. Fortran 2003-2008) in its extensible object-oriented design. The use of this progressive, yet domain-friendly programming language has greatly facilitated collaboration in the code's software development. Over the past year, PFLOTRAN's top-level data structures were refactored as Fortran classes (i.e. extendible derived types) to improve the flexibility of the code, ease the addition of new process models, and enable coupling to external simulators. For instance, PFLOTRAN has been coupled to the parallel electrical resistivity tomography code E4D to enable hydrogeophysical inversion while the same code base can be used as a third

  13. Improved Frame Mode Selection for AMR-WB+ Based on Decision Tree

    NASA Astrophysics Data System (ADS)

    Kim, Jong Kyu; Kim, Nam Soo

    In this letter, we propose a coding mode selection method for the AMR-WB+ audio coder based on a decision tree. In order to reduce computation while maintaining good performance, decision tree classifier is adopted with the closed loop mode selection results as the target classification labels. The size of the decision tree is controlled by pruning, so the proposed method does not increase the memory requirement significantly. Through an evaluation test on a database covering both speech and music materials, the proposed method is found to achieve a much better mode selection accuracy compared with the open loop mode selection module in the AMR-WB+.

  14. Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU

    NASA Astrophysics Data System (ADS)

    Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.

    1982-06-01

    In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6,6 and 4 times faster than the highly optimized scalar version are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.

  15. Scalability and Portability of Two Parallel Implementations of ADI

    NASA Technical Reports Server (NTRS)

    Phung, Thanh; VanderWijngaart, Rob F.

    1994-01-01

    Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.

  16. An efficient decoding for low density parity check codes

    NASA Astrophysics Data System (ADS)

    Zhao, Ling; Zhang, Xiaolin; Zhu, Manjie

    2009-12-01

    Low density parity check (LDPC) codes are a class of forward-error-correction codes. They are among the best-known codes capable of achieving low bit error rates (BER) approaching Shannon's capacity limit. Recently, LDPC codes have been adopted by the European Digital Video Broadcasting (DVB-S2) standard, and have also been proposed for the emerging IEEE 802.16 fixed and mobile broadband wireless-access standard. The consultative committee for space data system (CCSDS) has also recommended using LDPC codes in the deep space communications and near-earth communications. It is obvious that LDPC codes will be widely used in wired and wireless communication, magnetic recording, optical networking, DVB, and other fields in the near future. Efficient hardware implementation of LDPC codes is of great interest since LDPC codes are being considered for a wide range of applications. This paper presents an efficient partially parallel decoder architecture suited for quasi-cyclic (QC) LDPC codes using Belief propagation algorithm for decoding. Algorithmic transformation and architectural level optimization are incorporated to reduce the critical path. First, analyze the check matrix of LDPC code, to find out the relationship between the row weight and the column weight. And then, the sharing level of the check node updating units (CNU) and the variable node updating units (VNU) are determined according to the relationship. After that, rearrange the CNU and the VNU, and divide them into several smaller parts, with the help of some assistant logic circuit, these smaller parts can be grouped into CNU during the check node update processing and grouped into VNU during the variable node update processing. These smaller parts are called node update kernel units (NKU) and the assistant logic circuit are called node update auxiliary unit (NAU). With NAUs' help, the two steps of iteration operation are completed by NKUs, which brings in great hardware resource reduction. Meanwhile

  17. A parallel simulated annealing algorithm for standard cell placement on a hypercube computer

    NASA Technical Reports Server (NTRS)

    Jones, Mark Howard

    1987-01-01

    A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.

  18. A Novel Implementation of Massively Parallel Three Dimensional Monte Carlo Radiation Transport

    NASA Astrophysics Data System (ADS)

    Robinson, P. B.; Peterson, J. D. L.

    2005-12-01

    The goal of our summer project was to implement the difference formulation for radiation transport into Cosmos++, a multidimensional, massively parallel, magneto hydrodynamics code for astrophysical applications (Peter Anninos - AX). The difference formulation is a new method for Symbolic Implicit Monte Carlo thermal transport (Brooks and Szöke - PAT). Formerly, simultaneous implementation of fully implicit Monte Carlo radiation transport in multiple dimensions on multiple processors had not been convincingly demonstrated. We found that a combination of the difference formulation and the inherent structure of Cosmos++ makes such an implementation both accurate and straightforward. We developed a "nearly nearest neighbor physics" technique to allow each processor to work independently, even with a fully implicit code. This technique coupled with the increased accuracy of an implicit Monte Carlo solution and the efficiency of parallel computing systems allows us to demonstrate the possibility of massively parallel thermal transport. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48

  19. Parallel pivoting combined with parallel reduction

    NASA Technical Reports Server (NTRS)

    Alaghband, Gita

    1987-01-01

    Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.

  20. Interaction between Japanese flowering cherry trees and some wild animals observed during physiological experiment in fields

    NASA Technical Reports Server (NTRS)

    Nakamura, Teruko

    2003-01-01

    We have studied the weeping habit of Japanese flowering cherry tree in the field of Tama Forest Science Garden, Forestry and Forest Products Research Institute at the foot of Mt. Takao. Since cherry trees at various age were the materials for our plant physiology experiments, our studies were conducted in the fields where we experienced certain difficulties. Even under such difficult environment that was rather unexpected and uncontrollable, we could obtain fruitful results on the growth of cherry tree, and found them scientifically significant, especially in terms of biological effects of gravity on earth. Moreover, a lot of interesting interactions of cherry trees with various kinds of animals were observed in parallel to the plant physiology.

  1. Quasi-Optimal Elimination Trees for 2D Grids with Singularities

    DOE PAGES

    Paszyńska, A.; Paszyński, M.; Jopek, K.; ...

    2015-01-01

    We consmore » truct quasi-optimal elimination trees for 2D finite element meshes with singularities. These trees minimize the complexity of the solution of the discrete system. The computational cost estimates of the elimination process model the execution of the multifrontal algorithms in serial and in parallel shared-memory executions. Since the meshes considered are a subspace of all possible mesh partitions, we call these minimizers quasi-optimal. We minimize the cost functionals using dynamic programming. Finding these minimizers is more computationally expensive than solving the original algebraic system. Nevertheless, from the insights provided by the analysis of the dynamic programming minima, we propose a heuristic construction of the elimination trees that has cost O N e log ⁡ N e , where N e is the number of elements in the mesh. We show that this heuristic ordering has similar computational cost to the quasi-optimal elimination trees found with dynamic programming and outperforms state-of-the-art alternatives in our numerical experiments.« less

  2. Quasi-Optimal Elimination Trees for 2D Grids with Singularities

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Paszyńska, A.; Paszyński, M.; Jopek, K.

    We consmore » truct quasi-optimal elimination trees for 2D finite element meshes with singularities. These trees minimize the complexity of the solution of the discrete system. The computational cost estimates of the elimination process model the execution of the multifrontal algorithms in serial and in parallel shared-memory executions. Since the meshes considered are a subspace of all possible mesh partitions, we call these minimizers quasi-optimal. We minimize the cost functionals using dynamic programming. Finding these minimizers is more computationally expensive than solving the original algebraic system. Nevertheless, from the insights provided by the analysis of the dynamic programming minima, we propose a heuristic construction of the elimination trees that has cost O N e log ⁡ N e , where N e is the number of elements in the mesh. We show that this heuristic ordering has similar computational cost to the quasi-optimal elimination trees found with dynamic programming and outperforms state-of-the-art alternatives in our numerical experiments.« less

  3. Decision-Tree Formulation With Order-1 Lateral Execution

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    A compact symbolic formulation enables mapping of an arbitrarily complex decision tree of a certain type into a highly computationally efficient multidimensional software object. The type of decision trees to which this formulation applies is that known in the art as the Boolean class of balanced decision trees. Parallel lateral slices of an object created by means of this formulation can be executed in constant time considerably less time than would otherwise be required. Decision trees of various forms are incorporated into almost all large software systems. A decision tree is a way of hierarchically solving a problem, proceeding through a set of true/false responses to a conclusion. By definition, a decision tree has a tree-like structure, wherein each internal node denotes a test on an attribute, each branch from an internal node represents an outcome of a test, and leaf nodes represent classes or class distributions that, in turn represent possible conclusions. The drawback of decision trees is that execution of them can be computationally expensive (and, hence, time-consuming) because each non-leaf node must be examined to determine whether to progress deeper into a tree structure or to examine an alternative. The present formulation was conceived as an efficient means of representing a decision tree and executing it in as little time as possible. The formulation involves the use of a set of symbolic algorithms to transform a decision tree into a multi-dimensional object, the rank of which equals the number of lateral non-leaf nodes. The tree can then be executed in constant time by means of an order-one table lookup. The sequence of operations performed by the algorithms is summarized as follows: 1. Determination of whether the tree under consideration can be encoded by means of this formulation. 2. Extraction of decision variables. 3. Symbolic optimization of the decision tree to minimize its form. 4. Expansion and transformation of all nested conjunctive

  4. Implementation of a Parallel Kalman Filter for Stratospheric Chemical Tracer Assimilation

    NASA Technical Reports Server (NTRS)

    Chang, Lang-Ping; Lyster, Peter M.; Menard, R.; Cohn, S. E.

    1998-01-01

    A Kalman filter for the assimilation of long-lived atmospheric chemical constituents has been developed for two-dimensional transport models on isentropic surfaces over the globe. An important attribute of the Kalman filter is that it calculates error covariances of the constituent fields using the tracer dynamics. Consequently, the current Kalman-filter assimilation is a five-dimensional problem (coordinates of two points and time), and it can only be handled on computers with large memory and high floating point speed. In this paper, an implementation of the Kalman filter for distributed-memory, message-passing parallel computers is discussed. Two approaches were studied: an operator decomposition and a covariance decomposition. The latter was found to be more scalable than the former, and it possesses the property that the dynamical model does not need to be parallelized, which is of considerable practical advantage. This code is currently used to assimilate constituent data retrieved by limb sounders on the Upper Atmosphere Research Satellite. Tests of the code examined the variance transport and observability properties. Aspects of the parallel implementation, some timing results, and a brief discussion of the physical results will be presented.

  5. An automated workflow for parallel processing of large multiview SPIM recordings.

    PubMed

    Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel

    2016-04-01

    Selective Plane Illumination Microscopy (SPIM) allows to image developing organisms in 3D at unprecedented temporal resolution over long periods of time. The resulting massive amounts of raw image data requires extensive processing interactively via dedicated graphical user interface (GUI) applications. The consecutive processing steps can be easily automated and the individual time points can be processed independently, which lends itself to trivial parallelization on a high performance computing (HPC) cluster. Here, we introduce an automated workflow for processing large multiview, multichannel, multiillumination time-lapse SPIM data on a single workstation or in parallel on a HPC cluster. The pipeline relies on snakemake to resolve dependencies among consecutive processing steps and can be easily adapted to any cluster environment for processing SPIM data in a fraction of the time required to collect it. The code is distributed free and open source under the MIT license http://opensource.org/licenses/MIT The source code can be downloaded from github: https://github.com/mpicbg-scicomp/snakemake-workflows Documentation can be found here: http://fiji.sc/Automated_workflow_for_parallel_Multiview_Reconstruction : schmied@mpi-cbg.de Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  6. An automated workflow for parallel processing of large multiview SPIM recordings

    PubMed Central

    Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel

    2016-01-01

    Summary: Selective Plane Illumination Microscopy (SPIM) allows to image developing organisms in 3D at unprecedented temporal resolution over long periods of time. The resulting massive amounts of raw image data requires extensive processing interactively via dedicated graphical user interface (GUI) applications. The consecutive processing steps can be easily automated and the individual time points can be processed independently, which lends itself to trivial parallelization on a high performance computing (HPC) cluster. Here, we introduce an automated workflow for processing large multiview, multichannel, multiillumination time-lapse SPIM data on a single workstation or in parallel on a HPC cluster. The pipeline relies on snakemake to resolve dependencies among consecutive processing steps and can be easily adapted to any cluster environment for processing SPIM data in a fraction of the time required to collect it. Availability and implementation: The code is distributed free and open source under the MIT license http://opensource.org/licenses/MIT. The source code can be downloaded from github: https://github.com/mpicbg-scicomp/snakemake-workflows. Documentation can be found here: http://fiji.sc/Automated_workflow_for_parallel_Multiview_Reconstruction. Contact: schmied@mpi-cbg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26628585

  7. CUTSETS - MINIMAL CUT SET CALCULATION FOR DIGRAPH AND FAULT TREE RELIABILITY MODELS

    NASA Technical Reports Server (NTRS)

    Iverson, D. L.

    1994-01-01

    Fault tree and digraph models are frequently used for system failure analysis. Both type of models represent a failure space view of the system using AND and OR nodes in a directed graph structure. Fault trees must have a tree structure and do not allow cycles or loops in the graph. Digraphs allow any pattern of interconnection between loops in the graphs. A common operation performed on digraph and fault tree models is the calculation of minimal cut sets. A cut set is a set of basic failures that could cause a given target failure event to occur. A minimal cut set for a target event node in a fault tree or digraph is any cut set for the node with the property that if any one of the failures in the set is removed, the occurrence of the other failures in the set will not cause the target failure event. CUTSETS will identify all the minimal cut sets for a given node. The CUTSETS package contains programs that solve for minimal cut sets of fault trees and digraphs using object-oriented programming techniques. These cut set codes can be used to solve graph models for reliability analysis and identify potential single point failures in a modeled system. The fault tree minimal cut set code reads in a fault tree model input file with each node listed in a text format. In the input file the user specifies a top node of the fault tree and a maximum cut set size to be calculated. CUTSETS will find minimal sets of basic events which would cause the failure at the output of a given fault tree gate. The program can find all the minimal cut sets of a node, or minimal cut sets up to a specified size. The algorithm performs a recursive top down parse of the fault tree, starting at the specified top node, and combines the cut sets of each child node into sets of basic event failures that would cause the failure event at the output of that gate. Minimal cut set solutions can be found for all nodes in the fault tree or just for the top node. The digraph cut set code uses the same

  8. A compositional reservoir simulator on distributed memory parallel computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rame, M.; Delshad, M.

    1995-12-31

    This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. Amore » portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.« less

  9. A 2D electrostatic PIC code for the Mark III Hypercube

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ferraro, R.D.; Liewer, P.C.; Decyk, V.K.

    We have implemented a 2D electrostastic plasma particle in cell (PIC) simulation code on the Caltech/JPL Mark IIIfp Hypercube. The code simulates plasma effects by evolving in time the trajectories of thousands to millions of charged particles subject to their self-consistent fields. Each particle`s position and velocity is advanced in time using a leap frog method for integrating Newton`s equations of motion in electric and magnetic fields. The electric field due to these moving charged particles is calculated on a spatial grid at each time by solving Poisson`s equation in Fourier space. These two tasks represent the largest part ofmore » the computation. To obtain efficient operation on a distributed memory parallel computer, we are using the General Concurrent PIC (GCPIC) algorithm previously developed for a 1D parallel PIC code.« less

  10. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  11. Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jin, Ye; Ma, Xiaosong; Liu, Qing Gary

    2015-01-01

    Parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time-and labor-intensive to create. Real applications themselves, while offering most accurate performance evaluation, are expensive to compile, port, reconfigure, and often plainly inaccessible due to security or ownership concerns. This work contributes APPRIME, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication-I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters tomore » create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPRIME benchmarks. They retain the original applications' performance characteristics, in particular the relative performance across platforms.« less

  12. Density-based parallel skin lesion border detection with webCL

    PubMed Central

    2015-01-01

    Background Dermoscopy is a highly effective and noninvasive imaging technique used in diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. Methods Fast density based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client side computing capabilities to use available hardware resources such as multi cores and GPUs. Developed WebCL-parallel density based skin lesion border detection method runs efficiently from internet browsers. Results Previous research indicates that one of the highest accuracy rates can be achieved using density based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect could be mitigated when implemented in parallel. In this study, density based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on the heterogeneous platforms (e.g. tablets, SmartPhones, multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in the clinical settings. For this, we used WebCL, an emerging technology that enables a HTML5 Web browser to execute code in parallel for heterogeneous platforms. We depicted WebCL and our parallel algorithm design. In

  13. Density-based parallel skin lesion border detection with webCL.

    PubMed

    Lemon, James; Kockara, Sinan; Halic, Tansel; Mete, Mutlu

    2015-01-01

    Dermoscopy is a highly effective and noninvasive imaging technique used in diagnosis of melanoma and other pigmented skin lesions. Many aspects of the lesion under consideration are defined in relation to the lesion border. This makes border detection one of the most important steps in dermoscopic image analysis. In current practice, dermatologists often delineate borders through a hand drawn representation based upon visual inspection. Due to the subjective nature of this technique, intra- and inter-observer variations are common. Because of this, the automated assessment of lesion borders in dermoscopic images has become an important area of study. Fast density based skin lesion border detection method has been implemented in parallel with a new parallel technology called WebCL. WebCL utilizes client side computing capabilities to use available hardware resources such as multi cores and GPUs. Developed WebCL-parallel density based skin lesion border detection method runs efficiently from internet browsers. Previous research indicates that one of the highest accuracy rates can be achieved using density based clustering techniques for skin lesion border detection. While these algorithms do have unfavorable time complexities, this effect could be mitigated when implemented in parallel. In this study, density based clustering technique for skin lesion border detection is parallelized and redesigned to run very efficiently on the heterogeneous platforms (e.g. tablets, SmartPhones, multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units) by transforming the technique into a series of independent concurrent operations. Heterogeneous computing is adopted to support accessibility, portability and multi-device use in the clinical settings. For this, we used WebCL, an emerging technology that enables a HTML5 Web browser to execute code in parallel for heterogeneous platforms. We depicted WebCL and our parallel algorithm design. In addition, we tested

  14. Testing New Programming Paradigms with NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    was applied to several benchmarks, noticeably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of supporting the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure in next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.

  15. Nearest Neighbor Searching in Binary Search Trees: Simulation of a Multiprocessor System.

    ERIC Educational Resources Information Center

    Stewart, Mark; Willett, Peter

    1987-01-01

    Describes the simulation of a nearest neighbor searching algorithm for document retrieval using a pool of microprocessors. Three techniques are described which allow parallel searching of a binary search tree as well as a PASCAL-based system, PASSIM, which can simulate these techniques. Fifty-six references are provided. (Author/LRW)

  16. Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.

    PubMed

    Ruymgaart, A Peter; Elber, Ron

    2012-11-13

    We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).

  17. MulRF: a software package for phylogenetic analysis using multi-copy gene trees.

    PubMed

    Chaudhary, Ruchi; Fernández-Baca, David; Burleigh, John Gordon

    2015-02-01

    MulRF is a platform-independent software package for phylogenetic analysis using multi-copy gene trees. It seeks the species tree that minimizes the Robinson-Foulds (RF) distance to the input trees using a generalization of the RF distance to multi-labeled trees. The underlying generic tree distance measure and fast running time make MulRF useful for inferring phylogenies from large collections of gene trees, in which multiple evolutionary processes as well as phylogenetic error may contribute to gene tree discord. MulRF implements several features for customizing the species tree search and assessing the results, and it provides a user-friendly graphical user interface (GUI) with tree visualization. The species tree search is implemented in C++ and the GUI in Java Swing. MulRF's executable as well as sample datasets and manual are available at http://genome.cs.iastate.edu/CBL/MulRF/, and the source code is available at https://github.com/ruchiherself/MulRFRepo. ruchic@ufl.edu Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  18. Two C++ Libraries for Counting Trees on a Phylogenetic Terrace.

    PubMed

    Biczok, R; Bozsoky, P; Eisenmann, P; Ernst, J; Ribizel, T; Scholz, F; Trefzer, A; Weber, F; Hamann, M; Stamatakis, A

    2018-05-08

    The presence of terraces in phylogenetic tree space, that is, a potentially large number of distinct tree topologies that have exactly the same analytical likelihood score, was first described by Sanderson et al. (2011). However, popular software tools for maximum likelihood and Bayesian phylogenetic inference do not yet routinely report, if inferred phylogenies reside on a terrace, or not. We believe, this is due to the lack of an efficient library to (i) determine if a tree resides on a terrace, (ii) calculate how many trees reside on a terrace, and (iii) enumerate all trees on a terrace. In our bioinformatics practical that is set up as a programming contest we developed two efficient and independent C++ implementations of the SUPERB algorithm by Constantinescu and Sankoff (1995) for counting and enumerating trees on a terrace. Both implementations yield exactly the same results, are more than one order of magnitude faster, and require one order of magnitude less memory than a previous 3rd party python implementation. The source codes are available under GNU GPL at https://github.com/terraphast. Alexandros.Stamatakis@h-its.org. Supplementary data are available at Bioinformatics online.

  19. Parallel Processing of Images in Mobile Devices using BOINC

    NASA Astrophysics Data System (ADS)

    Curiel, Mariela; Calle, David F.; Santamaría, Alfredo S.; Suarez, David F.; Flórez, Leonardo

    2018-04-01

    Medical image processing helps health professionals make decisions for the diagnosis and treatment of patients. Since some algorithms for processing images require substantial amounts of resources, one could take advantage of distributed or parallel computing. A mobile grid can be an adequate computing infrastructure for this problem. A mobile grid is a grid that includes mobile devices as resource providers. In a previous step of this research, we selected BOINC as the infrastructure to build our mobile grid. However, parallel processing of images in mobile devices poses at least two important challenges: the execution of standard libraries for processing images and obtaining adequate performance when compared to desktop computers grids. By the time we started our research, the use of BOINC in mobile devices also involved two issues: a) the execution of programs in mobile devices required to modify the code to insert calls to the BOINC API, and b) the division of the image among the mobile devices as well as its merging required additional code in some BOINC components. This article presents answers to these four challenges.

  20. Code Optimization Techniques

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    MAGEE,GLEN I.

    Computers transfer data in a number of different ways. Whether through a serial port, a parallel port, over a modem, over an ethernet cable, or internally from a hard disk to memory, some data will be lost. To compensate for that loss, numerous error detection and correction algorithms have been developed. One of the most common error correction codes is the Reed-Solomon code, which is a special subset of BCH (Bose-Chaudhuri-Hocquenghem) linear cyclic block codes. In the AURA project, an unmanned aircraft sends the data it collects back to earth so it can be analyzed during flight and possible flightmore » modifications made. To counter possible data corruption during transmission, the data is encoded using a multi-block Reed-Solomon implementation with a possibly shortened final block. In order to maximize the amount of data transmitted, it was necessary to reduce the computation time of a Reed-Solomon encoding to three percent of the processor's time. To achieve such a reduction, many code optimization techniques were employed. This paper outlines the steps taken to reduce the processing time of a Reed-Solomon encoding and the insight into modern optimization techniques gained from the experience.« less

  1. How to Build an AppleSeed: A Parallel Macintosh Cluster for Numerically Intensive Computing

    NASA Astrophysics Data System (ADS)

    Decyk, V. K.; Dauger, D. E.

    We have constructed a parallel cluster consisting of a mixture of Apple Macintosh G3 and G4 computers running the Mac OS, and have achieved very good performance on numerically intensive, parallel plasma particle-incell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. This enables us to move parallel computing from the realm of experts to the main stream of computing.

  2. FleCSPH - a parallel and distributed SPH implementation based on the FleCSI framework

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Junghans, Christoph; Loiseau, Julien

    2017-06-20

    FleCSPH is a multi-physics compact application that exercises FleCSI parallel data structures for tree-based particle methods. In particular, FleCSPH implements a smoothed-particle hydrodynamics (SPH) solver for the solution of Lagrangian problems in astrophysics and cosmology. FleCSPH includes support for gravitational forces using the fast multipole method (FMM).

  3. Implementation of Multivariable Logic Functions in Parallel by Electrically Addressing a Molecule of Three Dopants in Silicon.

    PubMed

    Fresch, Barbara; Bocquel, Juanita; Hiluf, Dawit; Rogge, Sven; Levine, Raphael D; Remacle, Françoise

    2017-07-05

    To realize low-power, compact logic circuits, one can explore parallel operation on single nanoscale devices. An added incentive is to use multivalued (as distinct from Boolean) logic. Here, we theoretically demonstrate that the computation of all the possible outputs of a multivariate, multivalued logic function can be implemented in parallel by electrical addressing of a molecule made up of three interacting dopant atoms embedded in Si. The electronic states of the dopant molecule are addressed by pulsing a gate voltage. By simulating the time evolution of the non stationary electronic density built by the gate voltage, we show that one can implement a molecular decision tree that provides in parallel all the outputs for all the inputs of the multivariate, multivalued logic function. The outputs are encoded in the populations and in the bond orders of the dopant molecule, which can be measured using an STM tip. We show that the implementation of the molecular logic tree is equivalent to a spectral function decomposition. The function that is evaluated can be field-programmed by changing the time profile of the pulsed gate voltage. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  4. The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell.

    PubMed

    Junier, Thomas; Zdobnov, Evgeny M

    2010-07-01

    We present a suite of Unix shell programs for processing any number of phylogenetic trees of any size. They perform frequently-used tree operations without requiring user interaction. They also allow tree drawing as scalable vector graphics (SVG), suitable for high-quality presentations and further editing, and as ASCII graphics for command-line inspection. As an example we include an implementation of bootscanning, a procedure for finding recombination breakpoints in viral genomes. C source code, Python bindings and executables for various platforms are available from http://cegg.unige.ch/newick_utils. The distribution includes a manual and example data. The package is distributed under the BSD License. thomas.junier@unige.ch

  5. [Comparative review of the Senegalese and French deontology codes].

    PubMed

    Soumah, M; Mbaye, I; Bah, H; Gaye Fall, M C; Sow, M L

    2005-01-01

    The medical deontology regroups duties of the physicians and regulate the exercise of medicine. The code of medical deontology of Senegal inspired of the French medical deontology code, has not been revised since its institution whereas the French deontology code knew three revisions. Comparing the two codes of deontology titles by title and article by article, this work beyond a parallel between the two codes puts in inscription the progress in bioethics that are to the basis of the revisions of the French medical deontology code. This article will permit an advocacy of the health professionals, in favor of a setting to level of the of Senegalese medical deontology code. Because legal litigation, that is important in the developed countries, intensify in our developing countries. It is inherent to the technological progress and to the awareness of the patients of their rights.

  6. Block-Parallel Data Analysis with DIY2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morozov, Dmitriy; Peterka, Tom

    DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial,more » parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.« less

  7. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  8. A scalable approach for tree segmentation within small-footprint airborne LiDAR data

    NASA Astrophysics Data System (ADS)

    Hamraz, Hamid; Contreras, Marco A.; Zhang, Jun

    2017-05-01

    This paper presents a distributed approach that scales up to segment tree crowns within a LiDAR point cloud representing an arbitrarily large forested area. The approach uses a single-processor tree segmentation algorithm as a building block in order to process the data delivered in the shape of tiles in parallel. The distributed processing is performed in a master-slave manner, in which the master maintains the global map of the tiles and coordinates the slaves that segment tree crowns within and across the boundaries of the tiles. A minimal bias was introduced to the number of detected trees because of trees lying across the tile boundaries, which was quantified and adjusted for. Theoretical and experimental analyses of the runtime of the approach revealed a near linear speedup. The estimated number of trees categorized by crown class and the associated error margins as well as the height distribution of the detected trees aligned well with field estimations, verifying that the distributed approach works correctly. The approach enables providing information of individual tree locations and point cloud segments for a forest-level area in a timely manner, which can be used to create detailed remotely sensed forest inventories. Although the approach was presented for tree segmentation within LiDAR point clouds, the idea can also be generalized to scale up processing other big spatial datasets.

  9. Magnetosphere simulations with a high-performance 3D AMR MHD Code

    NASA Astrophysics Data System (ADS)

    Gombosi, Tamas; Dezeeuw, Darren; Groth, Clinton; Powell, Kenneth; Song, Paul

    1998-11-01

    BATS-R-US is a high-performance 3D AMR MHD code for space physics applications running on massively parallel supercomputers. In BATS-R-US the electromagnetic and fluid equations are solved with a high-resolution upwind numerical scheme in a tightly coupled manner. The code is very robust and it is capable of spanning a wide range of plasma parameters (such as β, acoustic and Alfvénic Mach numbers). Our code is highly scalable: it achieved a sustained performance of 233 GFLOPS on a Cray T3E-1200 supercomputer with 1024 PEs. This talk reports results from the BATS-R-US code for the GGCM (Geospace General Circularculation Model) Phase 1 Standard Model Suite. This model suite contains 10 different steady-state configurations: 5 IMF clock angles (north, south, and three equally spaced angles in- between) with 2 IMF field strengths for each angle (5 nT and 10 nT). The other parameters are: solar wind speed =400 km/sec; solar wind number density = 5 protons/cc; Hall conductance = 0; Pedersen conductance = 5 S; parallel conductivity = ∞.

  10. JSPAM: A restricted three-body code for simulating interacting galaxies

    NASA Astrophysics Data System (ADS)

    Wallin, J. F.; Holincheck, A. J.; Harvey, A.

    2016-07-01

    Restricted three-body codes have a proven ability to recreate much of the disturbed morphology of actual interacting galaxies. As more sophisticated n-body models were developed and computer speed increased, restricted three-body codes fell out of favor. However, their supporting role for performing wide searches of parameter space when fitting orbits to real systems demonstrates a continuing need for their use. Here we present the model and algorithm used in the JSPAM code. A precursor of this code was originally described in 1990, and was called SPAM. We have recently updated the software with an alternate potential and a treatment of dynamical friction to more closely mimic the results from n-body tree codes. The code is released publicly for use under the terms of the Academic Free License ("AFL") v. 3.0 and has been added to the Astrophysics Source Code Library.

  11. Charon Toolkit for Parallel, Implicit Structured-Grid Computations: Functional Design

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Kutler, Paul (Technical Monitor)

    1997-01-01

    In a previous report the design concepts of Charon were presented. Charon is a toolkit that aids engineers in developing scientific programs for structured-grid applications to be run on MIMD parallel computers. It constitutes an augmentation of the general-purpose MPI-based message-passing layer, and provides the user with a hierarchy of tools for rapid prototyping and validation of parallel programs, and subsequent piecemeal performance tuning. Here we describe the implementation of the domain decomposition tools used for creating data distributions across sets of processors. We also present the hierarchy of parallelization tools that allows smooth translation of legacy code (or a serial design) into a parallel program. Along with the actual tool descriptions, we will present the considerations that led to the particular design choices. Many of these are motivated by the requirement that Charon must be useful within the traditional computational environments of Fortran 77 and C. Only the Fortran 77 syntax will be presented in this report.

  12. Homoplastic microinversions and the avian tree of life

    PubMed Central

    2011-01-01

    Background Microinversions are cytologically undetectable inversions of DNA sequences that accumulate slowly in genomes. Like many other rare genomic changes (RGCs), microinversions are thought to be virtually homoplasy-free evolutionary characters, suggesting that they may be very useful for difficult phylogenetic problems such as the avian tree of life. However, few detailed surveys of these genomic rearrangements have been conducted, making it difficult to assess this hypothesis or understand the impact of microinversions upon genome evolution. Results We surveyed non-coding sequence data from a recent avian phylogenetic study and found substantially more microinversions than expected based upon prior information about vertebrate inversion rates, although this is likely due to underestimation of these rates in previous studies. Most microinversions were lineage-specific or united well-accepted groups. However, some homoplastic microinversions were evident among the informative characters. Hemiplasy, which reflects differences between gene trees and the species tree, did not explain the observed homoplasy. Two specific loci were microinversion hotspots, with high numbers of inversions that included both the homoplastic as well as some overlapping microinversions. Neither stem-loop structures nor detectable sequence motifs were associated with microinversions in the hotspots. Conclusions Microinversions can provide valuable phylogenetic information, although power analysis indicates that large amounts of sequence data will be necessary to identify enough inversions (and similar RGCs) to resolve short branches in the tree of life. Moreover, microinversions are not perfect characters and should be interpreted with caution, just as with any other character type. Independent of their use for phylogenetic analyses, microinversions are important because they have the potential to complicate alignment of non-coding sequences. Despite their low rate of accumulation, they

  13. A portable MPI-based parallel vector template library

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.

    1995-01-01

    This paper discusses the design and implementation of a polymorphic collection library for distributed address-space parallel computers. The library provides a data-parallel programming model for C++ by providing three main components: a single generic collection class, generic algorithms over collections, and generic algebraic combining functions. Collection elements are the fourth component of a program written using the library and may be either of the built-in types of C or of user-defined types. Many ideas are borrowed from the Standard Template Library (STL) of C++, although a restricted programming model is proposed because of the distributed address-space memory model assumed. Whereas the STL provides standard collections and implementations of algorithms for uniprocessors, this paper advocates standardizing interfaces that may be customized for different parallel computers. Just as the STL attempts to increase programmer productivity through code reuse, a similar standard for parallel computers could provide programmers with a standard set of algorithms portable across many different architectures. The efficacy of this approach is verified by examining performance data collected from an initial implementation of the library running on an IBM SP-2 and an Intel Paragon.

  14. A Portable MPI-Based Parallel Vector Template Library

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.

    1995-01-01

    This paper discusses the design and implementation of a polymorphic collection library for distributed address-space parallel computers. The library provides a data-parallel programming model for C + + by providing three main components: a single generic collection class, generic algorithms over collections, and generic algebraic combining functions. Collection elements are the fourth component of a program written using the library and may be either of the built-in types of c or of user-defined types. Many ideas are borrowed from the Standard Template Library (STL) of C++, although a restricted programming model is proposed because of the distributed address-space memory model assumed. Whereas the STL provides standard collections and implementations of algorithms for uniprocessors, this paper advocates standardizing interfaces that may be customized for different parallel computers. Just as the STL attempts to increase programmer productivity through code reuse, a similar standard for parallel computers could provide programmers with a standard set of algorithms portable across many different architectures. The efficacy of this approach is verified by examining performance data collected from an initial implementation of the library running on an IBM SP-2 and an Intel Paragon.

  15. CFD code evaluation for internal flow modeling

    NASA Technical Reports Server (NTRS)

    Chung, T. J.

    1990-01-01

    Research on the computational fluid dynamics (CFD) code evaluation with emphasis on supercomputing in reacting flows is discussed. Advantages of unstructured grids, multigrids, adaptive methods, improved flow solvers, vector processing, parallel processing, and reduction of memory requirements are discussed. As examples, researchers include applications of supercomputing to reacting flow Navier-Stokes equations including shock waves and turbulence and combustion instability problems associated with solid and liquid propellants. Evaluation of codes developed by other organizations are not included. Instead, the basic criteria for accuracy and efficiency have been established, and some applications on rocket combustion have been made. Research toward an ultimate goal, the most accurate and efficient CFD code, is in progress and will continue for years to come.

  16. Automatically generated code for relativistic inhomogeneous cosmologies

    NASA Astrophysics Data System (ADS)

    Bentivegna, Eloisa

    2017-02-01

    The applications of numerical relativity to cosmology are on the rise, contributing insight into such cosmological problems as structure formation, primordial phase transitions, gravitational-wave generation, and inflation. In this paper, I present the infrastructure for the computation of inhomogeneous dust cosmologies which was used recently to measure the effect of nonlinear inhomogeneity on the cosmic expansion rate. I illustrate the code's architecture, provide evidence for its correctness in a number of familiar cosmological settings, and evaluate its parallel performance for grids of up to several billion points. The code, which is available as free software, is based on the Einstein Toolkit infrastructure, and in particular leverages the automated code generation capabilities provided by its component Kranc.

  17. The GBS code for tokamak scrape-off layer simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halpern, F.D., E-mail: federico.halpern@epfl.ch; Ricci, P.; Jolliet, S.

    2016-06-15

    We describe a new version of GBS, a 3D global, flux-driven plasma turbulence code to simulate the turbulent dynamics in the tokamak scrape-off layer (SOL), superseding the code presented by Ricci et al. (2012) [14]. The present work is driven by the objective of studying SOL turbulent dynamics in medium size tokamaks and beyond with a high-fidelity physics model. We emphasize an intertwining framework of improved physics models and the computational improvements that allow them. The model extensions include neutral atom physics, finite ion temperature, the addition of a closed field line region, and a non-Boussinesq treatment of the polarizationmore » drift. GBS has been completely refactored with the introduction of a 3-D Cartesian communicator and a scalable parallel multigrid solver. We report dramatically enhanced parallel scalability, with the possibility of treating electromagnetic fluctuations very efficiently. The method of manufactured solutions as a verification process has been carried out for this new code version, demonstrating the correct implementation of the physical model.« less

  18. Multilevel Parallelization of AutoDock 4.2.

    PubMed

    Norgan, Andrew P; Coffman, Paul K; Kocher, Jean-Pierre A; Katzmann, David J; Sosa, Carlos P

    2011-04-28

    Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4). Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers. Using MPI with OpenMP multithreading, mpAD4 scales with near linearity on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarkian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.

  19. A path-level exact parallelization strategy for sequential simulation

    NASA Astrophysics Data System (ADS)

    Peredo, Oscar F.; Baeza, Daniel; Ortiz, Julián M.; Herrero, José R.

    2018-01-01

    Sequential Simulation is a well known method in geostatistical modelling. Following the Bayesian approach for simulation of conditionally dependent random events, Sequential Indicator Simulation (SIS) method draws simulated values for K categories (categorical case) or classes defined by K different thresholds (continuous case). Similarly, Sequential Gaussian Simulation (SGS) method draws simulated values from a multivariate Gaussian field. In this work, a path-level approach to parallelize SIS and SGS methods is presented. A first stage of re-arrangement of the simulation path is performed, followed by a second stage of parallel simulation for non-conflicting nodes. A key advantage of the proposed parallelization method is to generate identical realizations as with the original non-parallelized methods. Case studies are presented using two sequential simulation codes from GSLIB: SISIM and SGSIM. Execution time and speedup results are shown for large-scale domains, with many categories and maximum kriging neighbours in each case, achieving high speedup results in the best scenarios using 16 threads of execution in a single machine.

  20. Xyce parallel electronic simulator design.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thornquist, Heidi K.; Rankin, Eric Lamont; Mei, Ting

    2010-09-01

    This document is the Xyce Circuit Simulator developer guide. Xyce has been designed from the 'ground up' to be a SPICE-compatible, distributed memory parallel circuit simulator. While it is in many respects a research code, Xyce is intended to be a production simulator. As such, having software quality engineering (SQE) procedures in place to insure a high level of code quality and robustness are essential. Version control, issue tracking customer support, C++ style guildlines and the Xyce release process are all described. The Xyce Parallel Electronic Simulator has been under development at Sandia since 1999. Historically, Xyce has mostly beenmore » funded by ASC, the original focus of Xyce development has primarily been related to circuits for nuclear weapons. However, this has not been the only focus and it is expected that the project will diversify. Like many ASC projects, Xyce is a group development effort, which involves a number of researchers, engineers, scientists, mathmaticians and computer scientists. In addition to diversity of background, it is to be expected on long term projects for there to be a certain amount of staff turnover, as people move on to different projects. As a result, it is very important that the project maintain high software quality standards. The point of this document is to formally document a number of the software quality practices followed by the Xyce team in one place. Also, it is hoped that this document will be a good source of information for new developers.« less