Parallelized tree-code for clusters of personal computers
NASA Astrophysics Data System (ADS)
Viturro, H. R.; Carpintero, D. D.
2000-02-01
We present a tree-code for integrating the equations of the motion of collisionless systems, which has been fully parallelized and adapted to run in several PC-based processors simultaneously, using the well-known PVM message passing library software. SPH algorithms, not yet included, may be easily incorporated to the code. The code is written in ANSI C; it can be freely downloaded from a public ftp site. Simulations of collisions of galaxies are presented, with which the performance of the code is tested.
FLY MPI-2: a parallel tree code for LSS
NASA Astrophysics Data System (ADS)
Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.
2006-04-01
New version program summaryProgram title: FLY 3.1 Catalogue identifier: ADSC_v2_0 Licensing provisions: yes Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0 Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland No. of lines in distributed program, including test data, etc.: 158 172 No. of bytes in distributed program, including test data, etc.: 4 719 953 Distribution format: tar.gz Programming language: Fortran 90, C Computer: Beowulf cluster, PC, MPP systems Operating system: Linux, Aix RAM: 100M words Catalogue identifier of previous version: ADSC_v1_0 Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159 Does the new version supersede the previous version?: yes Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986) Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block
A Modified Parallel Tree Code for N-Body Simulation of the Large-Scale Structure of the Universe
NASA Astrophysics Data System (ADS)
Becciani, U.; Antonuccio-Delogu, V.; Gambera, M.
2000-09-01
N-body codes for performing simulations of the origin and evolution of the large-scale structure of the universe have improved significantly over the past decade in terms of both the resolution achieved and the reduction of the CPU time. However, state-of-the-art N-body codes hardly allow one to deal with particle numbers larger than a few 107, even on the largest parallel systems. In order to allow simulations with larger resolution, we have first reconsidered the grouping strategy as described in J. Barnes (1990, J. Comput. Phys. 87, 161) (hereafter B90) and applied it with some modifications to our WDSH-PT (Work and Data SHaring-Parallel Tree) code (U. Becciani et al., 1996, Comput. Phys. Comm. 99, 1). In the first part of this paper we will give a short description of the code adopting the algorithm of J. E. Barnes and P. Hut (1986, Nature 324, 446) and in particular the memory and work distribution strategy applied to describe the data distribution on a CC-NUMA machine like the CRAY-T3E system. In very large simulations (typically N>=107), due to network contention and the formation of clusters of galaxies, an uneven load easily verifies. To remedy this, we have devised an automatic work redistribution mechanism which provided a good dynamic load balance without adding significant overhead. In the second part of the paper we describe the modification to the Barnes grouping strategy we have devised to improve the performance of the WDSH-PT code. We will use the property that nearby particles have similar interaction lists. This idea has been checked in B90, where an interaction list is built which applies everywhere within a cell Cgroup containing a small number of particles Ncrit. B90 reuses this interaction list for each particle p∈Cgroup in the cell in turn. We will assume each particle p to have the same interaction list. We consider that the agent force Fp on a particle p can be decomposed into two terms Fp=Ffar+Fnear. The first term Ffar is the same for
Efficient tree codes on SIMD computer architectures
NASA Astrophysics Data System (ADS)
Olson, Kevin M.
1996-11-01
This paper describes changes made to a previous implementation of an N -body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ~ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.
PARAVT: Parallel Voronoi tessellation code
NASA Astrophysics Data System (ADS)
González, R. E.
2016-10-01
In this study, we present a new open source code for massive parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.
National Combustion Code: Parallel Performance
NASA Technical Reports Server (NTRS)
Babrauckas, Theresa
2001-01-01
This report discusses the National Combustion Code (NCC). The NCC is an integrated system of codes for the design and analysis of combustion systems. The advanced features of the NCC meet designers' requirements for model accuracy and turn-around time. The fundamental features at the inception of the NCC were parallel processing and unstructured mesh. The design and performance of the NCC are discussed.
Optimal parallel evaluation of AND trees
NASA Technical Reports Server (NTRS)
Wah, Benjamin W.; Li, Guo-Jie
1990-01-01
A quantitative analysis based on both preemptive and nonpreemptive critical-path scheduling algorithms is presently conducted for the optimal degree of parallelism required in evaluating a given AND tree. The optimal degree of parallelism is found to depend on problem complexity, precedence-graph shape, and task-time distribution along each path. In addition to demonstrating the optimality of the preemptive critical-path scheduling algorithm for evaluating an arbitrary AND tree on a fixed number of processors, the possibility of efficiently ascertaining tight bounds on the number of processors for optimal processor-time efficiency is illustrated.
Parallel algorithms for contour extraction and coding
NASA Astrophysics Data System (ADS)
Dinstein, Its'hak; Landau, Gad M.
1990-07-01
A parallel approach to contour extraction and coding on an Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM) is presented and analyzed. The algorithm is intended for binary images. The labeled contours can be represented by lists of coordinates, and/or chain codes, and/or any other user designed codes. Using O(n2/log n) processors, the algorithm runs in O(logn) time, where n by n is the size of the processed binary image.
Code Parallelization with CAPO: A User Manual
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2001-01-01
A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.
National Combustion Code: Parallel Implementation and Performance
NASA Technical Reports Server (NTRS)
Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.
2000-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
National Combustion Code Parallel Performance Enhancements
NASA Technical Reports Server (NTRS)
Quealy, Angela; Benyo, Theresa (Technical Monitor)
2002-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.
On the parallelization of molecular dynamics codes
NASA Astrophysics Data System (ADS)
Trabado, G. P.; Plata, O.; Zapata, E. L.
2002-08-01
Molecular dynamics (MD) codes present a high degree of spatial data locality and a significant amount of independent computations. However, most of the parallelization strategies are usually based on the manual transformation of sequential programs either by completely rewriting the code with message passing routines or using specific libraries intended for writing new MD programs. In this paper we propose a new library-based approach (DDLY) which supports parallelization of existing short-range MD sequential codes. The novelty of this approach is that it can directly handle the distribution of common data structures used in MD codes to represent data (arrays, Verlet lists, link cells), using domain decomposition. Thus, the insertion of run-time support for distribution and communication in a MD program does not imply significant changes to its structure. The method is simple, efficient and portable. It may be also used to extend existing parallel programming languages, such as HPF.
GRADSPMHD: A parallel MHD code based on the SPH formalism
NASA Astrophysics Data System (ADS)
Vanaverbeke, S.; Keppens, R.; Poedts, S.
2014-03-01
We present GRADSPMHD, a completely Lagrangian parallel magnetohydrodynamics code based on the SPH formalism. The implementation of the equations of SPMHD in the “GRAD-h” formalism assembles known results, including the derivation of the discretized MHD equations from a variational principle, the inclusion of time-dependent artificial viscosity, resistivity and conductivity terms, as well as the inclusion of a mixed hyperbolic/parabolic correction scheme for satisfying the ∇ṡB→ constraint on the magnetic field. The code uses a tree-based formalism for neighbor finding and can optionally use the tree code for computing the self-gravity of the plasma. The structure of the code closely follows the framework of our parallel GRADSPH FORTRAN 90 code which we added previously to the CPC program library. We demonstrate the capabilities of GRADSPMHD by running 1, 2, and 3 dimensional standard benchmark tests and we find good agreement with previous work done by other researchers. The code is also applied to the problem of simulating the magnetorotational instability in 2.5D shearing box tests as well as in global simulations of magnetized accretion disks. We find good agreement with available results on this subject in the literature. Finally, we discuss the performance of the code on a parallel supercomputer with distributed memory architecture. Catalogue identifier: AERP_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AERP_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 620503 No. of bytes in distributed program, including test data, etc.: 19837671 Distribution format: tar.gz Programming language: FORTRAN 90/MPI. Computer: HPC cluster. Operating system: Unix. Has the code been vectorized or parallelized?: Yes, parallelized using MPI. RAM: ˜30 MB for a
Parallel CARLOS-3D code development
Putnam, J.M.; Kotulski, J.D.
1996-02-01
CARLOS-3D is a three-dimensional scattering code which was developed under the sponsorship of the Electromagnetic Code Consortium, and is currently used by over 80 aerospace companies and government agencies. The code has been extensively validated and runs on both serial workstations and parallel super computers such as the Intel Paragon. CARLOS-3D is a three-dimensional surface integral equation scattering code based on a Galerkin method of moments formulation employing Rao- Wilton-Glisson roof-top basis for triangular faceted surfaces. Fully arbitrary 3D geometries composed of multiple conducting and homogeneous bulk dielectric materials can be modeled. This presentation describes some of the extensions to the CARLOS-3D code, and how the operator structure of the code facilitated these improvements. Body of revolution (BOR) and two-dimensional geometries were incorporated by simply including new input routines, and the appropriate Galerkin matrix operator routines. Some additional modifications were required in the combined field integral equation matrix generation routine due to the symmetric nature of the BOR and 2D operators. Quadrilateral patched surfaces with linear roof-top basis functions were also implemented in the same manner. Quadrilateral facets and triangular facets can be used in combination to more efficiently model geometries with both large smooth surfaces and surfaces with fine detail such as gaps and cracks. Since the parallel implementation in CARLOS-3D is at high level, these changes were independent of the computer platform being used. This approach minimizes code maintenance, while providing capabilities with little additional effort. Results are presented showing the performance and accuracy of the code for some large scattering problems. Comparisons between triangular faceted and quadrilateral faceted geometry representations will be shown for some complex scatterers.
A parallel and modular deformable cell Car-Parrinello code
NASA Astrophysics Data System (ADS)
Cavazzoni, Carlo; Chiarotti, Guido L.
1999-12-01
We have developed a modular parallel code implementing the Car-Parrinello [Phys. Rev. Lett. 55 (1985) 2471] algorithm including the variable cell dynamics [Europhys. Lett. 36 (1994) 345; J. Phys. Chem. Solids 56 (1995) 510]. Our code is written in Fortran 90, and makes use of some new programming concepts like encapsulation, data abstraction and data hiding. The code has a multi-layer hierarchical structure with tree like dependences among modules. The modules include not only the variables but also the methods acting on them, in an object oriented fashion. The modular structure allows easier code maintenance, develop and debugging procedures, and is suitable for a developer team. The layer structure permits high portability. The code displays an almost linear speed-up in a wide range of number of processors independently of the architecture. Super-linear speed up is obtained with a "smart" Fast Fourier Transform (FFT) that uses the available memory on the single node (increasing for a fixed problem with the number of processing elements) as temporary buffer to store wave function transforms. This code has been used to simulate water and ammonia at giant planet conditions for systems as large as 64 molecules for ˜50 ps.
Portable, parallel, reusable Krylov space codes
Smith, B.; Gropp, W.
1994-12-31
Krylov space accelerators are an important component of many algorithms for the iterative solution of linear systems. Each Krylov space method has it`s own particular advantages and disadvantages, therefore it is desirable to have a variety of them available all with an identical, easy to use, interface. A common complaint application programmers have with available software libraries for the iterative solution of linear systems is that they require the programmer to use the data structures provided by the library. The library is not able to work with the data structures of the application code. Hence, application programmers find themselves constantly recoding the Krlov space algorithms. The Krylov space package (KSP) is a data-structure-neutral implementation of a variety of Krylov space methods including preconditioned conjugate gradient, GMRES, BiCG-Stab, transpose free QMR and CGS. Unlike all other software libraries for linear systems that the authors are aware of, KSP will work with any application codes data structures, in Fortran or C. Due to it`s data-structure-neutral design KSP runs unchanged on both sequential and parallel machines. KSP has been tested on workstations, the Intel i860 and Paragon, Thinking Machines CM-5 and the IBM SP1.
Adaptive Dynamic Event Tree in RAVEN code
Alfonsi, Andrea; Rabiti, Cristian; Mandelli, Diego; Cogliati, Joshua Joseph; Kinoshita, Robert Arthur
2014-11-01
RAVEN is a software tool that is focused on performing statistical analysis of stochastic dynamic systems. RAVEN has been designed in a high modular and pluggable way in order to enable easy integration of different programming languages (i.e., C++, Python) and coupling with other applications (system codes). Among the several capabilities currently present in RAVEN, there are five different sampling strategies: Monte Carlo, Latin Hyper Cube, Grid, Adaptive and Dynamic Event Tree (DET) sampling methodologies. The scope of this paper is to present a new sampling approach, currently under definition and implementation: an evolution of the DET me
Parallel object-oriented decision tree system
Kamath; Chandrika , Cantu-Paz; Erick
2006-02-28
A data mining decision tree system that uncovers patterns, associations, anomalies, and other statistically significant structures in data by reading and displaying data files, extracting relevant features for each of the objects, and using a method of recognizing patterns among the objects based upon object features through a decision tree that reads the data, sorts the data if necessary, determines the best manner to split the data into subsets according to some criterion, and splits the data.
Parafrase restructuring of FORTRAN code for parallel processing
NASA Technical Reports Server (NTRS)
Wadhwa, Atul
1988-01-01
Parafrase transforms a FORTRAN code, subroutine by subroutine, into a parallel code for a vector and/or shared-memory multiprocessor system. Parafrase is not a compiler; it transforms a code and provides information for a vector or concurrent process. Parafrase uses a data dependency to reveal parallelism among instructions. The data dependency test distinguishes between recurrences and statements that can be directly vectorized or parallelized. A number of transformations are required to build a data dependency graph.
Petascale Parallelization of the Gyrokinetic Toroidal Code
Ethier, Stephane; Adams, Mark; Carter, Jonathan; Oliker, Leonid
2010-05-01
The Gyrokinetic Toroidal Code (GTC) is a global, three-dimensional particle-in-cell application developed to study microturbulence in tokamak fusion devices. The global capability of GTC is unique, allowing researchers to systematically analyze important dynamics such as turbulence spreading. In this work we examine a new radial domain decomposition approach to allow scalability onto the latest generation of petascale systems. Extensive performance evaluation is conducted on three high performance computing systems: the IBM BG/P, the Cray XT4, and an Intel Xeon Cluster. Overall results show that the radial decomposition approach dramatically increases scalability, while reducing the memory footprint - allowing for fusion device simulations at an unprecedented scale. After a decade where high-end computing (HEC) was dominated by the rapid pace of improvements to processor frequencies, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the tradeoffs of these system designs is a key step towards making effective petascale computing a reality. In this work, we examine a new parallelization scheme for the Gyrokinetic Toroidal Code (GTC) [?] micro-turbulence fusion application. Extensive scalability results and analysis are presented on three HEC systems: the IBM BlueGene/P (BG/P) at Argonne National Laboratory, the Cray XT4 at Lawrence Berkeley National Laboratory, and an Intel Xeon cluster at Lawrence Livermore National Laboratory. Overall results indicate that the new radial decomposition approach successfully attains unprecedented scalability to 131,072 BG/P cores by overcoming the memory limitations of the previous approach. The new version is well suited to utilize emerging petascale resources to access new regimes of physical phenomena.
Parallel Tree Contraction and Its Application.
1985-12-01
observed by Uspensky [231, see 112]. These bounds are commonly known as Chernoff bounds 16J. We shall use the following simply stated bounds [3. Theorem 6...Functions in Logarithmic Parallel Time. 25th Annual Symp. on Foundations of Computer Science, IEEE, 1984, pp. 12-22. 22. J. Uspensky . Introduction to
Memory Scalability and Efficiency Analysis of Parallel Codes
Janjusic, Tommy; Kartsaklis, Christos
2015-01-01
Memory scalability is an enduring problem and bottleneck that plagues many parallel codes. Parallel codes designed for High Performance Systems are typically designed over the span of several, and in some instances 10+, years. As a result, optimization practices which were appropriate for earlier systems may no longer be valid and thus require careful optimization consideration. Specifically, parallel codes whose memory footprint is a function of their scalability must be carefully considered for future exa-scale systems. In this paper we present a methodology and tool to study the memory scalability of parallel codes. Using our methodology we evaluate an application s memory footprint as a function of scalability, which we coined memory efficiency, and describe our results. In particular, using our in-house tools we can pinpoint the specific application components which contribute to the application s overall memory foot-print (application data- structures, libraries, etc.).
Identifying failure in a tree network of a parallel computer
Archer, Charles J.; Pinnow, Kurt W.; Wallenfelt, Brian P.
2010-08-24
Methods, parallel computers, and products are provided for identifying failure in a tree network of a parallel computer. The parallel computer includes one or more processing sets including an I/O node and a plurality of compute nodes. For each processing set embodiments include selecting a set of test compute nodes, the test compute nodes being a subset of the compute nodes of the processing set; measuring the performance of the I/O node of the processing set; measuring the performance of the selected set of test compute nodes; calculating a current test value in dependence upon the measured performance of the I/O node of the processing set, the measured performance of the set of test compute nodes, and a predetermined value for I/O node performance; and comparing the current test value with a predetermined tree performance threshold. If the current test value is below the predetermined tree performance threshold, embodiments include selecting another set of test compute nodes. If the current test value is not below the predetermined tree performance threshold, embodiments include selecting from the test compute nodes one or more potential problem nodes and testing individually potential problem nodes and links to potential problem nodes.
Parallel-vector computation for CSI-design code
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Computational aspects of Control-Structure Interaction (CSI) DESIGN code is reviewed. Numerical intensive computation portions of CSI-DESIGN code were identified. Improvements in computational speed for the CSI-DESIGN code can be achieved by exploiting parallel and vector capabilities offered by modern computers, such as the Alliant, Convex, Cray-2, and Cray-YMP. Four options to generate the coefficient stiffness matrix and to solve the system of linear, simultaneous equations are currently available in the CSI-DESIGN code. A preprocessor to use RCM (Reverse Cuthill-Mackee) algorithm for bandwidth minimization was also developed for the CSI-DESIGN code. Preliminary results obtained by solving a small-scale, 97 node CSI finite element model (for eigensolution) have indicated that this new CSI-DESIGN code is 5 to 6 times faster (using 1 Alliant processor) than the old version of CSI-DESIGN code. This speed-up was achieved due to the RCM algorithm and the use of a new skyline solver. Efforts are underway to further improve the vector speed for CSI-DESIGN code, to evaluate its performance on a larger scale CSI model (such as phase zero CSI model) to make the code run efficiently on multiprocessor, parallel computer environment, and to make the code portable among different parallel computers available at NASA LaRC, such as Alliant, Convex, and Cray computers.
CALTRANS: A parallel, deterministic, 3D neutronics code
Carson, L.; Ferguson, J.; Rogers, J.
1994-04-01
Our efforts to parallelize the deterministic solution of the neutron transport equation has culminated in a new neutronics code CALTRANS, which has full 3D capability. In this article, we describe the layout and algorithms of CALTRANS and present performance measurements of the code on a variety of platforms. Explicit implementation of the parallel algorithms of CALTRANS using both the function calls of the Parallel Virtual Machine software package (PVM 3.2) and the Meiko CS-2 tagged message passing library (based on the Intel NX/2 interface) are provided in appendices.
BTREE: A FORTRAN Code for B+ Tree.
2014-09-26
such large databases. NSWC TR 85-54 REFERENCES 1. Comer , D., "The Ubiquitous B Tree," Computing Surveys, Vol. 11, 1979, pp. 121-137. 2. Knuth, D...34The Ubiquitous B Tree" by Douglas Comer , Computing Surveys, C 11(1979)121-137; a more complete discussion can be found in C "The Art of Computer
Capabilities of Fully Parallelized MHD Stability Code MARS
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2016-10-01
Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
Parallel Algorithms for Graph Optimization using Tree Decompositions
Sullivan, Blair D; Weerapurage, Dinesh P; Groer, Christopher S
2012-06-01
Although many $\\cal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes
Farreras, Montse
2014-01-01
Abstract The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. PMID:24597675
Parallel continuous flow: a parallel suffix tree construction tool for whole genomes.
Comin, Matteo; Farreras, Montse
2014-04-01
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
Parallelization of a Monte Carlo particle transport simulation code
NASA Astrophysics Data System (ADS)
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Efficient coding of wavelet trees and its applications in image coding
NASA Astrophysics Data System (ADS)
Zhu, Bin; Yang, En-hui; Tewfik, Ahmed H.; Kieffer, John C.
1996-02-01
We propose in this paper a novel lossless tree coding algorithm. The technique is a direct extension of the bisection method, the simplest case of the complexity reduction method proposed recently by Kieffer and Yang, that has been used for lossless data string coding. A reduction rule is used to obtain the irreducible representation of a tree, and this irreducible tree is entropy-coded instead of the input tree itself. This reduction is reversible, and the original tree can be fully recovered from its irreducible representation. More specifically, we search for equivalent subtrees from top to bottom. When equivalent subtrees are found, a special symbol is appended to the value of the root node of the first equivalent subtree, and the root node of the second subtree is assigned to the index which points to the first subtree, an all other nodes in the second subtrees are removed. This procedure is repeated until it cannot be reduced further. This yields the irreducible tree or irreducible representation of the original tree. The proposed method can effectively remove the redundancy in an image, and results in more efficient compression. It is proved that when the tree size approaches infinity, the proposed method offers the optimal compression performance. It is generally more efficient in practice than direct coding of the input tree. The proposed method can be directly applied to code wavelet trees in non-iterative wavelet-based image coding schemes. A modified method is also proposed for coding wavelet zerotrees in embedded zerotree wavelet (EZW) image coding. Although its coding efficiency is slightly reduced, the modified version maintains exact control of bit rate and the scalability of the bit stream in EZW coding.
Parallel Scaling Characteristics of Selected NERSC User ProjectCodes
Skinner, David; Verdier, Francesca; Anand, Harsh; Carter,Jonathan; Durst, Mark; Gerber, Richard
2005-03-05
This report documents parallel scaling characteristics of NERSC user project codes between Fiscal Year 2003 and the first half of Fiscal Year 2004 (Oct 2002-March 2004). The codes analyzed cover 60% of all the CPU hours delivered during that time frame on seaborg, a 6080 CPU IBM SP and the largest parallel computer at NERSC. The scale in terms of concurrency and problem size of the workload is analyzed. Drawing on batch queue logs, performance data and feedback from researchers we detail the motivations, benefits, and challenges of implementing highly parallel scientific codes on current NERSC High Performance Computing systems. An evaluation and outlook of the NERSC workload for Allocation Year 2005 is presented.
Parallel family trees for transfer matrices in the Potts model
NASA Astrophysics Data System (ADS)
Navarro, Cristobal A.; Canfora, Fabrizio; Hitschfeld, Nancy; Navarro, Gonzalo
2015-02-01
The computational cost of transfer matrix methods for the Potts model is related to the question in how many ways can two layers of a lattice be connected? Answering the question leads to the generation of a combinatorial set of lattice configurations. This set defines the configuration space of the problem, and the smaller it is, the faster the transfer matrix can be computed. The configuration space of generic (q , v) transfer matrix methods for strips is in the order of the Catalan numbers, which grows asymptotically as O(4m) where m is the width of the strip. Other transfer matrix methods with a smaller configuration space indeed exist but they make assumptions on the temperature, number of spin states, or restrict the structure of the lattice. In this paper we propose a parallel algorithm that uses a sub-Catalan configuration space of O(3m) to build the generic (q , v) transfer matrix in a compressed form. The improvement is achieved by grouping the original set of Catalan configurations into a forest of family trees, in such a way that the solution to the problem is now computed by solving the root node of each family. As a result, the algorithm becomes exponentially faster than the Catalan approach while still highly parallel. The resulting matrix is stored in a compressed form using O(3m ×4m) of space, making numerical evaluation and decompression to be faster than evaluating the matrix in its O(4m ×4m) uncompressed form. Experimental results for different sizes of strip lattices show that the parallel family trees (PFT) strategy indeed runs exponentially faster than the Catalan Parallel Method (CPM), especially when dealing with dense transfer matrices. In terms of parallel performance, we report strong-scaling speedups of up to 5.7 × when running on an 8-core shared memory machine and 28 × for a 32-core cluster. The best balance of speedup and efficiency for the multi-core machine was achieved when using p = 4 processors, while for the cluster
A Data Parallel Multizone Navier-Stokes Code
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon; Kwak, Dochan (Technical Monitor)
1995-01-01
We have developed a data parallel multizone compressible Navier-Stokes code on the Connection Machine CM-5. The code is set up for implicit time-stepping on single or multiple structured grids. For multiple grids and geometrically complex problems, we follow the "chimera" approach, where flow data on one zone is interpolated onto another in the region of overlap. We will describe our design philosophy and give some timing results for the current code. The design choices can be summarized as: 1. finite differences on structured grids; 2. implicit time-stepping with either distributed solves or data motion and local solves; 3. sequential stepping through multiple zones with interzone data transfer via a distributed data structure. We have implemented these ideas on the CM-5 using CMF (Connection Machine Fortran), a data parallel language which combines elements of Fortran 90 and certain extensions, and which bears a strong similarity to High Performance Fortran (HPF). One interesting feature is the issue of turbulence modeling, where the architecture of a parallel machine makes the use of an algebraic turbulence model awkward, whereas models based on transport equations are more natural. We will present some performance figures for the code on the CM-5, and consider the issues involved in transitioning the code to HPF for portability to other parallel platforms.
An Expert System for the Development of Efficient Parallel Code
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Chun, Robert; Jin, Hao-Qiang; Labarta, Jesus; Gimenez, Judit
2004-01-01
We have built the prototype of an expert system to assist the user in the development of efficient parallel code. The system was integrated into the parallel programming environment that is currently being developed at NASA Ames. The expert system interfaces to tools for automatic parallelization and performance analysis. It uses static program structure information and performance data in order to automatically determine causes of poor performance and to make suggestions for improvements. In this paper we give an overview of our programming environment, describe the prototype implementation of our expert system, and demonstrate its usefulness with several case studies.
Parallelizing the MARS15 Code with MPI for shielding applications
Mikhail A. Kostin and Nikolai V. Mokhov
2004-05-12
The MARS15 Monte Carlo code capabilities to deal with time-consuming deep penetration shielding problems and other computationally tough tasks in accelerator, detector and shielding applications, have been enhanced by a parallel processing option. It has been developed, implemented and tested on the Fermilab Accelerator Division Linux cluster and network of Sun workstations. The code uses MPI. It is scalable and demonstrates good performance. The general architecture of the code, specific uses of message passing, and effects of a scheduling on the performance and fault tolerance are described.
New Parallel computing framework for radiation transport codes
Kostin, M.A.; Mokhov, N.V.; Niita, K.; /JAERI, Tokai
2010-09-01
A new parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was integrated with the MARS15 code, and an effort is under way to deploy it in PHITS. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. Several checkpoint files can be merged into one thus combining results of several calculations. The framework also corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation
NASA Astrophysics Data System (ADS)
Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru
1999-09-01
We have developed a new parallel tree method which will be called the forest method hereafter. This new method uses the sectional Voronoi tessellation (SVT) for the domain decomposition. The SVT decomposes a whole space into polyhedra and allows their flat borders to move by assigning different weights. The forest method determines these weights based on the load balancing among processors by means of the overload diffusion (OLD). Moreover, since all the borders are flat, before receiving the data from other processors, each processor can collect enough data to calculate the gravity force with precision. Both the SVT and the OLD are coded in a highly vectorizable manner to accommodate on vector parallel processors. The parallel code based on the forest method with the Message Passing Interface is run on various platforms so that a wide portability is guaranteed. Extensive calculations with 15 processors of Fujitsu VPP300/16R indicate that the code can calculate the gravity force exerted on 105 particles in each second for some ideal dark halo. This code is found to enable an N-body simulation with 107 or more particles for a wide dynamic range and is therefore a very powerful tool for the study of galaxy formation and large-scale structure in the universe.
Advances in Parallel Electromagnetic Codes for Accelerator Science and Development
Ko, Kwok; Candel, Arno; Ge, Lixin; Kabel, Andreas; Lee, Rich; Li, Zenghai; Ng, Cho; Rawat, Vineet; Schussman, Greg; Xiao, Liling; /SLAC
2011-02-07
Over a decade of concerted effort in code development for accelerator applications has resulted in a new set of electromagnetic codes which are based on higher-order finite elements for superior geometry fidelity and better solution accuracy. SLAC's ACE3P code suite is designed to harness the power of massively parallel computers to tackle large complex problems with the increased memory and solve them at greater speed. The US DOE supports the computational science R&D under the SciDAC project to improve the scalability of ACE3P, and provides the high performance computing resources needed for the applications. This paper summarizes the advances in the ACE3P set of codes, explains the capabilities of the modules, and presents results from selected applications covering a range of problems in accelerator science and development important to the Office of Science.
Boltzmann Transport Code Update: Parallelization and Integrated Design Updates
NASA Technical Reports Server (NTRS)
Heinbockel, J. H.; Nealy, J. E.; DeAngelis, G.; Feldman, G. A.; Chokshi, S.
2003-01-01
The on going efforts at developing a web site for radiation analysis is expected to result in an increased usage of the High Charge and Energy Transport Code HZETRN. It would be nice to be able to do the requested calculations quickly and efficiently. Therefore the question arose, "Could the implementation of parallel processing speed up the calculations required?" To answer this question two modifications of the HZETRN computer code were created. The first modification selected the shield material of Al(2219) , then polyethylene and then Al(2219). The modified Fortran code was labeled 1SSTRN.F. The second modification considered the shield material of CO2 and Martian regolith. This modified Fortran code was labeled MARSTRN.F.
Advances in Parallelization for Large Scale Oct-Tree Mesh Generation
NASA Technical Reports Server (NTRS)
O'Connell, Matthew; Karman, Steve L.
2015-01-01
Despite great advancements in the parallelization of numerical simulation codes over the last 20 years, it is still common to perform grid generation in serial. Generating large scale grids in serial often requires using special "grid generation" compute machines that can have more than ten times the memory of average machines. While some parallel mesh generation techniques have been proposed, generating very large meshes for LES or aeroacoustic simulations is still a challenging problem. An automated method for the parallel generation of very large scale off-body hierarchical meshes is presented here. This work enables large scale parallel generation of off-body meshes by using a novel combination of parallel grid generation techniques and a hybrid "top down" and "bottom up" oct-tree method. Meshes are generated using hardware commonly found in parallel compute clusters. The capability to generate very large meshes is demonstrated by the generation of off-body meshes surrounding complex aerospace geometries. Results are shown including a one billion cell mesh generated around a Predator Unmanned Aerial Vehicle geometry, which was generated on 64 processors in under 45 minutes.
Parallelization of KENO-Va Monte Carlo code
NASA Astrophysics Data System (ADS)
Ramón, Javier; Peña, Jorge
1995-07-01
KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code: one for shared memory machines and other for distributed memory systems using the message-passing interface PVM have been generated. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. A FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing
NASA Technical Reports Server (NTRS)
Ozguner, Fusun
1996-01-01
Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure.
Parallelization of the Legendre Transform for a Geodynamics Code
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.; Heien, E. M.
2014-12-01
Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of Earth. The code has been shown to scale well on computer clusters capable of computing at the order of millions of core hours. Depending on the resolution and time requirements, simulations may require weeks to years of clock time for specific target problems. A significant portion of the code execution time is spent transforming computed quantities between physical values and spherical harmonic coefficients, equivalent to a series of linear algebra operations. Intermixing C and Fortran code has opened the door to the parallel computing platform, Cuda and its associated libraries. We successfully implemented the parallelization of the scaling of the Legendre polynomials by both Schmidt Normalization coefficients, and a set of weighting coefficients; however, the expected speedup was not realized. Specifically, the original version of Calypso 1.1 computes the Legendre transform approximately four seconds faster than the Cuda-enabled modified version. By profiling the code, we determined that the time taken to transfer the data from host memory to GPU memory does not compare to the number of computations happening within the GPU. Nevertheless, by utilizing techniques such as memory coalescing, cached memory, pinned memory, dynamic parallelism, asynchronous calls, and overlapped memory transfers with computations, the likelihood of a speedup increases. Moreover, ideally the generation of the Legendre polynomial coefficients, Schmidt Normalization Coefficients, and the set of weights should not only be parallelized but be computed on-the-fly within the GPU. The end result is that we reduce the number of memory transfers from host to GPU, increase the number of parallelized computations on the GPU, and decrease the number of serial computations on the CPU. Also, the time taken to transform physical values to spherical
Parallelization of PANDA discrete ordinates code using spatial decomposition
Humbert, P.
2006-07-01
We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete Ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by directions and octants pipelining. The implementation of the algorithm is straightforward using MPI blocking point to point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7 benchmark of the OECD/NEA. (authors)
Development of Parallel Code for the Alaska Tsunami Forecast Model
NASA Astrophysics Data System (ADS)
Bahng, B.; Knight, W. R.; Whitmore, P.
2014-12-01
The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs; and, will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.
Composing Data Parallel Code for a SPARQL Graph Engine
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste; Haglin, David J.; Feo, John
2013-09-08
Big data analytics process large amount of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph based representation provides several benefits, such as the possibility to perform in memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which requires more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
Adaptive zero-tree structure for curved wavelet image coding
NASA Astrophysics Data System (ADS)
Zhang, Liang; Wang, Demin; Vincent, André
2006-02-01
We investigate the issue of efficient data organization and representation of the curved wavelet coefficients [curved wavelet transform (WT)]. We present an adaptive zero-tree structure that exploits the cross-subband similarity of the curved wavelet transform. In the embedded zero-tree wavelet (EZW) and the set partitioning in hierarchical trees (SPIHT), the parent-child relationship is defined in such a way that a parent has four children, restricted to a square of 2×2 pixels, the parent-child relationship in the adaptive zero-tree structure varies according to the curves along which the curved WT is performed. Five child patterns were determined based on different combinations of curve orientation. A new image coder was then developed based on this adaptive zero-tree structure and the set-partitioning technique. Experimental results using synthetic and natural images showed the effectiveness of the proposed adaptive zero-tree structure for encoding of the curved wavelet coefficients. The coding gain of the proposed coder can be up to 1.2 dB in terms of peak SNR (PSNR) compared to the SPIHT coder. Subjective evaluation shows that the proposed coder preserves lines and edges better than the SPIHT coder.
NASA Technical Reports Server (NTRS)
Hanebutte, Ulf R.; Joslin, Ronald D.; Zubair, Mohammad
1994-01-01
The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent in three-dimensional boundary-layer flows are computed with the PS-DNS code. By remapping the distributed data structure during the course of the calculation, optimized serial library routines can be utilized that substantially increase the computational performance. Although the remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40% for all performed calculations. By using appropriate compile options and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45% of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a 'real world' simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP for the same simulation. The scalability information provides estimated computational costs that match the actual costs relative to changes in the number of grid points.
CHOLLA: A New Massively Parallel Hydrodynamics Code for Astrophysical Simulation
NASA Astrophysics Data System (ADS)
Schneider, Evan E.; Robertson, Brant E.
2015-04-01
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳2563) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION
Schneider, Evan E.; Robertson, Brant E.
2015-04-15
We present Computational Hydrodynamics On ParaLLel Architectures (Cholla ), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256{sup 3}) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.
MPI parallelization of full PIC simulation code with Adaptive Mesh Refinement
NASA Astrophysics Data System (ADS)
Matsui, Tatsuki; Nunami, Masanori; Usui, Hideyuki; Moritaka, Toseo
2010-11-01
A new parallelization technique developed for PIC method with adaptive mesh refinement (AMR) is introduced. In AMR technique, the complicated cell arrangements are organized and managed as interconnected pointers with multiple resolution levels, forming a fully threaded tree structure as a whole. In order to retain this tree structure distributed over multiple processes, remote memory access, an extended feature of MPI2 standards, is employed. Another important feature of the present simulation technique is the domain decomposition according to the modified Morton ordering. This algorithm can group up the equal number of particle calculation loops, which allows for the better load balance. Using this advanced simulation code, preliminary results for basic physical problems are exhibited for the validity check, together with the benchmarks to test the performance and the scalability.
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2012 CFR
2012-01-01
... 1 General Provisions 1 2012-01-01 2012-01-01 false Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2010 CFR
2010-01-01
... 1 General Provisions 1 2010-01-01 2010-01-01 false Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2014 CFR
2014-01-01
... 1 General Provisions 1 2014-01-01 2012-01-01 true Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2011 CFR
2011-01-01
... 1 General Provisions 1 2011-01-01 2011-01-01 false Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
1 CFR 21.23 - Parallel citations of Code and Federal Register.
Code of Federal Regulations, 2013 CFR
2013-01-01
... 1 General Provisions 1 2013-01-01 2012-01-01 true Parallel citations of Code and Federal Register. 21.23 Section 21.23 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION... § 21.23 Parallel citations of Code and Federal Register. For parallel reference, the Code of...
Development of parallel DEM for the open source code MFIX
Gopalakrishnan, Pradeep; Tafti, Danesh
2013-02-01
The paper presents the development of a parallel Discrete Element Method (DEM) solver for the open source code, Multiphase Flow with Interphase eXchange (MFIX) based on the domain decomposition method. The performance of the code was evaluated by simulating a bubbling fluidized bed with 2.5 million particles. The DEM solver shows strong scalability up to 256 processors with an efficiency of 81%. Further, to analyze weak scaling, the static height of the fluidized bed was increased to hold 5 and 10 million particles. The results show that global communication cost increases with problem size while the computational cost remains constant. Further, the effects of static bed height on the bubble hydrodynamics and mixing characteristics are analyzed.
Development of a massively parallel parachute performance prediction code
Peterson, C.W.; Strickland, J.H.; Wolfe, W.P.; Sundberg, W.D.; McBride, D.D.
1997-04-01
The Department of Energy has given Sandia full responsibility for the complete life cycle (cradle to grave) of all nuclear weapon parachutes. Sandia National Laboratories is initiating development of a complete numerical simulation of parachute performance, beginning with parachute deployment and continuing through inflation and steady state descent. The purpose of the parachute performance code is to predict the performance of stockpile weapon parachutes as these parachutes continue to age well beyond their intended service life. A new massively parallel computer will provide unprecedented speed and memory for solving this complex problem, and new software will be written to treat the coupled fluid, structure and trajectory calculations as part of a single code. Verification and validation experiments have been proposed to provide the necessary confidence in the computations.
Dependent video coding using a tree representation of pixel dependencies
NASA Astrophysics Data System (ADS)
Amati, Luca; Valenzise, Giuseppe; Ortega, Antonio; Tubaro, Stefano
2011-09-01
Motion-compensated prediction induces a chain of coding dependencies between pixels in video. In principle, an optimal selection of encoding parameters (motion vectors, quantization parameters, coding modes) should take into account the whole temporal horizon of a GOP. However, in practical coding schemes, these choices are made on a frame-by-frame basis, thus with a possible loss of performance. In this paper we describe a tree-based model for pixelwise coding dependencies: each pixel in a frame is the child of a pixel in a previous reference frame. We show that some tree structures are more favorable than others from a rate-distortion perspective, e.g., because they entail a large descendance of pixels which are well predicted from a common ancestor. In those cases, a higher quality has to be assigned to pixels at the top of such trees. We promote the creation of these structures by adding a special discount term to the conventional Lagrangian cost adopted at the encoder. The proposed model can be implemented through a double-pass encoding procedure. Specifically, we devise heuristic cost functions to drive the selection of quantization parameters and of motion vectors, which can be readily implemented into a state-of-the-art H.264/AVC encoder. Our experiments demonstrate that coding efficiency is improved for video sequences with low motion, while there are no apparent gains for more complex motion. We argue that this is due to both the presence of complex encoder features not captured by the model, and to the complexity of the source to be encoded.
Parallel Monte Carlo Electron and Photon Transport Simulation Code (PMCEPT code)
NASA Astrophysics Data System (ADS)
Kum, Oyeon
2004-11-01
Simulations for customized cancer radiation treatment planning for each patient are very useful for both patient and doctor. These simulations can be used to find the most effective treatment with the least possible dose to the patient. This typical system, so called ``Doctor by Information Technology", will be useful to provide high quality medical services everywhere. However, the large amount of computing time required by the well-known general purpose Monte Carlo(MC) codes has prevented their use for routine dose distribution calculations for a customized radiation treatment planning. The optimal solution to provide ``accurate" dose distribution within an ``acceptable" time limit is to develop a parallel simulation algorithm on a beowulf PC cluster because it is the most accurate, efficient, and economic. I developed parallel MC electron and photon transport simulation code based on the standard MPI message passing interface. This algorithm solved the main difficulty of the parallel MC simulation (overlapped random number series in the different processors) using multiple random number seeds. The parallel results agreed well with the serial ones. The parallel efficiency approached 100% as was expected.
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, resolution of hyperspectral image has got a huge boost. Thus data size of hyperspectral image is becoming larger. In order to reduce their saving and transmission cost, lossless compression for hyperspectral image has become an important research topic. In recent years, large numbers of algorithms have been proposed to reduce the redundancy between different spectra. Among of them, the most classical and expansible algorithm is the Clustered Differential Pulse Code Modulation (CDPCM) algorithm. This algorithm contains three parts: first clusters all spectral lines, then trains linear predictors for each band. Secondly, use these predictors to predict pixels, and get the residual image by subtraction between original image and predicted image. Finally, encode the residual image. However, the process of calculating predictors is timecosting. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has been greatly developed. The capacity of GPU improves rapidly by increasing the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to achieve the calculation of predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally get a decent speedup.
Recent developments in DYNSUB: New models, code optimization and parallelization
Daeubler, M.; Trost, N.; Jimenez, J.; Sanchez, V.
2013-07-01
DYNSUB is a high-fidelity coupled code system consisting of the reactor simulator DYN3D and the sub-channel code SUBCHANFLOW. It describes nuclear reactor core behavior with pin-by-pin resolution for both steady-state and transient scenarios. In the course of the coupled code system's active development, super-homogenization (SPH) and generalized equivalence theory (GET) discontinuity factors may be computed with and employed in DYNSUB to compensate pin level homogenization errors. Because of the largely increased numerical problem size for pin-by-pin simulations, DYNSUB has bene fitted from HPC techniques to improve its numerical performance. DYNSUB's coupling scheme has been structurally revised. Computational bottlenecks have been identified and parallelized for shared memory systems using OpenMP. Comparing the elapsed time for simulating a PWR core with one-eighth symmetry under hot zero power conditions applying the original and the optimized DYNSUB using 8 cores, overall speed up factors greater than 10 have been observed. The corresponding reduction in execution time enables a routine application of DYNSUB to study pin level safety parameters for engineering sized cases in a scientific environment. (authors)
Time-Dependent, Parallel Neutral Particle Transport Code System.
BAKER, RANDAL S.
2009-09-10
Version 00 PARTISN (PARallel, TIme-Dependent SN) is the evolutionary successor to CCC-547/DANTSYS. The PARTISN code package is a modular computer program package designed to solve the time-independent or dependent multigroup discrete ordinates form of the Boltzmann transport equation in several different geometries. The modular construction of the package separates the input processing, the transport equation solving, and the post processing (or edit) functions into distinct code modules: the Input Module, the Solver Module, and the Edit Module, respectively. PARTISN is the evolutionary successor to the DANTSYSTM code system package. The Input and Edit Modules in PARTISN are very similar to those in DANTSYS. However, unlike DANTSYS, the Solver Module in PARTISN contains one, two, and three-dimensional solvers in a single module. In addition to the diamond-differencing method, the Solver Module also has Adaptive Weighted Diamond-Differencing (AWDD), Linear Discontinuous (LD), and Exponential Discontinuous (ED) spatial differencing methods. The spatial mesh may consist of either a standard orthogonal mesh or a block adaptive orthogonal mesh. The Solver Module may be run in parallel for two and three dimensional problems. One can now run 1-D problems in parallel using Energy Domain Decomposition (triggered by Block 5 input keyword npeg>0). EDD can also be used in 2-D/3-D with or without our standard Spatial Domain Decomposition. Both the static (fixed source or eigenvalue) and time-dependent forms of the transport equation are solved in forward or adjoint mode. In addition, PARTISN now has a probabilistic mode for Probability of Initiation (static) and Probability of Survival (dynamic) calculations. Vacuum, reflective, periodic, white, or inhomogeneous boundary conditions are solved. General anisotropic scattering and inhomogeneous sources are permitted. PARTISN solves the transport equation on orthogonal (single level or block-structured AMR) grids in 1-D (slab, two
A GPU accelerated Barnes-Hut tree code for FLASH4
NASA Astrophysics Data System (ADS)
Lukat, Gunther; Banerjee, Robi
2016-05-01
We present a GPU accelerated CUDA-C implementation of the Barnes Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. We use a MacLaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores with effective AMR to carry out performance tests also in comparison with previous gravity solvers. Depending on the setup and the GPU/CPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison to the gravity solvers implemented in the FLASH4 code. We find an overall speedup factor for full simulations of at least factor 1.6 up to a factor of 10.
A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.
Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael
2017-03-30
Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about 20 better than the fastest sequential algorithm and speed-up goes up to 30 40 on 64 threads.
NASA Astrophysics Data System (ADS)
Destefano, Anthony; Heerikhuisen, Jacob
2015-04-01
Fully 3D particle simulations can be a computationally and memory expensive task, especially when high resolution grid cells are required. The problem becomes further complicated when parallelization is needed. In this work we focus on computational methods to solve these difficulties. Hilbert curves are used to map the 3D particle space to the 1D contiguous memory space. This method of organization allows for minimized cache misses on the GPU as well as a sorted structure that is equivalent to an octal tree data structure. This type of sorted structure is attractive for uses in adaptive mesh implementations due to the logarithm search time. Implementations using the Message Passing Interface (MPI) library and NVIDIA's parallel computing platform CUDA will be compared, as MPI is commonly used on server nodes with many CPU's. We will also compare static grid structures with those of adaptive mesh structures. The physical test bed will be simulating heavy interstellar atoms interacting with a background plasma, the heliosphere, simulated from fully consistent coupled MHD/kinetic particle code. It is known that charge exchange is an important factor in space plasmas, specifically it modifies the structure of the heliosphere itself. We would like to thank the Alabama Supercomputer Authority for the use of their computational resources.
Breakdown of Spatial Parallel Coding in Children's Drawing
ERIC Educational Resources Information Center
De Bruyn, Bart; Davis, Alyson
2005-01-01
When drawing real scenes or copying simple geometric figures young children are highly sensitive to parallel cues and use them effectively. However, this sensitivity can break down in surprisingly simple tasks such as copying a single line where robust directional errors occur despite the presence of parallel cues. Before we can conclude that this…
An Optimal Parallel Algorithm for Constructing a Spanning Tree on Circular Permutation Graphs
NASA Astrophysics Data System (ADS)
Honma, Hirotoshi; Honma, Saki; Masuyama, Shigeru
The spanning tree problem is to find a tree that connects all the vertices of G. This problem has many applications, such as electric power systems, computer network design and circuit analysis. Klein and Stein demonstrated that a spanning tree can be found in O(log n) time with O(n + m) processors on the CRCW PRAM. In general, it is known that more efficient parallel algorithms can be developed by restricting classes of graphs. Circular permutation graphs properly contain the set of permutation graphs as a subclass and are first introduced by Rotem and Urrutia. They provided O(n2.376) time recognition algorithm. Circular permutation graphs and their models find several applications in VLSI layout. In this paper, we propose an optimal parallel algorithm for constructing a spanning tree on circular permutation graphs. It runs in O(log n) time with O(n/ log n) processors on the EREW PRAM.
Punctured Parallel and Serial Concatenated Convolutional Codes for BPSK/QPSK Channels
NASA Technical Reports Server (NTRS)
Acikel, Omer Fatih
1999-01-01
As available bandwidth for communication applications becomes scarce, bandwidth-efficient modulation and coding schemes become ever important. Since their discovery in 1993, turbo codes (parallel concatenated convolutional codes) have been the center of the attention in the coding community because of their bit error rate performance near the Shannon limit. Serial concatenated convolutional codes have also been shown to be as powerful as turbo codes. In this dissertation, we introduce algorithms for designing bandwidth-efficient rate r = k/(k + 1),k = 2, 3,..., 16, parallel and rate 3/4, 7/8, and 15/16 serial concatenated convolutional codes via puncturing for BPSK/QPSK (Binary Phase Shift Keying/Quadrature Phase Shift Keying) channels. Both parallel and serial concatenated convolutional codes have initially, steep bit error rate versus signal-to-noise ratio slope (called the -"cliff region"). However, this steep slope changes to a moderate slope with increasing signal-to-noise ratio, where the slope is characterized by the weight spectrum of the code. The region after the cliff region is called the "error rate floor" which dominates the behavior of these codes in moderate to high signal-to-noise ratios. Our goal is to design high rate parallel and serial concatenated convolutional codes while minimizing the error rate floor effect. The design algorithm includes an interleaver enhancement procedure and finds the polynomial sets (only for parallel concatenated convolutional codes) and the puncturing schemes that achieve the lowest bit error rate performance around the floor for the code rates of interest.
Adaptive Mesh Refinement Algorithms for Parallel Unstructured Finite Element Codes
Parsons, I D; Solberg, J M
2006-02-03
This project produced algorithms for and software implementations of adaptive mesh refinement (AMR) methods for solving practical solid and thermal mechanics problems on multiprocessor parallel computers using unstructured finite element meshes. The overall goal is to provide computational solutions that are accurate to some prescribed tolerance, and adaptivity is the correct path toward this goal. These new tools will enable analysts to conduct more reliable simulations at reduced cost, both in terms of analyst and computer time. Previous academic research in the field of adaptive mesh refinement has produced a voluminous literature focused on error estimators and demonstration problems; relatively little progress has been made on producing efficient implementations suitable for large-scale problem solving on state-of-the-art computer systems. Research issues that were considered include: effective error estimators for nonlinear structural mechanics; local meshing at irregular geometric boundaries; and constructing efficient software for parallel computing environments.
Bit-parallel ASCII code artificial numeric keypad
Hale, G.M.
1981-03-01
Seven integrated circuits and a voltage regulator are combined with twelve reed relays to allow the ASCII encoded numerals 0 through 9 and characters ''.'' and R or S to momentarily close switches to an applications device, simulating keypad switch closures. This invention may be used as a PARALLEL TLL (Transistor Transistor Logic) data acqusition interface to a standard Hewlett-Packard HP-97 Calculator modified with a cable.
NASA Technical Reports Server (NTRS)
OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
1998-01-01
This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Load-balancing techniques for a parallel electromagnetic particle-in-cell code
PLIMPTON,STEVEN J.; SEIDEL,DAVID B.; PASIK,MICHAEL F.; COATS,REBECCA S.
2000-01-01
QUICKSILVER is a 3-d electromagnetic particle-in-cell simulation code developed and used at Sandia to model relativistic charged particle transport. It models the time-response of electromagnetic fields and low-density-plasmas in a self-consistent manner: the fields push the plasma particles and the plasma current modifies the fields. Through an LDRD project a new parallel version of QUICKSILVER was created to enable large-scale plasma simulations to be run on massively-parallel distributed-memory supercomputers with thousands of processors, such as the Intel Tflops and DEC CPlant machines at Sandia. The new parallel code implements nearly all the features of the original serial QUICKSILVER and can be run on any platform which supports the message-passing interface (MPI) standard as well as on single-processor workstations. This report describes basic strategies useful for parallelizing and load-balancing particle-in-cell codes, outlines the parallel algorithms used in this implementation, and provides a summary of the modifications made to QUICKSILVER. It also highlights a series of benchmark simulations which have been run with the new code that illustrate its performance and parallel efficiency. These calculations have up to a billion grid cells and particles and were run on thousands of processors. This report also serves as a user manual for people wishing to run parallel QUICKSILVER.
BHARDWAJ, MANLJ K.; REESE,GARTH M.; DRIESSEN,BRIAN; ALVIN,KENNETH F.; DAY,DAVID M.
2000-04-06
As computational needs for structural finite element analysis increase, a robust implicit structural dynamics code is needed which can handle millions of degrees of freedom in the model and produce results with quick turn around time. A parallel code is needed to avoid limitations of serial platforms. Salinas is an implicit structural dynamics code specifically designed for massively parallel platforms. It computes the structural response of very large complex structures and provides solutions faster than any existing serial machine. This paper gives a current status of Salinas and uses demonstration problems to show Salinas' performance.
ANNarchy: a code generation approach to neural simulations on parallel hardware
Vitay, Julien; Dinkelbach, Helge Ü.; Hamker, Fred H.
2015-01-01
Many modern neural simulators focus on the simulation of networks of spiking neurons on parallel hardware. Another important framework in computational neuroscience, rate-coded neural networks, is mostly difficult or impossible to implement using these simulators. We present here the ANNarchy (Artificial Neural Networks architect) neural simulator, which allows to easily define and simulate rate-coded and spiking networks, as well as combinations of both. The interface in Python has been designed to be close to the PyNN interface, while the definition of neuron and synapse models can be specified using an equation-oriented mathematical description similar to the Brian neural simulator. This information is used to generate C++ code that will efficiently perform the simulation on the chosen parallel hardware (multi-core system or graphical processing unit). Several numerical methods are available to transform ordinary differential equations into an efficient C++code. We compare the parallel performance of the simulator to existing solutions. PMID:26283957
Nyx: A MASSIVELY PARALLEL AMR CODE FOR COMPUTATIONAL COSMOLOGY
Almgren, Ann S.; Bell, John B.; Lijewski, Mike J.; Lukic, Zarija; Van Andel, Ethan
2013-03-01
We present a new N-body and gas dynamics code, called Nyx, for large-scale cosmological simulations. Nyx follows the temporal evolution of a system of discrete dark matter particles gravitationally coupled to an inviscid ideal fluid in an expanding universe. The gas is advanced in an Eulerian framework with block-structured adaptive mesh refinement; a particle-mesh scheme using the same grid hierarchy is used to solve for self-gravity and advance the particles. Computational results demonstrating the validation of Nyx on standard cosmological test problems, and the scaling behavior of Nyx to 50,000 cores, are presented.
A parallelization approach to the COBRA-TF thermal-hydraulic subchannel code
NASA Astrophysics Data System (ADS)
Ramos, Enrique; Abarca, Agustín; Roman, Jose E.; Miró, Rafael
2014-06-01
In order to reduce the response time when simulating large reactors in detail, we have developed a parallel version of the thermal-hydraulic subchannel code COBRA-TF, with standard message passing technology (MPI). The parallelization is oriented to reactor cells, so it is best suited for models consisting of many cells. The generation of the Jacobian is parallelized, in such a way that each processor is in charge of generating the data associated to a subset of cells. Also, the solution of the linear system of equations is done in parallel, using the PETSc toolkit.
Image coding using parallel implementations of the embedded zerotree wavelet algorithm
NASA Astrophysics Data System (ADS)
Creusere, Charles D.
1996-03-01
We explore here the implementation of Shapiro's embedded zerotree wavelet (EZW) image coding algorithms on an array of parallel processors. To this end, we first consider the problem of parallelizing the basic wavelet transform, discussing past work in this area and the compatibility of that work with the zerotree coding process. From this discussion, we present a parallel partitioning of the transform which is computationally efficient and which allows the wavelet coefficients to be coded with little or no additional inter-processor communication. The key to achieving low data dependence between the processors is to ensure that each processor contains only entire zerotrees of wavelet coefficients after the decomposition is complete. We next quantify the rate-distortion tradeoffs associated with different levels of parallelization for a few variations of the basic coding algorithm. Studying these results, we conclude that the quality of the coder decreases as the number of parallel processors used to implement it increases. Noting that the performance of the parallel algorithm might be unacceptably poor for large processor arrays, we also develop an alternate algorithm which always achieves the same rate-distortion performance as the original sequential EZW algorithm at the cost of higher complexity and reduced scalability.
PARALLEL IMPLEMENTATION OF THE TOPAZ OPACITY CODE: ISSUES IN LOAD-BALANCING
Sonnad, V; Iglesias, C A
2008-05-12
The TOPAZ opacity code explicitly includes configuration term structure in the calculation of bound-bound radiative transitions. This approach involves myriad spectral lines and requires the large computational capabilities of parallel processing computers. It is important, however, to make use of these resources efficiently. For example, an increase in the number of processors should yield a comparable reduction in computational time. This proportional 'speedup' indicates that very large problems can be addressed with massively parallel computers. Opacity codes can readily take advantage of parallel architecture since many intermediate calculations are independent. On the other hand, since the different tasks entail significantly disparate computational effort, load-balancing issues emerge so that parallel efficiency does not occur naturally. Several schemes to distribute the labor among processors are discussed.
A portable, parallel, object-oriented Monte Carlo neutron transport code in C++
Lee, S.R.; Cummings, J.C.; Nolen, S.D. |
1997-05-01
We have developed a multi-group Monte Carlo neutron transport code using C++ and the Parallel Object-Oriented Methods and Applications (POOMA) class library. This transport code, called MC++, currently computes k and {alpha}-eigenvalues and is portable to and runs parallel on a wide variety of platforms, including MPPs, clustered SMPs, and individual workstations. It contains appropriate classes and abstractions for particle transport and, through the use of POOMA, for portable parallelism. Current capabilities of MC++ are discussed, along with physics and performance results on a variety of hardware, including all Accelerated Strategic Computing Initiative (ASCI) hardware. Current parallel performance indicates the ability to compute {alpha}-eigenvalues in seconds to minutes rather than hours to days. Future plans and the implementation of a general transport physics framework are also discussed.
Omega3P: A Parallel Finite-Element Eigenmode Analysis Code for Accelerator Cavities
Lee, Lie-Quan; Li, Zenghai; Ng, Cho; Ko, Kwok; /SLAC
2009-03-04
Omega3P is a parallel eigenmode calculation code for accelerator cavities in frequency domain analysis using finite-element methods. In this report, we will present detailed finite-element formulations and resulting eigenvalue problems for lossless cavities, cavities with lossy materials, cavities with imperfectly conducting surfaces, and cavities with waveguide coupling. We will discuss the parallel algorithms for solving those eigenvalue problems and demonstrate modeling of accelerator cavities through different examples.
The CERBERUS code: Experiments with parallel processing using RELAP5/MOD3
Makowitz, H. )
1989-01-01
CERBERUS, a six-equation parallel thermal-hydraulic system simulation code, is being developed at the Idaho National Engineering Laboratory (INEL). CERBERUS Ver.00 performs parallel computations only for the heat transfer model. It is projected that CERBERUS Ver.01 will have a parallel heat transfer and hydraulic module, excluding the matrix solver, and CERBERUS Ver.02 will contain Ver.01 plus the solver. Three implementations of the CERBERUS Ver.00 code with constructs of varying overhead have been developed using a META language. These implementations are under study on shared-memory Cray-like computer architectures. Results for the hybrid code version, which utilizes all three construct sets simultaneously (i.e., CRAY AUTO, MICRO, and MULTI TASKING) on 2- and 8-CPU Cray machines, indicate the importance of load balancing for overhead reduction, and indicate that greater speedup factors may be achievable than previously believed with a RELAP-based parallel code. Extrapolations based on Y-MP/832 overhead measurements indicate that a speedup factor of > 10 may be obtainable with the CERBERUS Ver.02 code on a 16-CPU machine.
Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1
NASA Technical Reports Server (NTRS)
Wright, Michael J.; White, Todd; Mangini, Nancy
2009-01-01
Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamic (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics incorporated into the code. DPLR also includes a large selection of generalized realistic surface boundary conditions and links to enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
Assessing the performance of a parallel MATLAB-based 3D convection code
NASA Astrophysics Data System (ADS)
Kirkpatrick, G. J.; Hasenclever, J.; Phipps Morgan, J.; Shi, C.
2008-12-01
We are currently building 2D and 3D MATLAB-based parallel finite element codes for mantle convection and melting. The codes use the MATLAB implementation of core MPI commands (eg. Send, Receive, Broadcast) for message passing between computational subdomains. We have found that code development and algorithm testing are much faster in MATLAB than in our previous work coding in C or FORTRAN, this code was built from scratch with only 12 man-months of effort. The one extra cost w.r.t. C coding on a Beowulf cluster is the cost of the parallel MATLAB license for a >4core cluster. Here we present some preliminary results on the efficiency of MPI messaging in MATLAB on a small 4 machine, 16core, 32Gb RAM Intel Q6600 processor-based cluster. Our code implements fully parallelized preconditioned conjugate gradients with a multigrid preconditioner. Our parallel viscous flow solver is currently 20% slower for a 1,000,000 DOF problem on a single core in 2D as the direct solve MILAMIN MATLAB viscous flow solver. We have tested both continuous and discontinuous pressure formulations. We test with various configurations of network hardware, CPU speeds, and memory using our own and MATLAB's built in cluster profiler. So far we have only explored relatively small (up to 1.6GB RAM) test problems. We find that with our current code and Intel memory controller bandwidth limitations we can only get ~2.3 times performance out of 4 cores than 1 core per machine. Even for these small problems the code runs faster with message passing between 4 machines with one core each than 1 machine with 4 cores and internal messaging (1.29x slower), or 1 core (2.15x slower). It surprised us that for 2D ~1GB-sized problems with only 3 multigrid levels, the direct- solve on the coarsest mesh consumes comparable time to the iterative solve on the finest mesh - a penalty that is greatly reduced either by using a 4th multigrid level or by using an iterative solve at the coarsest grid level. We plan to
Code Optimization and Parallelization on the Origins: Looking from Users' Perspective
NASA Technical Reports Server (NTRS)
Chang, Yan-Tyng Sherry; Thigpen, William W. (Technical Monitor)
2002-01-01
Parallel machines are becoming the main compute engines for high performance computing. Despite their increasing popularity, it is still a challenge for most users to learn the basic techniques to optimize/parallelize their codes on such platforms. In this paper, we present some experiences on learning these techniques for the Origin systems at the NASA Advanced Supercomputing Division. Emphasis of this paper will be on a few essential issues (with examples) that general users should master when they work with the Origins as well as other parallel systems.
CODE BLUE: Three dimensional massively-parallel simulation of multi-scale configurations
NASA Astrophysics Data System (ADS)
Juric, Damir; Kahouadji, Lyes; Chergui, Jalel; Shin, Seungwon; Craster, Richard; Matar, Omar
2016-11-01
We present recent progress on BLUE, a solver for massively parallel simulations of fully three-dimensional multiphase flows which runs on a variety of computer architectures from laptops to supercomputers and on 131072 threads or more (limited only by the availability to us of more threads). The code is wholly written in Fortran 2003 and uses a domain decomposition strategy for parallelization with MPI. The fluid interface solver is based on a parallel implementation of a hybrid Front Tracking/Level Set method designed to handle highly deforming interfaces with complex topology changes. We developed parallel GMRES and multigrid iterative solvers suited to the linear systems arising from the implicit solution for the fluid velocities and pressure in the presence of strong density and viscosity discontinuities across fluid phases. Particular attention is drawn to the details and performance of the parallel Multigrid solver. EPSRC UK Programme Grant MEMPHIS (EP/K003976/1).
Understanding Performance of Parallel Scientific Simulation Codes using Open|SpeedShop
Ghosh, K K
2011-11-07
Conclusions of this presentation are: (1) Open SpeedShop's (OSS) is convenient to use for large, parallel, scientific simulation codes; (2) Large codes benefit from uninstrumented execution; (3) Many experiments can be run in a short time - might need multiple shots e.g. usertime for caller-callee, hwcsamp for HW counters; (4) Decent idea of code's performance is easily obtained; (5) Statistical sampling calls for decent number of samples; and (6) HWC data is very useful for micro-analysis but can be tricky to analyze.
Recent development for the ITS code system: Parallel processing and visualization
Fan, W.C.; Turner, C.D.; Halbleib, J.A. Sr.; Kensek, R.P.
1996-03-01
A brief overview is given for two software developments related to the ITS code system. These developments provide parallel processing and visualization capabilities and thus allow users to perform ITS calculations more efficiently. Timing results and a graphical example are presented to demonstrate these capabilities.
User's Guide for TOUGH2-MP - A Massively Parallel Version of the TOUGH2 Code
Earth Sciences Division; Zhang, Keni; Zhang, Keni; Wu, Yu-Shu; Pruess, Karsten
2008-05-27
TOUGH2-MP is a massively parallel (MP) version of the TOUGH2 code, designed for computationally efficient parallel simulation of isothermal and nonisothermal flows of multicomponent, multiphase fluids in one, two, and three-dimensional porous and fractured media. In recent years, computational requirements have become increasingly intensive in large or highly nonlinear problems for applications in areas such as radioactive waste disposal, CO2 geological sequestration, environmental assessment and remediation, reservoir engineering, and groundwater hydrology. The primary objective of developing the parallel-simulation capability is to significantly improve the computational performance of the TOUGH2 family of codes. The particular goal for the parallel simulator is to achieve orders-of-magnitude improvement in computational time for models with ever-increasing complexity. TOUGH2-MP is designed to perform parallel simulation on multi-CPU computational platforms. An earlier version of TOUGH2-MP (V1.0) was based on the TOUGH2 Version 1.4 with EOS3, EOS9, and T2R3D modules, a software previously qualified for applications in the Yucca Mountain project, and was designed for execution on CRAY T3E and IBM SP supercomputers. The current version of TOUGH2-MP (V2.0) includes all fluid property modules of the standard version TOUGH2 V2.0. It provides computationally efficient capabilities using supercomputers, Linux clusters, or multi-core PCs, and also offers many user-friendly features. The parallel simulator inherits all process capabilities from V2.0 together with additional capabilities for handling fractured media from V1.4. This report provides a quick starting guide on how to set up and run the TOUGH2-MP program for users with a basic knowledge of running the (standard) version TOUGH2 code, The report also gives a brief technical description of the code, including a discussion of parallel methodology, code structure, as well as mathematical and numerical methods used
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Poole, G.; Heroux, M.
1994-12-31
This paper will focus on recent work in two widely used industrial applications codes with iterative methods. The ANSYS program, a general purpose finite element code widely used in structural analysis applications, has now added an iterative solver option. Some results are given from real applications comparing performance with the tradition parallel/vector frontal solver used in ANSYS. Discussion of the applicability of iterative solvers as a general purpose solver will include the topics of robustness, as well as memory requirements and CPU performance. The FIDAP program is a widely used CFD code which uses iterative solvers routinely. A brief description of preconditioners used and some performance enhancements for CRAY parallel/vector systems is given. The solution of large-scale applications in structures and CFD includes examples from industry problems solved on CRAY systems.
Parallel Grand Canonical Monte Carlo (ParaGrandMC) Simulation Code
NASA Technical Reports Server (NTRS)
Yamakov, Vesselin I.
2016-01-01
This report provides an overview of the Parallel Grand Canonical Monte Carlo (ParaGrandMC) simulation code. This is a highly scalable parallel FORTRAN code for simulating the thermodynamic evolution of metal alloy systems at the atomic level, and predicting the thermodynamic state, phase diagram, chemical composition and mechanical properties. The code is designed to simulate multi-component alloy systems, predict solid-state phase transformations such as austenite-martensite transformations, precipitate formation, recrystallization, capillary effects at interfaces, surface absorption, etc., which can aid the design of novel metallic alloys. While the software is mainly tailored for modeling metal alloys, it can also be used for other types of solid-state systems, and to some degree for liquid or gaseous systems, including multiphase systems forming solid-liquid-gas interfaces.
[Series: Medical Applications of the PHITS Code (2): Acceleration by Parallel Computing].
Furuta, Takuya; Sato, Tatsuhiko
2015-01-01
Time-consuming Monte Carlo dose calculation becomes feasible owing to the development of computer technology. However, the recent development is due to emergence of the multi-core high performance computers. Therefore, parallel computing becomes a key to achieve good performance of software programs. A Monte Carlo simulation code PHITS contains two parallel computing functions, the distributed-memory parallelization using protocols of message passing interface (MPI) and the shared-memory parallelization using open multi-processing (OpenMP) directives. Users can choose the two functions according to their needs. This paper gives the explanation of the two functions with their advantages and disadvantages. Some test applications are also provided to show their performance using a typical multi-core high performance workstation.
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
Parallel level-set methods on adaptive tree-based grids
NASA Astrophysics Data System (ADS)
Mirzadeh, Mohammad; Guittet, Arthur; Burstedde, Carsten; Gibou, Frederic
2016-10-01
We present scalable algorithms for the level-set method on dynamic, adaptive Quadtree and Octree Cartesian grids. The algorithms are fully parallelized and implemented using the MPI standard and the open-source p4est library. We solve the level set equation with a semi-Lagrangian method which, similar to its serial implementation, is free of any time-step restrictions. This is achieved by introducing a scalable global interpolation scheme on adaptive tree-based grids. Moreover, we present a simple parallel reinitialization scheme using the pseudo-time transient formulation. Both parallel algorithms scale on the Stampede supercomputer, where we are currently using up to 4096 CPU cores, the limit of our current account. Finally, a relevant application of the algorithms is presented in modeling a crystallization phenomenon by solving a Stefan problem, illustrating a level of detail that would be impossible to achieve without a parallel adaptive strategy. We believe that the algorithms presented in this article will be of interest and useful to researchers working with the level-set framework and modeling multi-scale physics in general.
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Rizk, Yehia M.
1999-01-01
The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable on distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of the "Gauss-Sidel" fashion. The programming efforts for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each
Parallelization issues of a code for physically-based simulation of fabrics
NASA Astrophysics Data System (ADS)
Romero, Sergio; Gutiérrez, Eladio; Romero, Luis F.; Plata, Oscar; Zapata, Emilio L.
2004-10-01
The simulation of fabrics, clothes, and flexible materials is an essential topic in computer animation of realistic virtual humans and dynamic sceneries. New emerging technologies, as interactive digital TV and multimedia products, make necessary the development of powerful tools to perform real-time simulations. Parallelism is one of such tools. When analyzing computationally fabric simulations we found these codes belonging to the complex class of irregular applications. Frequently this kind of codes includes reduction operations in their core, so that an important fraction of the computational time is spent on such operations. In fabric simulators these operations appear when evaluating forces, giving rise to the equation system to be solved. For this reason, this paper discusses only this phase of the simulation. This paper analyzes and evaluates different irregular reduction parallelization techniques on ccNUMA shared memory machines, applied to a real, physically-based, fabric simulator we have developed. Several issues are taken into account in order to achieve high code performance, as exploitation of data access locality and parallelism, as well as careful use of memory resources (memory overhead). In this paper we use the concept of data affinity to develop various efficient algorithms for reduction parallelization exploiting data locality.
A visual parallel-BCI speller based on the time-frequency coding strategy
NASA Astrophysics Data System (ADS)
Xu, Minpeng; Chen, Long; Zhang, Lixin; Qi, Hongzhi; Ma, Lan; Tang, Jiabei; Wan, Baikun; Ming, Dong
2014-04-01
Objective. Spelling is one of the most important issues in brain-computer interface (BCI) research. This paper is to develop a visual parallel-BCI speller system based on the time-frequency coding strategy in which the sub-speller switching among four simultaneously presented sub-spellers and the character selection are identified in a parallel mode. Approach. The parallel-BCI speller was constituted by four independent P300+SSVEP-B (P300 plus SSVEP blocking) spellers with different flicker frequencies, thereby all characters had a specific time-frequency code. To verify its effectiveness, 11 subjects were involved in the offline and online spellings. A classification strategy was designed to recognize the target character through jointly using the canonical correlation analysis and stepwise linear discriminant analysis. Main results. Online spellings showed that the proposed parallel-BCI speller had a high performance, reaching the highest information transfer rate of 67.4 bit min-1, with an average of 54.0 bit min-1 and 43.0 bit min-1 in the three rounds and five rounds, respectively. Significance. The results indicated that the proposed parallel-BCI could be effectively controlled by users with attention shifting fluently among the sub-spellers, and highly improved the BCI spelling performance.
Software tools for developing parallel applications. Part 1: Code development and debugging
Brown, J.; Geist, A.; Pancake, C.; Rover, D.
1997-04-01
Developing an application for parallel computers can be a lengthy and frustrating process making it a perfect candidate for software tool support. Yet application programmers are often the last to hear about new tools emerging from R and D efforts. This paper provides an overview of two focuses of tool support: code development and debugging. Each is discussed in terms of the programmer needs addressed, the extent to which representative current tools meet those needs, and what new levels of tool support are important if parallel computing is to become more widespread.
NASA Astrophysics Data System (ADS)
Li, Xujing; Zheng, Weiying
2016-10-01
A new parallel code based on discontinuous Galerkin (DG) method for hyperbolic conservation laws on three dimensional unstructured meshes is developed recently. This code can be used for simulations of MHD equations, which are very important in magnetic confined plasma research. The main challenges in MHD simulations in fusion include the complex geometry of the configurations, such as plasma in tokamaks, the possibly discontinuous solutions and large scale computing. Our new developed code is based on three dimensional unstructured meshes, i.e. tetrahedron. This makes the code flexible to arbitrary geometries. Second order polynomials are used on each element and HWENO type limiter are applied. The accuracy tests show that our scheme reaches the desired three order accuracy and the nonlinear shock test demonstrate that our code can capture the sharp shock transitions. Moreover, One of the advantages of DG compared with the classical finite element methods is that the matrices solved are localized on each element, making it easy for parallelization. Several simulations including the kink instabilities in toroidal geometry will be present here. Chinese National Magnetic Confinement Fusion Science Program 2015GB110003.
Quinlan, D; Barany, G; Panas, T
2007-08-30
Many forms of security analysis on large scale applications can be substantially automated but the size and complexity can exceed the time and memory available on conventional desktop computers. Most commercial tools are understandably focused on such conventional desktop resources. This paper presents research work on the parallelization of security analysis of both source code and binaries within our Compass tool, which is implemented using the ROSE source-to-source open compiler infrastructure. We have focused on both shared and distributed memory parallelization of the evaluation of rules implemented as checkers for a wide range of secure programming rules, applicable to desktop machines, networks of workstations and dedicated clusters. While Compass as a tool focuses on source code analysis and reports violations of an extensible set of rules, the binary analysis work uses the exact same infrastructure but is less well developed into an equivalent final tool.
Chin, George; Choudhury, Sutanay; Kangas, Lars J.; McFarlane, Sally A.; Marquez, Andres
2011-09-01
Long viewed as a strong statistical inference technique, Bayesian networks have emerged to be an important class of applications for high-performance computing. We have applied an architecture-conscious approach to parallelizing the Lauritzen-Spiegelhalter Junction Tree algorithm for exact inferencing in Bayesian networks. In optimizing the Junction Tree algorithm, we have implemented both in-clique and topological parallelism strategies to best leverage the fine-grained synchronization and massive-scale multithreading of the Cray XMT architecture. Two topological techniques were developed to parallelize the evidence propagation process through the Bayesian network. One technique involves performing intelligent scheduling of junction tree nodes based on its topology and relative size. The second technique involves decomposing the junction tree into a much finer tree-like representation to offer much more opportunities for parallelism. We evaluate these optimizations on five different Bayesian networks and report our findings and observations. Another important contribution of this paper is to demonstrate the application of massive-scale multithreading for load balancing and use of implicit parallelism-based compiler optimizations in designing scalable inferencing algorithms.
A PARALLEL MONTE CARLO CODE FOR SIMULATING COLLISIONAL N-BODY SYSTEMS
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-02-15
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N {approx} 10{sup 7} particles. Our code is based on the Henon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 10{sup 5} to 10{sup 7}. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within {approx}< 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 10{sup 5}, 128 for N = 10{sup 6} and 256 for N = 10{sup 7}. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60 Multiplication-Sign , 100 Multiplication-Sign , and 220 Multiplication-Sign , respectively.
Suarez, J.; Stamatiadis, S.; Farantos, S. C.; Lathouwers, L.
2009-08-13
Reproducing molecular dynamics is at the root of the basic principles of chemical change and physical properties of the matter. New insight on molecular encounters can be gained by solving the Schroedinger equation in cartesian coordinates, provided one can overcome the massive calculations that it implies. We have developed a parallel code for solving the molecular Time Dependent Schroedinger Equation (TDSE) in cartesian coordinates. Variable order Finite Difference methods result in sparse Hamiltonian matrices which can make the large scale problem solving feasible.
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
NASA Astrophysics Data System (ADS)
Pattabiraman, Bharath; Umbreit, Stefan; Liao, Wei-keng; Choudhary, Alok; Kalogera, Vassiliki; Memik, Gokhan; Rasio, Frederic A.
2013-02-01
We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N ~ 107 particles. Our code is based on the Hénon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures and the introduction of a parallel random number generation scheme as well as a parallel sorting algorithm required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. Our implementation uses the Message Passing Interface library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude from 105 to 107. We find that our results are in good agreement with self-similar core-collapse solutions, and the core-collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within <~ 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N = 105, 128 for N = 106 and 256 for N = 107. The runtime reaches saturation with the addition of processors beyond these limits, which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60×, 100×, and 220×, respectively.
Short Communication: A Parallel Newton-Krylov Method for Navier-Stokes Rotorcraft Codes
NASA Astrophysics Data System (ADS)
Ekici, Kivanc; Lyrintzis, Anastasios S.
2003-05-01
The application of Krylov subspace iterative methods to unsteady three-dimensional Navier-Stokes codes on massively parallel and distributed computing environments is investigated. Previously, the Euler mode of the Navier-Stokes flow solver Transonic Unsteady Rotor Navier-Stokes (TURNS) has been coupled with a Newton-Krylov scheme which uses two Conjugate-Gradient-like (CG) iterative methods. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. Parallel implicit operators are used and compared as preconditioners. Results are presented for two-dimensional and three-dimensional viscous cases. The Message Passing Interface (MPI) protocol is used, because of its portability to various parallel architectures.
Self-Scheduling Parallel Methods for Multiple Serial Codes with Application to WOPWOP
NASA Technical Reports Server (NTRS)
Long, Lyle N.; Brentner, Kenneth S.
2000-01-01
This paper presents a scheme for efficiently running a large number of serial jobs on parallel computers. Two examples are given of computer programs that run relatively quickly, but often they must be run numerous times to obtain all the results needed. It is very common in science and engineering to have codes that are not massive computing challenges in themselves, but due to the number of instances that must be run, they do become large-scale computing problems. The two examples given here represent common problems in aerospace engineering: aerodynamic panel methods and aeroacoustic integral methods. The first example simply solves many systems of linear equations. This is representative of an aerodynamic panel code where someone would like to solve for numerous angles of attack. The complete code for this first example is included in the appendix so that it can be readily used by others as a template. The second example is an aeroacoustics code (WOPWOP) that solves the Ffowcs Williams Hawkings equation to predict the far-field sound due to rotating blades. In this example, one quite often needs to compute the sound at numerous observer locations, hence parallelization is utilized to automate the noise computation for a large number of observers.
Recent Improvements to the IMPACT-T Parallel Particle TrackingCode
Qiang, J.; Pogorelov, I.V.; Ryne, R.
2006-11-16
The IMPACT-T code is a parallel three-dimensional quasi-static beam dynamics code for modeling high brightness beams in photoinjectors and RF linacs. Developed under the US DOE Scientific Discovery through Advanced Computing (SciDAC) program, it includes several key features including a self-consistent calculation of 3D space-charge forces using a shifted and integrated Green function method, multiple energy bins for beams with large energy spread, and models for treating RF standing wave and traveling wave structures. In this paper, we report on recent improvements to the IMPACT-T code including modeling traveling wave structures, short-range transverse and longitudinal wakefields, and longitudinal coherent synchrotron radiation through bending magnets.
Korall, Petra; Pryer, Kathleen M; Metzgar, Jordan S; Schneider, Harald; Conant, David S
2006-06-01
Tree ferns are a well-established clade within leptosporangiate ferns. Most of the 600 species (in seven families and 13 genera) are arborescent, but considerable morphological variability exists, spanning the giant scaly tree ferns (Cyatheaceae), the low, erect plants (Plagiogyriaceae), and the diminutive endemics of the Guayana Highlands (Hymenophyllopsidaceae). In this study, we investigate phylogenetic relationships within tree ferns based on analyses of four protein-coding, plastid loci (atpA, atpB, rbcL, and rps4). Our results reveal four well-supported clades, with genera of Dicksoniaceae (sensu ) interspersed among them: (A) (Loxomataceae, (Culcita, Plagiogyriaceae)), (B) (Calochlaena, (Dicksonia, Lophosoriaceae)), (C) Cibotium, and (D) Cyatheaceae, with Hymenophyllopsidaceae nested within. How these four groups are related to one other, to Thyrsopteris, or to Metaxyaceae is weakly supported. Our results show that Dicksoniaceae and Cyatheaceae, as currently recognised, are not monophyletic and new circumscriptions for these families are needed.
A distributed coding approach for stereo sequences in the tree structured Haar transform domain
NASA Astrophysics Data System (ADS)
Cancellaro, M.; Carli, M.; Neri, A.
2009-02-01
In this contribution, a novel method for distributed video coding for stereo sequences is proposed. The system encodes independently the left and right frames of the stereoscopic sequence. The decoder exploits the side information to achieve the best reconstruction of the correlated video streams. In particular, a syndrome coder approach based on a lifted Tree Structured Haar wavelet scheme has been adopted. The experimental results show the effectiveness of the proposed scheme.
A new technique for a parallel dealiased pseudospectral Navier-Stokes code
NASA Astrophysics Data System (ADS)
Iovieno, Michele; Cavazzoni, Carlo; Tordella, Daniela
2001-12-01
A novel aspect of a parallel procedure for the numerical simulation of the solution of the Navier-Stokes equations through the Fourier-Galerkin pseudospectral method is presented. It consists of a dealiased ("3/2" rule) transposition of the data that organizes the computations in the distributed direction in such a way that whenever a Fast Fourier Transform must be calculated, the algorithm will employ data stored solely on the proper memory of the processor which is computing it. This provide for the employment of standard routines for the computations of the Fourier transform. The aliasing removal procedure has been directly inserted into the transposition algorithm. The code is written for distributed memory computers, but not specifically for a peculiar architecture. The use on a variety of machines is allowed by the adoption of the Message Passing Interface library. The portability of the code is demonstrated by the similar performances, in particular the high efficiency, that all the machines tested show up to a number of parallel processors equal to 1/2 the truncation parameter N/2. Explicit time integration is used. The present code organization is relevant to physical and mathematical problems which require a three dimensional spectral treatment.
Xu, Lijun; Chen, Jianjun; Cao, Zhang; Liu, Xingbin; Hu, Jinhai
2014-07-01
In this paper, a quasi-parallel inductive-capacitive (LC) resonance method is proposed to improve the recovery of MIL-STD-1553 Manchester code with several frequency components from attenuated, distorted, and drifted signal for data telemetry in well logging, and corresponding telemetry system is developed. Required resonant frequency and quality factor are derived, and the quasi-parallel LC resonant circuit is established at the receiving end of the logging cable to suppress the low-pass filtering effect caused by the distributed capacitance of the cable and provide balanced pass for all the three frequency components of the Manchester code. The performance of the method for various encoding frequencies and cable lengths at different bit energy to noise density ratios (Eb/No) have been evaluated in the simulation. A 5 km single-core cable used in on-site well logging and various encoding frequencies were employed to verify the proposed telemetry system in the experiment. Results obtained demonstrate that the telemetry system is feasible and effective to improve the code recovery in terms of anti-attenuation, anti-distortion, and anti-drift performances, decrease the bit error rate, and increase the reachable transmission rate and distance greatly.
GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling
NASA Astrophysics Data System (ADS)
Miki, Yohei; Umemura, Masayuki
2017-04-01
The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which adopts both the tree method and the hierarchical time step. The code adopts some adaptive optimizations by monitoring the execution time of each function on-the-fly and minimizes the time-to-solution by balancing the measured time of multiple functions. Results of performance measurements with realistic particle distribution performed on NVIDIA Tesla M2090, K20X, and GeForce GTX TITAN X, which are representative GPUs of the Fermi, Kepler, and Maxwell generation of GPUs, show that the hierarchical time step achieves a speedup by a factor of around 3-5 times compared to the shared time step. The measured elapsed time per step of GOTHIC is 0.30 s or 0.44 s on GTX TITAN X when the particle distribution represents the Andromeda galaxy or the NFW sphere, respectively, with 224 = 16,777,216 particles. The averaged performance of the code corresponds to 10-30% of the theoretical single precision peak performance of the GPU.
Context Tree-Based Image Contour Coding Using a Geometric Prior
NASA Astrophysics Data System (ADS)
Zheng, Amin; Cheung, Gene; Florencio, Dinei
2017-02-01
If object contours in images are coded efficiently as side information, then they can facilitate advanced image / video coding techniques, such as graph Fourier transform coding or motion prediction of arbitrarily shaped pixel blocks. In this paper, we study the problem of lossless and lossy compression of detected contours in images. Specifically, we first convert a detected object contour composed of contiguous between-pixel edges to a sequence of directional symbols drawn from a small alphabet. To encode the symbol sequence using arithmetic coding, we compute an optimal variable-length context tree (VCT) $\\mathcal{T}$ via a maximum a posterior (MAP) formulation to estimate symbols' conditional probabilities. MAP prevents us from overfitting given a small training set $\\mathcal{X}$ of past symbol sequences by identifying a VCT $\\mathcal{T}$ that achieves a high likelihood $P(\\mathcal{X}|\\mathcal{T})$ of observing $\\mathcal{X}$ given $\\mathcal{T}$, and a large geometric prior $P(\\mathcal{T})$ stating that image contours are more often straight than curvy. For the lossy case, we design efficient dynamic programming (DP) algorithms that optimally trade off coding rate of an approximate contour $\\hat{\\mathbf{x}}$ given a VCT $\\mathcal{T}$ with two notions of distortion of $\\hat{\\mathbf{x}}$ with respect to the original contour $\\mathbf{x}$. To reduce the size of the DP tables, a total suffix tree is derived from a given VCT $\\mathcal{T}$ for compact table entry indexing, reducing complexity. Experimental results show that for lossless contour coding, our proposed algorithm outperforms state-of-the-art context-based schemes consistently for both small and large training datasets. For lossy contour coding, our algorithms outperform comparable schemes in the literature in rate-distortion performance.
CPIC: A Parallel Particle-In-Cell Code for Studying Spacecraft Charging
NASA Astrophysics Data System (ADS)
Meierbachtol, Collin; Delzanno, Gian Luca; Moulton, David; Vernon, Louis
2015-11-01
CPIC is a three-dimensional electrostatic particle-in-cell code designed for use with curvilinear meshes. One of its primary objectives is to aid in studying spacecraft charging in the magnetosphere. CPIC maintains near-optimal computational performance and scaling thanks to a mapped logical mesh field solver, and a hybrid physical-logical space particle mover (avoiding the need to track particles). CPIC is written for parallel execution, utilizing a combination of both OpenMP threading and MPI distributed memory. New capabilities are being actively developed and added to CPIC, including the ability to handle multi-block curvilinear mesh structures. Verification results comparing CPIC to analytic test problems will be provided. Particular emphasis will be placed on the charging and shielding of a sphere-in-plasma system. Simulated charging results of representative spacecraft geometries will also be presented. Finally, its performance capabilities will be demonstrated through parallel scaling data.
Shared Memory Parallelization of an Implicit ADI-type CFD Code
NASA Technical Reports Server (NTRS)
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; ...
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Somemore » specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000® problems. These benchmark and scaling studies show promising results.« less
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Some specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000^{®} problems. These benchmark and scaling studies show promising results.
NASA Astrophysics Data System (ADS)
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2016-03-01
This work discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package authored at Oak Ridge National Laboratory. Shift has been developed to scale well from laptops to small computing clusters to advanced supercomputers and includes features such as support for multiple geometry and physics engines, hybrid capabilities for variance reduction methods such as the Consistent Adjoint-Driven Importance Sampling methodology, advanced parallel decompositions, and tally methods optimized for scalability on supercomputing architectures. The scaling studies presented in this paper demonstrate good weak and strong scaling behavior for the implemented algorithms. Shift has also been validated and verified against various reactor physics benchmarks, including the Consortium for Advanced Simulation of Light Water Reactors' Virtual Environment for Reactor Analysis criticality test suite and several Westinghouse AP1000® problems presented in this paper. These benchmark results compare well to those from other contemporary Monte Carlo codes such as MCNP5 and KENO.
Multi-Zone Liquid Thrust Chamber Performance Code with Domain Decomposition for Parallel Processing
NASA Technical Reports Server (NTRS)
Navaz, Homayun K.
2002-01-01
-equation turbulence model, and two-phase flow. To overcome these limitations, the LTCP code is rewritten to include the multi-zone capability with domain decomposition that makes it suitable for parallel processing, i.e., enabling the code to run every zone or sub-domain on a separate processor. This can reduce the run time by a factor of 6 to 8, depending on the problem.
A Parallel Two-fluid Code for Global Magnetic Reconnection Studies
J.A. Breslau; S.C. Jardin
2001-08-09
This paper describes a new algorithm for the computation of two-dimensional resistive magnetohydrodynamic (MHD) and two-fluid studies of magnetic reconnection in plasmas. It has been implemented on several parallel platforms and shows good scalability up to 32 CPUs for reasonable problem sizes. A fixed, nonuniform rectangular mesh is used to resolve the different spatial scales in the reconnection problem. The resistive MHD version of the code uses an implicit/explicit hybrid method, while the two-fluid version uses an alternating-direction implicit (ADI) method. The technique has proven useful for comparing several different theories of collisional and collisionless reconnection.
Grid-based Parallel Data Streaming Implemented for the Gyrokinetic Toroidal Code
S. Klasky; S. Ethier; Z. Lin; K. Martins; D. McCune; R. Samtaney
2003-09-15
We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study microturbulence in magnetic confinement fusion from first principles plasma theory.
Coding for Parallel Links to Maximize the Expected Value of Decodable Messages
NASA Technical Reports Server (NTRS)
Klimesh, Matthew A.; Chang, Christopher S.
2011-01-01
When multiple parallel communication links are available, it is useful to consider link-utilization strategies that provide tradeoffs between reliability and throughput. Interesting cases arise when there are three or more available links. Under the model considered, the links have known probabilities of being in working order, and each link has a known capacity. The sender has a number of messages to send to the receiver. Each message has a size and a value (i.e., a worth or priority). Messages may be divided into pieces arbitrarily, and the value of each piece is proportional to its size. The goal is to choose combinations of messages to send on the links so that the expected value of the messages decodable by the receiver is maximized. There are three parts to the innovation: (1) Applying coding to parallel links under the model; (2) Linear programming formulation for finding the optimal combinations of messages to send on the links; and (3) Algorithms for assisting in finding feasible combinations of messages, as support for the linear programming formulation. There are similarities between this innovation and methods developed in the field of network coding. However, network coding has generally been concerned with either maximizing throughput in a fixed network, or robust communication of a fixed volume of data. In contrast, under this model, the throughput is expected to vary depending on the state of the network. Examples of error-correcting codes that are useful under this model but which are not needed under previous models have been found. This model can represent either a one-shot communication attempt, or a stream of communications. Under the one-shot model, message sizes and link capacities are quantities of information (e.g., measured in bits), while under the communications stream model, message sizes and link capacities are information rates (e.g., measured in bits/second). This work has the potential to increase the value of data returned from
ALEGRA -- A massively parallel h-adaptive code for solid dynamics
Summers, R.M.; Wong, M.K.; Boucheron, E.A.; Weatherby, J.R.
1997-12-31
ALEGRA is a multi-material, arbitrary-Lagrangian-Eulerian (ALE) code for solid dynamics designed to run on massively parallel (MP) computers. It combines the features of modern Eulerian shock codes, such as CTH, with modern Lagrangian structural analysis codes using an unstructured grid. ALEGRA is being developed for use on the teraflop supercomputers to conduct advanced three-dimensional (3D) simulations of shock phenomena important to a variety of systems. ALEGRA was designed with the Single Program Multiple Data (SPMD) paradigm, in which the mesh is decomposed into sub-meshes so that each processor gets a single sub-mesh with approximately the same number of elements. Using this approach the authors have been able to produce a single code that can scale from one processor to thousands of processors. A current major effort is to develop efficient, high precision simulation capabilities for ALEGRA, without the computational cost of using a global highly resolved mesh, through flexible, robust h-adaptivity of finite elements. H-adaptivity is the dynamic refinement of the mesh by subdividing elements, thus changing the characteristic element size and reducing numerical error. The authors are working on several major technical challenges that must be met to make effective use of HAMMER on MP computers.
NASA Astrophysics Data System (ADS)
Stepšys, A.; Mickevicius, S.; Germanas, D.; Kalinauskas, R. K.
2014-11-01
This new version of the HOTB program for calculation of the three and four particle harmonic oscillator transformation brackets provides some enhancements and corrections to the earlier version (Germanas et al., 2010) [1]. In particular, new version allows calculations of harmonic oscillator transformation brackets be performed in parallel using MPI parallel communication standard. Moreover, higher precision of intermediate calculations using GNU Quadruple Precision and arbitrary precision library FMLib [2] is done. A package of Fortran code is presented. Calculation time of large matrices can be significantly reduced using effective parallel code. Use of Higher Precision methods in intermediate calculations increases the stability of algorithms and extends the validity of used algorithms for larger input values. Catalogue identifier: AEFQ_v4_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFQ_v4_0.html Program obtainable from: CPC Program Library, Queen’s University of Belfast, N. Ireland Licensing provisions: GNU General Public License, version 3 Number of lines in programs, including test data, etc.: 1711 Number of bytes in distributed programs, including test data, etc.: 11667 Distribution format: tar.gz Program language used: FORTRAN 90 with MPI extensions for parallelism Computer: Any computer with FORTRAN 90 compiler Operating system: Windows, Linux, FreeBSD, True64 Unix Has the code been vectorized of parallelized?: Yes, parallelism using MPI extensions. Number of CPUs used: up to 999 RAM(per CPU core): Depending on allocated binomial and trinomial matrices and use of precision; at least 500 MB Catalogue identifier of previous version: AEFQ_v1_0 Journal reference of previous version: Comput. Phys. Comm. 181, Issue 2, (2010) 420-425 Does the new version supersede the previous version? Yes Nature of problem: Calculation of matrices of three-particle harmonic oscillator brackets (3HOB) and four-particle harmonic oscillator brackets (4HOB) in a more
An ecological and evolutionary perspective on the parallel invasion of two cross-compatible trees.
Besnard, Guillaume; Cuneo, Peter
2016-01-01
Invasive trees are generally seen as ecosystem-transforming plants that can have significant impacts on native vegetation, and often require management and control. Understanding their history and biology is essential to guide actions of land managers. Here, we present a summary of recent research into the ecology, phylogeography and management of invasive olives, which are now established outside of their native range as high ecological impact invasive trees. The parallel invasion of European and African olive in different climatic zones of Australia provides an interesting case study of invasion, characterized by early genetic admixture between domesticated and wild taxa. Today, the impact of the invasive olives on native vegetation and ecosystem function is of conservation concern, with European olive a declared weed in areas of South Australia, and African olive a declared weed in New South Wales and Pacific islands. Population genetics was used to trace the origins and invasion of both subspecies in Australia, indicating that both olive subspecies have hybridized early after introduction. Research also indicates that African olive populations can establish from a low number of founder individuals even after successive bottlenecks. Modelling based on distributional data from the native and invasive range identified a shift of the realized ecological niche in the Australian invasive range for both olive subspecies, which was particularly marked for African olive. As highly successful and long-lived invaders, olives offer further opportunities to understand the genetic basis of invasion, and we propose that future research examines the history of introduction and admixture, the genetic basis of adaptability and the role of biotic interactions during invasion. Advances on these questions will ultimately improve predictions on the future olive expansion and provide a solid basis for better management of invasive populations.
An ecological and evolutionary perspective on the parallel invasion of two cross-compatible trees
Besnard, Guillaume; Cuneo, Peter
2016-01-01
Invasive trees are generally seen as ecosystem-transforming plants that can have significant impacts on native vegetation, and often require management and control. Understanding their history and biology is essential to guide actions of land managers. Here, we present a summary of recent research into the ecology, phylogeography and management of invasive olives, which are now established outside of their native range as high ecological impact invasive trees. The parallel invasion of European and African olive in different climatic zones of Australia provides an interesting case study of invasion, characterized by early genetic admixture between domesticated and wild taxa. Today, the impact of the invasive olives on native vegetation and ecosystem function is of conservation concern, with European olive a declared weed in areas of South Australia, and African olive a declared weed in New South Wales and Pacific islands. Population genetics was used to trace the origins and invasion of both subspecies in Australia, indicating that both olive subspecies have hybridized early after introduction. Research also indicates that African olive populations can establish from a low number of founder individuals even after successive bottlenecks. Modelling based on distributional data from the native and invasive range identified a shift of the realized ecological niche in the Australian invasive range for both olive subspecies, which was particularly marked for African olive. As highly successful and long-lived invaders, olives offer further opportunities to understand the genetic basis of invasion, and we propose that future research examines the history of introduction and admixture, the genetic basis of adaptability and the role of biotic interactions during invasion. Advances on these questions will ultimately improve predictions on the future olive expansion and provide a solid basis for better management of invasive populations. PMID:27519914
NASA Astrophysics Data System (ADS)
Chang, Yang-Lang; Chen, Zhi-Ming; Liu, Jin-Nan; Chang, Lena; Fang, Jyh Perng
2010-08-01
Satellite remote sensing images can be interpreted to provide important information of large-scale natural resources, such as lands, oceans, mountains, rivers, forests and minerals for Earth observations. Recent advances of remote sensing technologies have improved the availability of satellite imagery in a wide range of applications including high dimensional remote sensing data sets (e.g. high spectral and high spatial resolution images). The information of high dimensional remote sensing images obtained by state-of-the-art sensor technologies can be identified more accurately than images acquired by conventional remote sensing techniques. However, due to its large volume of image data, it requires a huge amount of storages and computing time. In response, the computational complexity of data processing for high dimensional remote sensing data analysis will increase. Consequently, this paper proposes a novel classification algorithm based on semi-matroid structure, known as the parallel k-dimensional tree semi-matroid (PKTSM) classification, which adopts a new hybrid parallel approach to deal with high dimensional data sets. It is implemented by combining the message passing interface (MPI) library, the open multi-processing (OpenMP) application programming interface and the compute unified device architecture (CUDA) of graphics processing units (GPU) in a hybrid mode. The effectiveness of the proposed PKTSM is evaluated by using MODIS/ASTER airborne simulator (MASTER) images and airborne synthetic aperture radar (AIRSAR) images for land cover classification during the Pacrim II campaign. The experimental results demonstrated that the proposed hybrid PKTSM can significantly improve the performance in terms of both computational speed-up and classification accuracy.
Context Tree-Based Image Contour Coding Using a Geometric Prior.
Zheng, Amin; Cheung, Gene; Florencio, Dinei
2017-02-01
Efficient encoding of object contours in images can facilitate advanced image/video compression techniques, such as shape-adaptive transform coding or motion prediction of arbitrarily shaped pixel blocks. We study the problem of lossless and lossy compression of detected contours in images. Specifically, we first convert a detected object contour into a sequence of directional symbols drawn from a small alphabet. To encode the symbol sequence using arithmetic coding, we compute an optimal variable-length context tree (VCT) T via a maximum a posterior (MAP) formulation to estimate symbols' conditional probabilities. MAP can avoid overfitting given a small training set X of past symbol sequences by identifying a VCT T with high likelihood P(X|T) of observing X given T , using a geometric prior P(T) stating that image contours are more often straight than curvy. For the lossy case, we design fast dynamic programming (DP) algorithms that optimally trade off coding rate of an approximate contour [Formula: see text] given a VCT T with two notions of distortion of [Formula: see text] with respect to the original contour x. To reduce the size of the DP tables, a total suffix tree is derived from a given VCT T for compact table entry indexing, reducing complexity. Experimental results show that for lossless contour coding, our proposed algorithm outperforms state-of-the-art context-based schemes consistently for both small and large training datasets. For lossy contour coding, our algorithms outperform comparable schemes in the literature in rate-distortion performance.
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro
2016-08-01
We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 107) to 300 ms (N = 109). These are currently limited by the time for the calculation of the domain decomposition and communication
NASA Astrophysics Data System (ADS)
Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.
2015-12-01
The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported in order to bring this package to the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, where tasks are compute- and memory-intensive applications. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are added decreasing as much as possible the elapsed time of execution of the studied routines. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes.Four case studies are presented: experimental variogram calculation, kriging estimation, sequential gaussian and indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU-cores and a multi-node server with 128 CPU-nodes. Elapsed times, speedup and efficiency results are shown.
NASA Astrophysics Data System (ADS)
Karl, Simon J.; Aarseth, Sverre J.; Naab, Thorsten; Haehnelt, Martin G.; Spurzem, Rainer
2015-09-01
We present a hybrid code combining the OpenMP-parallel tree code VINE with an algorithmic chain regularization scheme. The new code, called `rVINE', aims to significantly improve the accuracy of close encounters of massive bodies with supermassive black holes (SMBHs) in galaxy-scale numerical simulations. We demonstrate the capabilities of the code by studying two test problems, the sinking of a single massive black hole to the centre of a gas-free galaxy due to dynamical friction and the hardening of an SMBH binary due to close stellar encounters. We show that results obtained with rVINE compare well with NBODY7 for problems with particle numbers that can be simulated with NBODY7. In particular, in both NBODY7 and rVINE we find a clear N-dependence of the binary hardening rate, a low binary eccentricity and moderate eccentricity evolution, as well as the conversion of the galaxy's inner density profile from a cusp to a core via the ejection of stars at high velocity. The much larger number of particles that can be handled by rVINE will open up exciting opportunities to model stellar dynamics close to SMBHs much more accurately in a realistic galactic context. This will help to remedy the inherent limitations of commonly used tree solvers to follow the correct dynamical evolution of black holes in galaxy-scale simulations.
Fortran code for SU(3) lattice gauge theory with and without MPI checkerboard parallelization
NASA Astrophysics Data System (ADS)
Berg, Bernd A.; Wu, Hao
2012-10-01
We document plain Fortran and Fortran MPI checkerboard code for Markov chain Monte Carlo simulations of pure SU(3) lattice gauge theory with the Wilson action in D dimensions. The Fortran code uses periodic boundary conditions and is suitable for pedagogical purposes and small scale simulations. For the Fortran MPI code two geometries are covered: the usual torus with periodic boundary conditions and the double-layered torus as defined in the paper. Parallel computing is performed on checkerboards of sublattices, which partition the full lattice in one, two, and so on, up to D directions (depending on the parameters set). For updating, the Cabibbo-Marinari heatbath algorithm is used. We present validations and test runs of the code. Performance is reported for a number of currently used Fortran compilers and, when applicable, MPI versions. For the parallelized code, performance is studied as a function of the number of processors. Program summary Program title: STMC2LSU3MPI Catalogue identifier: AEMJ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 26666 No. of bytes in distributed program, including test data, etc.: 233126 Distribution format: tar.gz Programming language: Fortran 77 compatible with the use of Fortran 90/95 compilers, in part with MPI extensions. Computer: Any capable of compiling and executing Fortran 77 or Fortran 90/95, when needed with MPI extensions. Operating system: Red Hat Enterprise Linux Server 6.1 with OpenMPI + pgf77 11.8-0, Centos 5.3 with OpenMPI + gfortran 4.1.2, Cray XT4 with MPICH2 + pgf90 11.2-0. Has the code been vectorised or parallelized?: Yes, parallelized using MPI extensions. Number of processors used: 2 to 11664 RAM: 200 Mega bytes per process. Classification: 11
NASA Astrophysics Data System (ADS)
Sosedkin, A. P.; Lotov, K. V.
2016-09-01
LCODE is a freely distributed quasistatic 2D3V code for simulating plasma wakefield acceleration, mainly specialized at resource-efficient studies of long-term propagation of ultrarelativistic particle beams in plasmas. The beam is modeled with fully relativistic macro-particles in a simulation window copropagating with the light velocity; the plasma can be simulated with either kinetic or fluid model. Several techniques are used to obtain exceptional numerical stability and precision while maintaining high resource efficiency, enabling LCODE to simulate the evolution of long particle beams over long propagation distances even on a laptop. A recent upgrade enabled LCODE to perform the calculations in parallel. A pipeline of several LCODE processes communicating via MPI (Message-Passing Interface) is capable of executing multiple consecutive time steps of the simulation in a single pass. This approach can speed up the calculations by hundreds of times.
An object-oriented implementation of a parallel Monte Carlo code for radiation transport
NASA Astrophysics Data System (ADS)
Santos, Pedro Duarte; Lani, Andrea
2016-05-01
This paper describes the main features of a state-of-the-art Monte Carlo solver for radiation transport which has been implemented within COOLFluiD, a world-class open source object-oriented platform for scientific simulations. The Monte Carlo code makes use of efficient ray tracing algorithms (for 2D, axisymmetric and 3D arbitrary unstructured meshes) which are described in detail. The solver accuracy is first verified in testcases for which analytical solutions are available, then validated for a space re-entry flight experiment (i.e. FIRE II) for which comparisons against both experiments and reference numerical solutions are provided. Through the flexible design of the physical models, ray tracing and parallelization strategy (fully reusing the mesh decomposition inherited by the fluid simulator), the implementation was made efficient and reusable.
Kostin, Mikhail; Mokhov, Nikolai; Niita, Koji
2013-09-25
A parallel computing framework has been developed to use with general-purpose radiation transport codes. The framework was implemented as a C++ module that uses MPI for message passing. It is intended to be used with older radiation transport codes implemented in Fortran77, Fortran 90 or C. The module is significantly independent of radiation transport codes it can be used with, and is connected to the codes by means of a number of interface functions. The framework was developed and tested in conjunction with the MARS15 code. It is possible to use it with other codes such as PHITS, FLUKA and MCNP after certain adjustments. Besides the parallel computing functionality, the framework offers a checkpoint facility that allows restarting calculations with a saved checkpoint file. The checkpoint facility can be used in single process calculations as well as in the parallel regime. The framework corrects some of the known problems with the scheduling and load balancing found in the original implementations of the parallel computing functionality in MARS15 and PHITS. The framework can be used efficiently on homogeneous systems and networks of workstations, where the interference from the other users is possible.
A massively parallel method of characteristic neutral particle transport code for GPUs
Boyd, W. R.; Smith, K.; Forget, B.
2013-07-01
Over the past 20 years, parallel computing has enabled computers to grow ever larger and more powerful while scientific applications have advanced in sophistication and resolution. This trend is being challenged, however, as the power consumption for conventional parallel computing architectures has risen to unsustainable levels and memory limitations have come to dominate compute performance. Heterogeneous computing platforms, such as Graphics Processing Units (GPUs), are an increasingly popular paradigm for solving these issues. This paper explores the applicability of GPUs for deterministic neutron transport. A 2D method of characteristics (MOC) code - OpenMOC - has been developed with solvers for both shared memory multi-core platforms as well as GPUs. The multi-threading and memory locality methodologies for the GPU solver are presented. Performance results for the 2D C5G7 benchmark demonstrate 25-35 x speedup for MOC on the GPU. The lessons learned from this case study will provide the basis for further exploration of MOC on GPUs as well as design decisions for hardware vendors exploring technologies for the next generation of machines for scientific computing. (authors)
On distributed memory MPI-based parallelization of SPH codes in massive HPC context
NASA Astrophysics Data System (ADS)
Oger, G.; Le Touzé, D.; Guibert, D.; de Leffe, M.; Biddiscombe, J.; Soumagne, J.; Piccinali, J.-G.
2016-03-01
Most of particle methods share the problem of high computational cost and in order to satisfy the demands of solvers, currently available hardware technologies must be fully exploited. Two complementary technologies are now accessible. On the one hand, CPUs which can be structured into a multi-node framework, allowing massive data exchanges through a high speed network. In this case, each node is usually comprised of several cores available to perform multithreaded computations. On the other hand, GPUs which are derived from the graphics computing technologies, able to perform highly multi-threaded calculations with hundreds of independent threads connected together through a common shared memory. This paper is primarily dedicated to the distributed memory parallelization of particle methods, targeting several thousands of CPU cores. The experience gained clearly shows that parallelizing a particle-based code on moderate numbers of cores can easily lead to an acceptable scalability, whilst a scalable speedup on thousands of cores is much more difficult to obtain. The discussion revolves around speeding up particle methods as a whole, in a massive HPC context by making use of the MPI library. We focus on one particular particle method which is Smoothed Particle Hydrodynamics (SPH), one of the most widespread today in the literature as well as in engineering.
MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions
NASA Astrophysics Data System (ADS)
Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.
2003-10-01
The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas require nonlinear kinetic models and massive parallelization. We use Massage Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov Poisson system of equations on an 8 node dual processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involves global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)
OASIS4: An Efficient Parallel Code Coupler for Earth System Modelling
NASA Astrophysics Data System (ADS)
Coquart, L.; Valcke, S.; Redler, R.; Ritzdorf, H.
2009-04-01
As a new development step of the OASIS coupler family, we present OASIS4 in its latest version. OASIS4 is a software allowing synchronized exchanges of coupling information between numerical codes representing different components of the climate system. The concepts of portability, flexibility, parallelism and efficiency are the main drivers for the OASIS4 development with which we target the needs of Earth system modelling in its full complexity. The development and maintenance of OASIS4 has been supported by EU and institutional funding within the PRISM Support Initiative for the past seven years. Here we present the latest version of the OASIS4 coupling software which now includes the commonly known point based 2d and 3d interpolation schemes (bilinear, trilinear, bicubic, nearest neighbour), and 2D conservative remapping. Furthermore, the new version of the software now provides a complete parallel search taking into account specific requirements at process boundaries in order to provide identical search results independently of the domain partitioning. The parallel "multi-grid" search ensures low CPU cost to perform the task of the neighbourhood search and at the same time showing a good scalability when applied to grid partitioned domains. OASIS4 is currently used in few climate applications such as in the FP6 European GEMS project for the 3D coupling between atmosphere and atmosphere chemistry, by the Swedish Meteorological and Hydrological Institute (SMHI) for regional models covering the Arctic Sea or the Baltic area, and by the Calcul Intensif pour le CLimat et l'Environment (CICLE) project funded by the French "Agence Nationale de la Recherche".
Implementation of a tree-code for numerical simulations of stellar systems
NASA Astrophysics Data System (ADS)
Marinho, Eraldo Pereira
1991-10-01
An implementation of a tree code for the force calculation in gravitational N-body systems simulations is presented. The technique consists of virtualizing the entire system in a tree data-structure, which reduces the computational effort to theta(N log N) instead of the theta(N exp 2), typical of direct summation. The adopted time integrator is the simple leap-frog with second-order accuracy. A brief discussion about the truncation-error effects on the morphology of the system shows them to be essentially negligible. However, these errors do propagate in a Markovian way if a potential-adaptive time-step is used in order to maintain the expected truncation-error approximately constant in the entire system. The tests show that, even with totally arbitrary distributions, the total computation time obeys theta(N log N). As an application of the code, we evolved an initially cold and homogeneous sphere of point masses to simulate a primordial process of galaxy formation. The evolution of the global entropy of the system suggests that a quasi-equilibrium configuration is achieved after approximately 2 x 10 exp 9 years. It is shown that the final configuration displays a close resemblance to the well observed giant elliptical galaxies, in both kinematical and luminosity distribution properties. A discussion is given on the evolution of the important dynamic quantities characterizing the model. During all the computations, the energy is conserved to better than 0.1 percent.
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A.
2016-01-01
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for the real world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantages of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications. PMID:27515908
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding.
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A
2016-08-12
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for the real world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantages of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications.
Hybrid threshold adaptable quantum secret sharing scheme with reverse Huffman-Fibonacci-tree coding
NASA Astrophysics Data System (ADS)
Lai, Hong; Zhang, Jun; Luo, Ming-Xing; Pan, Lei; Pieprzyk, Josef; Xiao, Fuyuan; Orgun, Mehmet A.
2016-08-01
With prevalent attacks in communication, sharing a secret between communicating parties is an ongoing challenge. Moreover, it is important to integrate quantum solutions with classical secret sharing schemes with low computational cost for the real world use. This paper proposes a novel hybrid threshold adaptable quantum secret sharing scheme, using an m-bonacci orbital angular momentum (OAM) pump, Lagrange interpolation polynomials, and reverse Huffman-Fibonacci-tree coding. To be exact, we employ entangled states prepared by m-bonacci sequences to detect eavesdropping. Meanwhile, we encode m-bonacci sequences in Lagrange interpolation polynomials to generate the shares of a secret with reverse Huffman-Fibonacci-tree coding. The advantages of the proposed scheme is that it can detect eavesdropping without joint quantum operations, and permits secret sharing for an arbitrary but no less than threshold-value number of classical participants with much lower bandwidth. Also, in comparison with existing quantum secret sharing schemes, it still works when there are dynamic changes, such as the unavailability of some quantum channel, the arrival of new participants and the departure of participants. Finally, we provide security analysis of the new hybrid quantum secret sharing scheme and discuss its useful features for modern applications.
NeuCode labels with parallel reaction monitoring for multiplexed, absolute protein quantification
Potts, Gregory K.; Voigt, Emily A.; Bailey, Derek J.; Westphall, Michael S.; Hebert, Alexander S.; Yin, John; Coon, Joshua J.
2016-01-01
We introduce a new method to multiplex the throughput of samples for targeted mass spectrometry analysis. The current paradigm for obtaining absolute quantification from biological samples requires spiking isotopically heavy peptide standards into light biological lysates. Because each lysate must be run individually, this method places limitations on sample throughput and high demands on instrument time. When cell lines are first metabolically labeled with various neutron-encoded (NeuCode) lysine isotopologues possessing mDa mass differences from each other, heavy cell lysates may be mixed and spiked with an additional heavy peptide as an internal standard. We demonstrate that these NeuCode lysate peptides may be co-isolated with their internal standards, fragmented, and analyzed together using high resolving power parallel reaction monitoring (PRM). Instead of running each sample individually, these methods allow samples to be multiplexed to obtain absolute concentrations of target peptides in 5, 15, and even 25 biological samples at a time during single mass spectrometry experiments. PMID:26882330
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations using spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters capable of computing at the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To further optimize, we investigate three different algorithms of the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach , we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. There after, the partitioned works are simultaneously computed in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
Brock, R Scott; Hu, Xin-Hua; Yang, Ping; Lu, Jun
2005-07-11
A parallel Finite-Difference-Time-Domain (FDTD) code has been developed to numerically model the elastic light scattering by biological cells. Extensive validation and evaluation on various computing clusters demonstrated the high performance of the parallel code and its significant potential of reducing the computational cost of the FDTD method with low cost computer clusters. The parallel FDTD code has been used to study the problem of light scattering by a human red blood cell (RBC) of a deformed shape in terms of the angular distributions of the Mueller matrix elements. The dependence of the Mueller matrix elements on the shape and orientation of the deformed RBC has been investigated. Analysis of these data provides valuable insight on determination of the RBC shapes using the method of elastic light scattering measurements.
Parallel Adaptive Mesh Refinement Library
NASA Technical Reports Server (NTRS)
Mac-Neice, Peter; Olson, Kevin
2005-01-01
Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.
Seeing the forest for the trees: Networked workstations as a parallel processing computer
NASA Technical Reports Server (NTRS)
Breen, J. O.; Meleedy, D. M.
1992-01-01
Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby, performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system can not rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Stankovski, Z.
1995-12-31
The collision probability method in neutron transport, as applied to 2D geometries, consume a great amount of computer time, for a typical 2D assembly calculation about 90% of the computing time is consumed in the collision probability evaluations. Consequently RZ or 3D calculations became prohibitive. In this paper the author presents a simple but efficient parallel algorithm based on the message passing host/node programmation model. Parallelization was applied to the energy group treatment. Such approach permits parallelization of the existing code, requiring only limited modifications. Sequential/parallel computer portability is preserved, which is a necessary condition for a industrial code. Sequential performances are also preserved. The algorithm is implemented on a CRAY 90 coupled to a 128 processor T3D computer, a 16 processor IBM SPI and a network of workstations, using the Public Domain PVM library. The tests were executed for a 2D geometry with the standard 99-group library. All results were very satisfactory, the best ones with IBM SPI. Because of heterogeneity of the workstation network, the author did not ask high performances for this architecture. The same source code was used for all computers. A more impressive advantage of this algorithm will appear in the calculations of the SAPHYR project (with the future fine multigroup library of about 8000 groups) with a massively parallel computer, using several hundreds of processors.
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
SPILADY: A parallel CPU and GPU code for spin-lattice magnetic molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Ma, Pui-Wai; Dudarev, S. L.; Woo, C. H.
2016-10-01
Spin-lattice dynamics generalizes molecular dynamics to magnetic materials, where dynamic variables describing an evolving atomic system include not only coordinates and velocities of atoms but also directions and magnitudes of atomic magnetic moments (spins). Spin-lattice dynamics simulates the collective time evolution of spins and atoms, taking into account the effect of non-collinear magnetism on interatomic forces. Applications of the method include atomistic models for defects, dislocations and surfaces in magnetic materials, thermally activated diffusion of defects, magnetic phase transitions, and various magnetic and lattice relaxation phenomena. Spin-lattice dynamics retains all the capabilities of molecular dynamics, adding to them the treatment of non-collinear magnetic degrees of freedom. The spin-lattice dynamics time integration algorithm uses symplectic Suzuki-Trotter decomposition of atomic coordinate, velocity and spin evolution operators, and delivers highly accurate numerical solutions of dynamic evolution equations over extended intervals of time. The code is parallelized in coordinate and spin spaces, and is written in OpenMP C/C++ for CPU and in CUDA C/C++ for Nvidia GPU implementations. Temperatures of atoms and spins are controlled by Langevin thermostats. Conduction electrons are treated by coupling the discrete spin-lattice dynamics equations for atoms and spins to the heat transfer equation for the electrons. Worked examples include simulations of thermalization of ferromagnetic bcc iron, the dynamics of laser pulse demagnetization, and collision cascades.
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
Fast Parallel Tree Codes for Gravitational and Fluid Dynamical N-Body Problems
1993-01-01
Chorin. Vortex models and boundary layer insta- 399:L109, 1992. bility. SIAMJ. Sci. Stat. Comput., 1(l):1 -21. 1980. [25] P. Koumoutsakos . Direct...turbulence. Comm. Pure Appl. thesis, California Institute of Technology, 1993. Math., 34:853-866, 1981. [26] P. Koumoutsakos and A. Leonard. Direct
Impact of parallel heterogeneity on a continuum model of the pulmonary arterial tree.
Krenz, G S; Lin, J; Dawson, C A; Linehan, J H
1994-08-01
Model arterial trees were constructed following rules consistent with morphometric data, Nj = (Dj/Da)-beta 1 and Lj = La(Dj/Da)beta 2, where Nj, Dj, and Lj are number, diameter, and length, respectively, of vessels in the jth level; Da and La are diameter and length, respectively, of the inlet artery, and -beta 1 and beta 2 are power law slopes relating vessel number and length, respectively, to vessel diameter. Simulated heterogeneous trees approximating these rules were constructed by assigning vessel diameters Dm = Da[2/(m + 1)]1/beta 1, such that m-1 vessels were larger than Dm (vessel length proportional to diameter). Vessels were connected, forming random bifurcating trees. Longitudinal intravascular pressure [P(Qcum)] with respect to cumulative vascular volume [Qcum] was computed for Poiseuille flow. Strahler-ordered tree morphometry yielded estimates of La, Da, beta 1, beta 2, and mean number ratio (B); B is defined by Nk + 1 = Bk, where k is total number of Strahler orders minus Strahler order number. The parameters were used in P(Qcum) = Pa [formula: see text] and the resulting P(Qcum) relationship was compared with that of the simulated tree, where Pa is total arterial pressure drop, Q is flow rate, Ra = (128 microLa)/(pi D4a (where mu is blood viscosity), and Qa (volume of inlet artery) = 1/4D2a pi La. Results indicate that the equation, originally developed for homogeneous trees (J. Appl. Physiol. 72: 2225-2237, 1992), provides a good approximation to the heterogeneous tree P(Qcum).
Procassini, R.J.
1997-12-31
The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution of particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.
CMAD: A Self-consistent Parallel Code to Simulate the Electron Cloud Build-up and Instabilities
Pivi, M.T.F.; /SLAC
2007-11-07
We present the features of CMAD, a newly developed self-consistent code which simulates both the electron cloud build-up and related beam instabilities. By means of parallel (Message Passing Interface - MPI) computation, the code tracks the beam in an existing (MAD-type) lattice and continuously resolves the interaction between the beam and the cloud at each element location, with different cloud distributions at each magnet location. The goal of CMAD is to simulate single- and coupled-bunch instability, allowing tune shift, dynamic aperture and frequency map analysis and the determination of the secondary electron yield instability threshold. The code is in its phase of development and benchmarking with existing codes. Preliminary results on benchmarking are presented in this paper.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P
Candel, A; Kabel, A.; Lee, L.; Li, Z.; Ng, C.; Schussman, G.; Ko, K.; Syratchev, I.; /CERN
2009-06-19
In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
NASA Astrophysics Data System (ADS)
Prashantha Kumar, H.; Sripati, U.; Shetty, K. Rajesh
2012-05-01
In this article, we propose a high-speed decoding algorithm for binary BCH codes that can correct up to 7 bits in error. Evaluation of the error-locator polynomial is the most complicated and time-consuming step in the decoding of a BCH code. We have derived equations for specifying the coefficients of the error-locator polynomial, which can form the basis for the development of a parallel architecture for the decoder. This approach has the advantage that all the coefficients of the error locator polynomial are computed in parallel (in one step). The roots of error-locator polynomial can be obtained by Chien's search and inverting these roots gives the error locations. This algorithm can be employed in any application where high-speed decoding of data encoded by a binary BCH code is required. One important application is in Flash memories where data integrity is preserved using a long, high-rate binary BCH code. We have synthesized generator polynomials for binary BCH codes (error-correcting capability, s ? ) that can be employed in Flash memory devices to improve the integrity of information storage. The proposed decoding algorithm can be used as an efficient, high-speed decoder in this important application.
Canbay, Ferhat; Levent, Vecdi Emre; Serbes, Gorkem; Ugurdag, H Fatih; Goren, Sezer; Aydin, Nizamettin
2016-09-01
The authors aimed to develop an application for producing different architectures to implement dual tree complex wavelet transform (DTCWT) having near shift-invariance property. To obtain a low-cost and portable solution for implementing the DTCWT in multi-channel real-time applications, various embedded-system approaches are realised. For comparison, the DTCWT was implemented in C language on a personal computer and on a PIC microcontroller. However, in the former approach portability and in the latter desired speed performance properties cannot be achieved. Hence, implementation of the DTCWT on a reconfigurable platform such as field programmable gate array, which provides portable, low-cost, low-power, and high-performance computing, is considered as the most feasible solution. At first, they used the system generator DSP design tool of Xilinx for algorithm design. However, the design implemented by using such tools is not optimised in terms of area and power. To overcome all these drawbacks mentioned above, they implemented the DTCWT algorithm by using Verilog Hardware Description Language, which has its own difficulties. To overcome these difficulties, simplify the usage of proposed algorithms and the adaptation procedures, a code generator program that can produce different architectures is proposed.
ERIC Educational Resources Information Center
Al-Khaja, Nawal
2007-01-01
This is a thematic lesson plan for young learners about palm trees and the importance of taking care of them. The two part lesson teaches listening, reading and speaking skills. The lesson includes parts of a tree; the modal auxiliary, can; dialogues and a role play activity.
Parallel Subspace Subcodes of Reed-Solomon Codes for Magnetic Recording Channels
ERIC Educational Resources Information Center
Wang, Han
2010-01-01
Read channel architectures based on a single low-density parity-check (LDPC) code are being considered for the next generation of hard disk drives. However, LDPC-only solutions suffer from the error floor problem, which may compromise reliability, if not handled properly. Concatenated architectures using an LDPC code plus a Reed-Solomon (RS) code…
A Multiple Sphere T-Matrix Fortran Code for Use on Parallel Computer Clusters
NASA Technical Reports Server (NTRS)
Mackowski, D. W.; Mishchenko, M. I.
2011-01-01
A general-purpose Fortran-90 code for calculation of the electromagnetic scattering and absorption properties of multiple sphere clusters is described. The code can calculate the efficiency factors and scattering matrix elements of the cluster for either fixed or random orientation with respect to the incident beam and for plane wave or localized- approximation Gaussian incident fields. In addition, the code can calculate maps of the electric field both interior and exterior to the spheres.The code is written with message passing interface instructions to enable the use on distributed memory compute clusters, and for such platforms the code can make feasible the calculation of absorption, scattering, and general EM characteristics of systems containing several thousand spheres.
Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; Sato, Mitsuhisa; Tang, William; Wang, Bei
2016-06-01
Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
Tsugane, Keisuke; Boku, Taisuke; Murai, Hitoshi; ...
2016-06-01
Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensionalmore » gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. Finally, the performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.« less
NASA Astrophysics Data System (ADS)
Marx, Alain; Lütjens, Hinrich
2017-03-01
A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.
Liu, Shifeng; Zhu, Dan; Wei, Zhengwu; Pan, Shilong
2014-07-01
A photonic approach for the generation of a widely tunable arbitrarily phase-coded microwave signal based on a dual-parallel polarization modulator (DP-PolM) is proposed and demonstrated without using any optical or electrical filter. Two orthogonally polarized ± first-order optical sidebands with suppressed carrier are generated based on the DP-PolM, and their polarization directions are aligned with the two principal axes of the following PolM. Phase coding is implemented at a following PolM driven by an electrical coding signal. The inherent frequency-doubling operation can make the system work at a frequency beyond the operation bandwidth of the DP-PolM and the 90° hybrid. Because no optical or electrical filter is applied, good frequency tunability is realized. An experiment is performed. The generation of phase-coded signals tuning from 10 to 40 GHz with up to 10 Gbit/s coding rates is verified.
2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation
Warren, Michael S.
2014-01-01
We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k (2 18 ) processors. We present error analysis and scientific application results from a series of more than ten 69 billion (4096 3 ) particle cosmological simulations, accounting for 4×10 20 floating point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracymore » and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods.« less
NASA Astrophysics Data System (ADS)
Cattania, C.; Khalid, F.
2016-09-01
The estimation of space and time-dependent earthquake probabilities, including aftershock sequences, has received increased attention in recent years, and Operational Earthquake Forecasting systems are currently being implemented in various countries. Physics based earthquake forecasting models compute time dependent earthquake rates based on Coulomb stress changes, coupled with seismicity evolution laws derived from rate-state friction. While early implementations of such models typically performed poorly compared to statistical models, recent studies indicate that significant performance improvements can be achieved by considering the spatial heterogeneity of the stress field and secondary sources of stress. However, the major drawback of these methods is a rapid increase in computational costs. Here we present a code to calculate seismicity induced by time dependent stress changes. An important feature of the code is the possibility to include aleatoric uncertainties due to the existence of multiple receiver faults and to the finite grid size, as well as epistemic uncertainties due to the choice of input slip model. To compensate for the growth in computational requirements, we have parallelized the code for shared memory systems (using OpenMP) and distributed memory systems (using MPI). Performance tests indicate that these parallelization strategies lead to a significant speedup for problems with different degrees of complexity, ranging from those which can be solved on standard multicore desktop computers, to those requiring a small cluster, to a large simulation that can be run using up to 1500 cores.
NASA Astrophysics Data System (ADS)
Reuter, K.; Jenko, F.; Forest, C. B.; Bayliss, R. A.
2008-08-01
A parallel implementation of a nonlinear pseudo-spectral MHD code for the simulation of turbulent dynamos in spherical geometry is reported. It employs a dual domain decomposition technique in both real and spectral space. It is shown that this method shows nearly ideal scaling going up to 128 CPUs on Beowulf-type clusters with fast interconnect. Furthermore, the potential of exploiting single precision arithmetic on standard x86 processors is examined. It is pointed out that the MHD code thereby achieves a maximum speedup of 1.7, whereas the validity of the computations is still granted. The combination of both measures will allow for the direct numerical simulation of highly turbulent cases ( 1500
Lima, Thálitta Hetamaro Ayala; Buttura, Renato Vidal; Donadi, Eduardo Antônio; Veiga-Castelli, Luciana Caricati; Mendes-Junior, Celso Teixeira; Castelli, Erick C
2016-10-01
Human Leucocyte Antigen F (HLA-F) is a non-classical HLA class I gene distinguished from its classical counterparts by low allelic polymorphism and distinctive expression patterns. Its exact function remains unknown. It is believed that HLA-F has tolerogenic and immune modulatory properties. Currently, there is little information regarding the HLA-F allelic variation among human populations and the available studies have evaluated only a fraction of the HLA-F gene segment and/or have searched for known alleles only. Here we present a strategy to evaluate the complete HLA-F variability including its 5' upstream, coding and 3' downstream segments by using massively parallel sequencing procedures. HLA-F variability was surveyed on 196 individuals from the Brazilian Southeast. The results indicate that the HLA-F gene is indeed conserved at the protein level, where thirty coding haplotypes or coding alleles were detected, encoding only four different HLA-F full-length protein molecules. Moreover, a same protein molecule is encoded by 82.45% of all coding alleles detected in this Brazilian population sample. However, the HLA-F nucleotide and haplotype variability is much higher than our current knowledge both in Brazilians and considering the 1000 Genomes Project data. This protein conservation is probably a consequence of the key role of HLA-F in the immune system physiology.
Seedling establishment in a masting desert shrub parallels the pattern for forest trees
NASA Astrophysics Data System (ADS)
Meyer, Susan E.; Pendleton, Burton K.
2015-05-01
The masting phenomenon along with its accompanying suite of seedling adaptive traits has been well studied in forest trees but has rarely been examined in desert shrubs. Blackbrush (Coleogyne ramosissima) is a regionally dominant North American desert shrub whose seeds are produced in mast events and scatter-hoarded by rodents. We followed the fate of seedlings in intact stands vs. small-scale disturbances at four contrasting sites for nine growing seasons following emergence after a mast year. The primary cause of first-year mortality was post-emergence cache excavation and seedling predation, with contrasting impacts at sites with different heteromyid rodent seed predators. Long-term establishment patterns were strongly affected by rodent activity in the weeks following emergence. Survivorship curves generally showed decreased mortality risk with age but differed among sites even after the first year. There were no detectable effects of inter-annual precipitation variability or site climatic differences on survival. Intraspecific competition from conspecific adults had strong impacts on survival and growth, both of which were higher on small-scale disturbances, but similar in openings and under shrub crowns in intact stands. This suggests that adult plants preempted soil resources in the interspaces. Aside from effects on seedling predation, there was little evidence for facilitation or interference beneath adult plant crowns. Plants in intact stands were still small and clearly juvenile after nine years, showing that blackbrush forms cohorts of suppressed plants similar to the seedling banks of closed forests. Seedling banks function in the absence of a persistent seed bank in replacement after adult plant death (gap formation), which is temporally uncoupled from masting and associated recruitment events. This study demonstrates that the seedling establishment syndrome associated with masting has evolved in desert shrublands as well as in forests.
Kauer, J S
1991-02-01
Odor information appears to be encoded by activity distributed across many neurons at each level in the olfactory pathway. Thus olfactory circuits function as parallel distributed processors. New methods for observing distributed activity in such systems permit computer simulations to be constructed that are constrained by patterns of activity observed in the real system. Analysis of the system using a combination of physiological measurements and computational approaches might elucidate the principles by which odors are discriminated.
A user`s guide for BREAKUP: A computer code for parallelizing the overset grid approach
Barnette, D.W.
1998-04-01
In this user`s guide, details for running BREAKUP are discussed. BREAKUP allows the widely used overset grid method to be run in a parallel computer environment to achieve faster run times for computational field simulations over complex geometries. The overset grid method permits complex geometries to be divided into separate components. Each component is then gridded independently. The grids are computationally rejoined in a solver via interpolation coefficients used for grid-to-grid communications of boundary data. Overset grids have been in widespread use for many years on serial computers, and several well-known Navier-Stokes flow solvers have been extensively developed and validated to support their use. One drawback of serial overset grid methods has been the extensive compute time required to update flow solutions one grid at a time. Parallelizing the overset grid method overcomes this limitation by updating each grid or subgrid simultaneously. BREAKUP prepares overset grids for parallel processing by subdividing each overset grid into statically load-balanced subgrids. Two-dimensional examples with sample solutions, and three-dimensional examples, are presented.
NASA Astrophysics Data System (ADS)
Epstein, Henri
2016-11-01
An algebraic formalism, developed with V. Glaser and R. Stora for the study of the generalized retarded functions of quantum field theory, is used to prove a factorization theorem which provides a complete description of the generalized retarded functions associated with any tree graph. Integrating over the variables associated to internal vertices to obtain the perturbative generalized retarded functions for interacting fields arising from such graphs is shown to be possible for a large category of space-times.
NASA Astrophysics Data System (ADS)
Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark
2014-10-01
Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
Zhang, S.; Yuen, D.A.; Zhu, A.; Song, S.; George, D.L.
2011-01-01
We parallelized the GeoClaw code on one-level grid using OpenMP in March, 2011 to meet the urgent need of simulating tsunami waves at near-shore from Tohoku 2011 and achieved over 75% of the potential speed-up on an eight core Dell Precision T7500 workstation [1]. After submitting that work to SC11 - the International Conference for High Performance Computing, we obtained an unreleased OpenMP version of GeoClaw from David George, who developed the GeoClaw code as part of his PH.D thesis. In this paper, we will show the complementary characteristics of the two approaches used in parallelizing GeoClaw and the speed-up obtained by combining the advantage of each of the two individual approaches with adaptive mesh refinement (AMR), demonstrating the capabilities of running GeoClaw efficiently on many-core systems. We will also show a novel simulation of the Tohoku 2011 Tsunami waves inundating the Sendai airport and Fukushima Nuclear Power Plants, over which the finest grid distance of 20 meters is achieved through a 4-level AMR. This simulation yields quite good predictions about the wave-heights and travel time of the tsunami waves. ?? 2011 IEEE.
1991-12-01
1,k) 2p(i+’/2,j+ /,k) 1+++ At •1(22) B(i +l/’j +/2’k)58 1+ 0.m(i +,j +/2,k) 2p(i+ 1A ,j+1/,k) ,[ ZEn(i+/2J+1’k)-E(i ’ ij k] where 5 is the lattice...1991. 19. Tipler , Paul A. Physics. New York: Worth Publishers Inc., 1976. 20. Work, Paul Rt and Gary B. Lamont. "Efficient Parallelization of Serial
NASA Astrophysics Data System (ADS)
Lu, C.; Lichtner, P. C.; Tsimpanogiannis, I. N.
2005-12-01
Uncontrolled release of CO2 to the atmosphere has been identified as a major contributing source to the global warming problem. Significant research efforts from the international scientific community are targeted towards stabilization/reduction of CO2 concentrations in the atmosphere while attempting to satisfy our continuously increasing needs for energy. CO2 sequestration (capture, separation, and long term storage) in various media (e.g. geologic such as depleted oil reservoirs, saline aquifers, etc.; oceanic at different depths) has been considered as a possible solution to reduce green house gas emissions. In this study we utilize the PFLOTRAN simulator to investigate geologic sequestration of CO2. PFLOTRAN is a massively parallel 3-D reservoir simulator for modeling supercritical CO2 sequestration in geologic formations based on continuum scale mass and energy conservations. The mass and energy equations are sequentially coupled to reactive transport equations describing multi-component chemical reactions within the formation including aqueous speciation, and precipitation and dissolution of minerals to describe aqueous and mineral CO2 sequestration. The effect of the injected CO2 on pH, CO2 concentration within the aqueous phase, mineral stability, and other factors can be evaluated with this model. Parallelization is carried out using the PETSc parallel library package based on MPI providing a high parallel efficiency and allowing simulations with several tens of millions of degrees of freedom to be carried out-ideal for large-scale field applications involving multi-component chemistry. In this work, our main focus is a parametrical examination on the effects of reservoir and fluid properties on the sequestration process, such as permeability and capillary pressure functions (e.g. linear, van Genuchten, etc.), diffusion coefficients in a multiphase system, the sensitivity of component solubility on pressure, temperature and mole fractions etc. Several
F100(3) parallel compressor computer code and user's manual
NASA Technical Reports Server (NTRS)
Mazzawy, R. S.; Fulkerson, D. A.; Haddad, D. E.; Clark, T. A.
1978-01-01
The Pratt & Whitney Aircraft multiple segment parallel compressor model has been modified to include the influence of variable compressor vane geometry on the sensitivity to circumferential flow distortion. Further, performance characteristics of the F100 (3) compression system have been incorporated into the model on a blade row basis. In this modified form, the distortion's circumferential location is referenced relative to the variable vane controlling sensors of the F100 (3) engine so that the proper solution can be obtained regardless of distortion orientation. This feature is particularly important for the analysis of inlet temperature distortion. Compatibility with fixed geometry compressor applications has been maintained in the model.
Li, Shengtai; Li, Hui
2012-06-14
the position of the planet, we adopt the corotating frame that allows the planet moving only in radial direction if only one planet is present. This code has been extensively tested on a number of problems. For the earthmass planet with constant aspect ratio h = 0.05, the torque calculated using our code matches quite well with the the 3D linear theory results by Tanaka et al. (2002). The code is fully parallelized via message-passing interface (MPI) and has very high parallel efficiency. Several numerical examples for both fixed planet and moving planet are provided to demonstrate the efficacy of the numerical method and code.
Parallel code NSBC: Simulations of relativistic nuclei scattering by a bent crystal
NASA Astrophysics Data System (ADS)
Babaev, A. A.
2014-01-01
The presented program was designed to simulate the passage of relativistic nuclei through a bent crystal. Namely, the input data is related to a nuclei beam. The nuclei move into the crystal under planar channeling and quasichanneling conditions. The program realizes the numerical algorithm to evaluate the trajectory of nucleus in the bent crystal. The program output is formed by the projectile motion data including the angular distribution of nuclei behind the crystal. The program could be useful to simulate the particle tracking at the accelerator facilities used the crystal collimation systems. The code has been written on C++ and designed for the multiprocessor systems (clusters).
NASA Astrophysics Data System (ADS)
Hegde, Ganapathi; Vaya, Pukhraj
2013-10-01
This article presents a parallel architecture for 3-D discrete wavelet transform (3-DDWT). The proposed design is based on the 1-D pipelined lifting scheme. The architecture is fully scalable beyond the present coherent Daubechies filter bank (9, 7). This 3-DDWT architecture has advantages such as no group of pictures restriction and reduced memory referencing. It offers low power consumption, low latency and high throughput. The computing technique is based on the concept that lifting scheme minimises the storage requirement. The application specific integrated circuit implementation of the proposed architecture is done by synthesising it using 65 nm Taiwan Semiconductor Manufacturing Company standard cell library. It offers a speed of 486 MHz with a power consumption of 2.56 mW. This architecture is suitable for real-time video compression even with large frame dimensions.
BMI optimization by using parallel UNDX real-coded genetic algorithm with Beowulf cluster
NASA Astrophysics Data System (ADS)
Handa, Masaya; Kawanishi, Michihiro; Kanki, Hiroshi
2007-12-01
This paper deals with the global optimization algorithm of the Bilinear Matrix Inequalities (BMIs) based on the Unimodal Normal Distribution Crossover (UNDX) GA. First, analyzing the structure of the BMIs, the existence of the typical difficult structures is confirmed. Then, in order to improve the performance of algorithm, based on results of the problem structures analysis and consideration of BMIs characteristic properties, we proposed the algorithm using primary search direction with relaxed Linear Matrix Inequality (LMI) convex estimation. Moreover, in these algorithms, we propose two types of evaluation methods for GA individuals based on LMI calculation considering BMI characteristic properties more. In addition, in order to reduce computational time, we proposed parallelization of RCGA algorithm, Master-Worker paradigm with cluster computing technique.
PORTA: A Massively Parallel Code for 3D Non-LTE Polarized Radiative Transfer
NASA Astrophysics Data System (ADS)
Štěpán, J.
2014-10-01
The interpretation of the Stokes profiles of the solar (stellar) spectral line radiation requires solving a non-LTE radiative transfer problem that can be very complex, especially when the main interest lies in modeling the linear polarization signals produced by scattering processes and their modification by the Hanle effect. One of the main difficulties is due to the fact that the plasma of a stellar atmosphere can be highly inhomogeneous and dynamic, which implies the need to solve the non-equilibrium problem of generation and transfer of polarized radiation in realistic three-dimensional stellar atmospheric models. Here we present PORTA, a computer program we have developed for solving, in three-dimensional (3D) models of stellar atmospheres, the problem of the generation and transfer of spectral line polarization taking into account anisotropic radiation pumping and the Hanle and Zeeman effects in multilevel atoms. The numerical method of solution is based on a highly convergent iterative algorithm, whose convergence rate is insensitive to the grid size, and on an accurate short-characteristics formal solver of the Stokes-vector transfer equation which uses monotonic Bezier interpolation. In addition to the iterative method and the 3D formal solver, another important feature of PORTA is a novel parallelization strategy suitable for taking advantage of massively parallel computers. Linear scaling of the solution with the number of processors allows to reduce the solution time by several orders of magnitude. We present useful benchmarks and a few illustrations of applications using a 3D model of the solar chromosphere resulting from MHD simulations. Finally, we present our conclusions with a view to future research. For more details see Štěpán & Trujillo Bueno (2013).
Robust conjunctive item-place coding by hippocampal neurons parallels learning what happens where
Komorowski, Robert W.; Manns, Joseph R.; Eichenbaum, Howard
2009-01-01
Previous research indicates a critical role of the hippocampus in memory for events in the context in which they occur. However, studies to date have not provided compelling evidence that hippocampal neurons encode event-context conjunctions directly associated with this kind of learning. Here we report that, as animals learn different meanings for items in distinct contexts, individual hippocampal neurons develop responses to specific stimuli in the places where they have differential significance. Furthermore, this conjunctive coding evolves in the form of enhanced item-specific responses within a subset of the pre-existing spatial representation. These findings support the view that conjunctive representations in the hippocampus underlie the acquisition of context specific memories. PMID:19657042
Modeling RF Fields in Hot Plasmas with Parallel Full Wave Code
NASA Astrophysics Data System (ADS)
Spencer, Andrew; Svidzinski, Vladimir; Zhao, Liangji; Galkin, Sergei; Kim, Jin-Soo
2016-10-01
FAR-TECH, Inc. is developing a suite of full wave RF plasma codes. It is based on a meshless formulation in configuration space with adapted cloud of computational points (CCP) capability and using the hot plasma conductivity kernel to model the nonlocal plasma dielectric response. The conductivity kernel is calculated by numerically integrating the linearized Vlasov equation along unperturbed particle trajectories. Work has been done on the following calculations: 1) the conductivity kernel in hot plasmas, 2) a monitor function based on analytic solutions of the cold-plasma dispersion relation, 3) an adaptive CCP based on the monitor function, 4) stencils to approximate the wave equations on the CCP, 5) the solution to the full wave equations in the cold-plasma model in tokamak geometry for ECRH and ICRH range of frequencies, and 6) the solution to the wave equations using the calculated hot plasma conductivity kernel. We will present results on using a meshless formulation on adaptive CCP to solve the wave equations and on implementing the non-local hot plasma dielectric response to the wave equations. The presentation will include numerical results of wave propagation and absorption in the cold and hot tokamak plasma RF models, using DIII-D geometry and plasma parameters. Work is supported by the U.S. DOE SBIR program.
Implementation of a tree algorithm in MCNP code for nuclear well logging applications.
Li, Fusheng; Han, Xiaogang
2012-07-01
The goal of this paper is to develop some modeling capabilities that are missing in the current MCNP code. Those missing capabilities can greatly help for some certain nuclear tools designs, such as a nuclear lithology/mineralogy spectroscopy tool. The new capabilities to be developed in this paper include the following: zone tally, neutron interaction tally, gamma rays index tally and enhanced pulse-height tally. The patched MCNP code also can be used to compute neutron slowing-down length and thermal neutron diffusion length.
NASA Astrophysics Data System (ADS)
Shi, Fei; Wang, Beibei; Selesnick, Ivan W.; Wang, Yao
2006-01-01
This paper introduces an anisotropic decomposition structure of a recently introduced 3-D dual-tree discrete wavelet transform (DDWT), and explores the applications for video denoising and coding. The 3-D DDWT is an attractive video representation because it isolates motion along different directions in separate subbands, and thus leads to sparse video decompositions. Our previous investigation shows that the 3-D DDWT, compared to the standard discrete wavelet transform (DWT), complies better with the statistical models based on sparse presumptions, and gives better visual and numerical results when used for statistical denoising algorithms. Our research on video compression also shows that even with 4:1 redundancy, the 3-D DDWT needs fewer coefficients to achieve the same coding quality (in PSNR) by applying the iterative projection-based noise shaping scheme proposed by Kingsbury. The proposed anisotropic DDWT extends the superiority of isotropic DDWT with more directional subbands without adding to the redundancy. Unlike the original 3-D DDWT which applies dyadic decomposition along all three directions and produces isotropic frequency spacing, it has a non-uniform tiling of the frequency space. By applying this structure, we can improve the denoising results, and the number of significant coefficients can be reduced further, which is beneficial for video coding.
NASA Astrophysics Data System (ADS)
Olson, Richard F.
2013-05-01
Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
NASA Astrophysics Data System (ADS)
Kum, Oyeon; Han, Youngyih; Jeong, Hae Sun
2012-05-01
Minimizing the differences between dose distributions calculated at the treatment planning stage and those delivered to the patient is an essential requirement for successful radiotheraphy. Accurate calculation of dose distributions in the treatment planning process is important and can be done only by using a Monte Carlo calculation of particle transport. In this paper, we perform a further validation of our previously developed parallel Monte Carlo electron and photon transport (PMCEPT) code [Kum and Lee, J. Korean Phys. Soc. 47, 716 (2005) and Kim and Kum, J. Korean Phys. Soc. 49, 1640 (2006)] for applications to clinical radiation problems. A linear accelerator, Siemens' Primus 6 MV, was modeled and commissioned. A thorough validation includes both small fields, closely related to the intensity modulated radiation treatment (IMRT), and large fields. Two-dimensional comparisons with film measurements were also performed. The PMCEPT results, in general, agreed well with the measured data within a maximum error of about 2%. However, considering the experimental errors, the PMCEPT results can provide the gold standard of dose distributions for radiotherapy. The computing time was also much faster, compared to that needed for experiments, although it is still a bottleneck for direct applications to the daily routine treatment planning procedure.
Wilkinson, Karl A; Hine, Nicholas D M; Skylaris, Chris-Kriton
2014-11-11
We present a hybrid MPI-OpenMP implementation of Linear-Scaling Density Functional Theory within the ONETEP code. We illustrate its performance on a range of high performance computing (HPC) platforms comprising shared-memory nodes with fast interconnect. Our work has focused on applying OpenMP parallelism to the routines which dominate the computational load, attempting where possible to parallelize different loops from those already parallelized within MPI. This includes 3D FFT box operations, sparse matrix algebra operations, calculation of integrals, and Ewald summation. While the underlying numerical methods are unchanged, these developments represent significant changes to the algorithms used within ONETEP to distribute the workload across CPU cores. The new hybrid code exhibits much-improved strong scaling relative to the MPI-only code and permits calculations with a much higher ratio of cores to atoms. These developments result in a significantly shorter time to solution than was possible using MPI alone and facilitate the application of the ONETEP code to systems larger than previously feasible. We illustrate this with benchmark calculations from an amyloid fibril trimer containing 41,907 atoms. We use the code to study the mechanism of delamination of cellulose nanofibrils when undergoing sonification, a process which is controlled by a large number of interactions that collectively determine the structural properties of the fibrils. Many energy evaluations were needed for these simulations, and as these systems comprise up to 21,276 atoms this would not have been feasible without the developments described here.
Simulation of Ionospheric E-Region Plasma Turbulence with a Massively Parallel Hybrid PIC/Fluid Code
NASA Astrophysics Data System (ADS)
Young, M.; Oppenheim, M. M.; Dimant, Y. S.
2015-12-01
The Farley-Buneman (FB) and gradient drift (GD) instabilities are plasma instabilities that occur at roughly 100 km in the equatorial E-region ionosphere. They develop when ion-neutral collisions dominate ion motion while electron motion is affected by both electron-neutral collisions and the background magnetic field. GD drift waves grow when the background density gradient and electric field are aligned; FB waves grow when the background electric field causes electrons to E × B drift with a speed slightly larger than the ion acoustic speed. Theory predicts that FB and GD turbulence should develop in the same plasma volume when GD waves create a perturbation electric field that exceeds the threshold value for FB turbulence. However, ionospheric radars, which regularly observe meter-scale irregularities associated with FB turbulence, must infer kilometer-scale GD dynamics rather than observe them directly. Numerical simulations have been unable to simultaneously resolve GD and FB structure. We present results from a parallelized hybrid simulation that uses a particle-in-cell (PIC) method for ions while modeling electrons as an inertialess, quasi-neutral fluid. This approach allows us to reach length scales of hundreds of meters to kilometers with sub-meter resolution, but requires solving a large linear system derived from an elliptic PDE that depends on plasma density, ion flux, and electron parameters. We solve the resultant linear system at each time step via the Portable Extensible Toolkit for Scientific Computing (PETSc). We compare results of simulated FB turbulence from this model to results from a thoroughly tested PIC code and describe progress toward the first simultaneous simulations of FB and GD instabilities. This model has immediate applications to radar observations of the E-region ionosphere, as well as potential applications to the F-region ionosphere and the chromosphere of the Sun.
NASA Astrophysics Data System (ADS)
Socas-Navarro, H.; de la Cruz Rodríguez, J.; Asensio Ramos, A.; Trujillo Bueno, J.; Ruiz Cobo, B.
2015-05-01
With the advent of a new generation of solar telescopes and instrumentation, interpreting chromospheric observations (in particular, spectropolarimetry) requires new, suitable diagnostic tools. This paper describes a new code, NICOLE, that has been designed for Stokes non-LTE radiative transfer, for synthesis and inversion of spectral lines and Zeeman-induced polarization profiles, spanning a wide range of atmospheric heights from the photosphere to the chromosphere. The code features a number of unique features and capabilities and has been built from scratch with a powerful parallelization scheme that makes it suitable for application on massive datasets using large supercomputers. The source code is written entirely in Fortran 90/2003 and complies strictly with the ANSI standards to ensure maximum compatibility and portability. It is being publicly released, with the idea of facilitating future branching by other groups to augment its capabilities. The source code is currently hosted at the following repository: http://https://github.com/hsocasnavarro/NICOLE
NASA Astrophysics Data System (ADS)
Pereira, Tiago M. D.; Uitenbroek, Han
2015-02-01
The emergence of three-dimensional magneto-hydrodynamic simulations of stellar atmospheres has sparked a need for efficient radiative transfer codes to calculate detailed synthetic spectra. We present RH 1.5D, a massively parallel code based on the RH code and capable of performing Zeeman polarised multi-level non-local thermodynamical equilibrium calculations with partial frequency redistribution for an arbitrary amount of chemical species. The code calculates spectra from 3D, 2D or 1D atmospheric models on a column-by-column basis (or 1.5D). While the 1.5D approximation breaks down in the cores of very strong lines in an inhomogeneous environment, it is nevertheless suitable for a large range of scenarios and allows for faster convergence with finer control over the iteration of each simulation column. The code scales well to at least tens of thousands of CPU cores, and is publicly available. In the present work we briefly describe its inner workings, strategies for convergence optimisation, its parallelism, and some possible applications.
Kirk, B.L.; Sartori, E.
1997-06-01
Subsequent to the introduction of High Performance Computing in the developed countries, the Organization for Economic Cooperation and Development/Nuclear Energy Agency (OECD/NEA) created the Task Force on Adapting Computer Codes in Nuclear Applications to Parallel Architectures (under the guidance of the Nuclear Science Committee`s Working Party on Advanced Computing) to study the growth area in supercomputing and its applicability to the nuclear community`s computer codes. The result has been four years of investigation for the Task Force in different subject fields - deterministic and Monte Carlo radiation transport, computational mechanics and fluid dynamics, nuclear safety, atmospheric models and waste management.
Salko, Robert K; Schmidt, Rodney; Avramova, Maria N
2014-01-01
This paper describes major improvements to the computational infrastructure of the CTF sub-channel code so that full-core sub-channel-resolved simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department Of Energy (DOE) Consortium for Advanced Simulations of Light Water (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis. A set of serial code optimizations--including fixing computational inefficiencies, optimizing the numerical approach, and making smarter data storage choices--are first described and shown to reduce both execution time and memory usage by about a factor of ten. Next, a Single Program Multiple Data (SPMD) parallelization strategy targeting distributed memory Multiple Instruction Multiple Data (MIMD) platforms and utilizing domain-decomposition is presented. In this approach, data communication between processors is accomplished by inserting standard MPI calls at strategic points in the code. The domain decomposition approach implemented assigns one MPI process to each fuel assembly, with each domain being represented by its own CTF input file. The creation of CTF input files, both for serial and parallel runs, is also fully automated through use of a pre-processor utility that takes a greatly reduced set of user input over the traditional CTF input file. To run CTF in parallel, two additional libraries are currently needed; MPI, for inter-processor message passing, and the Parallel Extensible Toolkit for Scientific Computation (PETSc), which is leveraged to solve the global pressure matrix in parallel. Results presented include a set of testing and verification calculations and performance tests assessing parallel scaling characteristics up to a full core, sub-channel-resolved model of Watts Bar Unit 1 under hot full-power conditions (193 17x17
Fast Coding Unit Encoding Mechanism for Low Complexity Video Coding
Wu, Yueying; Jia, Kebin; Gao, Guandong
2016-01-01
In high efficiency video coding (HEVC), coding tree contributes to excellent compression performance. However, coding tree brings extremely high computational complexity. Innovative works for improving coding tree to further reduce encoding time are stated in this paper. A novel low complexity coding tree mechanism is proposed for HEVC fast coding unit (CU) encoding. Firstly, this paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP) and content change (CC). Secondly, a CU coding tree probability model is proposed for modeling and predicting CU distribution. Eventually, a CU coding tree probability update is proposed, aiming to address probabilistic model distortion problems caused by CC. Experimental results show that the proposed low complexity CU coding tree mechanism significantly reduces encoding time by 27% for lossy coding and 42% for visually lossless coding and lossless coding. The proposed low complexity CU coding tree mechanism devotes to improving coding performance under various application conditions. PMID:26999741
Fast Coding Unit Encoding Mechanism for Low Complexity Video Coding.
Gao, Yuan; Liu, Pengyu; Wu, Yueying; Jia, Kebin; Gao, Guandong
2016-01-01
In high efficiency video coding (HEVC), coding tree contributes to excellent compression performance. However, coding tree brings extremely high computational complexity. Innovative works for improving coding tree to further reduce encoding time are stated in this paper. A novel low complexity coding tree mechanism is proposed for HEVC fast coding unit (CU) encoding. Firstly, this paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP) and content change (CC). Secondly, a CU coding tree probability model is proposed for modeling and predicting CU distribution. Eventually, a CU coding tree probability update is proposed, aiming to address probabilistic model distortion problems caused by CC. Experimental results show that the proposed low complexity CU coding tree mechanism significantly reduces encoding time by 27% for lossy coding and 42% for visually lossless coding and lossless coding. The proposed low complexity CU coding tree mechanism devotes to improving coding performance under various application conditions.
Parallel algorithm development
Adams, T.F.
1996-06-01
Rapid changes in parallel computing technology are causing significant changes in the strategies being used for parallel algorithm development. One approach is simply to write computer code in a standard language like FORTRAN 77 or with the expectation that the compiler will produce executable code that will run in parallel. The alternatives are: (1) to build explicit message passing directly into the source code; or (2) to write source code without explicit reference to message passing or parallelism, but use a general communications library to provide efficient parallel execution. Application of these strategies is illustrated with examples of codes currently under development.
Hoover, C G; DeGroot, A J; Sherwood, R J
2000-06-01
ParaDyn is a parallel version of the DYNA3D computer program, a three-dimensional explicit finite-element program for analyzing the dynamic response of solids and structures. The ParaDyn program has been used as a production tool for over three years for analyzing problems which range in size from a few tens of thousands of elements to between one-million and ten-million elements. ParaDyn runs on parallel computers provided by the Department of Energy Accelerated Strategic Computing Initiative (ASCI) and the Department of Defense High Performance Computing and Modernization Program. Preprocessing and post-processing software utilities and tools are designed to facilitate the generation of partitioned domains for processors on a massively parallel computer and the visualization of both resultant data and boundary data generated in a parallel simulation. This manual provides a brief overview of the parallel implementation; describes techniques for running the ParaDyn program, tools and utilities; and provides examples of parallel simulations.
Huang, Xiao-Yan; Li, Ming-Li; Xu, Juan; Gao, Yue-Dong; Wang, Wen-Guang; Yin, An-Guo; Li, Xiao-Fei; Sun, Xiao-Mei; Xia, Xue-Shan; Dai, Jie-Jie
2013-04-01
While the tree shrew (Tupaia belangeri chinensis) is an excellent animal model for studying the mechanisms of human diseases, but few studies examine interleukin-2 (IL-2), an important immune factor in disease model evaluation. In this study, a 465 bp of the full-length IL-2 cDNA encoding sequence was cloned from the RNA of tree shrew spleen lymphocytes, which were then cultivated and stimulated with ConA (concanavalin). Clustal W 2.0 was used to compare and analyze the sequence and molecular characteristics, and establish the similarity of the overall structure of IL-2 between tree shrews and other mammals. The homology of the IL-2 nucleotide sequence between tree shrews and humans was 93%, and the amino acid homology was 80%. The phylogenetic tree results, derived through the Neighbour-Joining method using MEGA5.0, indicated a close genetic relationship between tree shrews, Homo sapiens, and Macaca mulatta. The three-dimensional structure analysis showed that the surface charges in most regions of tree shrew IL-2 were similar to between tree shrews and humans; however, the N-glycosylation sites and local structures were different, which may affect antibody binding. These results provide a fundamental basis for the future study of IL-2 monoclonal antibody in tree shrews, thereby improving their utility as a model.
NASA Astrophysics Data System (ADS)
Gassmöller, Rene; Bangerth, Wolfgang
2016-04-01
Particle-in-cell methods have a long history and many applications in geodynamic modelling of mantle convection, lithospheric deformation and crustal dynamics. They are primarily used to track material information, the strain a material has undergone, the pressure-temperature history a certain material region has experienced, or the amount of volatiles or partial melt present in a region. However, their efficient parallel implementation - in particular combined with adaptive finite-element meshes - is complicated due to the complex communication patterns and frequent reassignment of particles to cells. Consequently, many current scientific software packages accomplish this efficient implementation by specifically designing particle methods for a single purpose, like the advection of scalar material properties that do not evolve over time (e.g., for chemical heterogeneities). Design choices for particle integration, data storage, and parallel communication are then optimized for this single purpose, making the code relatively rigid to changing requirements. Here, we present the implementation of a flexible, scalable and efficient particle-in-cell method for massively parallel finite-element codes with adaptively changing meshes. Using a modular plugin structure, we allow maximum flexibility of the generation of particles, the carried tracer properties, the advection and output algorithms, and the projection of properties to the finite-element mesh. We present scaling tests ranging up to tens of thousands of cores and tens of billions of particles. Additionally, we discuss efficient load-balancing strategies for particles in adaptive meshes with their strengths and weaknesses, local particle-transfer between parallel subdomains utilizing existing communication patterns from the finite element mesh, and the use of established parallel output algorithms like the HDF5 library. Finally, we show some relevant particle application cases, compare our implementation to a
Li, Yong Gang; Yang, Yang; Short, Michael P.; Ding, Ze Jun; Zeng, Zhi; Li, Ju
2015-01-01
SRIM-like codes have limitations in describing general 3D geometries, for modeling radiation displacements and damage in nanostructured materials. A universal, computationally efficient and massively parallel 3D Monte Carlo code, IM3D, has been developed with excellent parallel scaling performance. IM3D is based on fast indexing of scattering integrals and the SRIM stopping power database, and allows the user a choice of Constructive Solid Geometry (CSG) or Finite Element Triangle Mesh (FETM) method for constructing 3D shapes and microstructures. For 2D films and multilayers, IM3D perfectly reproduces SRIM results, and can be ∼102 times faster in serial execution and > 104 times faster using parallel computation. For 3D problems, it provides a fast approach for analyzing the spatial distributions of primary displacements and defect generation under ion irradiation. Herein we also provide a detailed discussion of our open-source collision cascade physics engine, revealing the true meaning and limitations of the “Quick Kinchin-Pease” and “Full Cascades” options. The issues of femtosecond to picosecond timescales in defining displacement versus damage, the limitation of the displacements per atom (DPA) unit in quantifying radiation damage (such as inadequacy in quantifying degree of chemical mixing), are discussed. PMID:26658477
Qiang, J.; Leitner, D.; Todd, D.S.; Ryne, R.D.
2005-03-15
The superconducting ECR ion source VENUS serves as the prototype injector ion source for the Rare Isotope Accelerator (RIA) driver linac. The RIA driver linac requires a great variety of high charge state ion beams with up to an order of magnitude higher intensity than currently achievable with conventional ECR ion sources. In order to design the beam line optics of the low energy beam line for the RIA front end for the wide parameter range required for the RIA driver accelerator, reliable simulations of the ion beam extraction from the ECR ion source through the ion mass analyzing system are essential. The RIA low energy beam transport line must be able to transport intense beams (up to 10 mA) of light and heavy ions at 30 keV.For this purpose, LBNL is developing the parallel 3D particle-in-cell code IMPACT to simulate the ion beam transport from the ECR extraction aperture through the analyzing section of the low energy transport system. IMPACT, a parallel, particle-in-cell code, is currently used to model the superconducting RF linac section of RIA and is being modified in order to simulate DC beams from the ECR ion source extraction. By using the high performance of parallel supercomputing we will be able to account consistently for the changing space charge in the extraction region and the analyzing section. A progress report and early results in the modeling of the VENUS source will be presented.
NASA Astrophysics Data System (ADS)
Qiang, J.; Leitner, D.; Todd, D. S.; Ryne, R. D.
2005-03-01
The superconducting ECR ion source VENUS serves as the prototype injector ion source for the Rare Isotope Accelerator (RIA) driver linac. The RIA driver linac requires a great variety of high charge state ion beams with up to an order of magnitude higher intensity than currently achievable with conventional ECR ion sources. In order to design the beam line optics of the low energy beam line for the RIA front end for the wide parameter range required for the RIA driver accelerator, reliable simulations of the ion beam extraction from the ECR ion source through the ion mass analyzing system are essential. The RIA low energy beam transport line must be able to transport intense beams (up to 10 mA) of light and heavy ions at 30 keV. For this purpose, LBNL is developing the parallel 3D particle-in-cell code IMPACT to simulate the ion beam transport from the ECR extraction aperture through the analyzing section of the low energy transport system. IMPACT, a parallel, particle-in-cell code, is currently used to model the superconducting RF linac section of RIA and is being modified in order to simulate DC beams from the ECR ion source extraction. By using the high performance of parallel supercomputing we will be able to account consistently for the changing space charge in the extraction region and the analyzing section. A progress report and early results in the modeling of the VENUS source will be presented.
NASA Astrophysics Data System (ADS)
Feng, Sheng; Fang, Ye; Tam, Ka-Ming; Thakur, Bhupender; Yun, Zhifeng; Tomko, Karen; Moreno, Juana; Ramanujam, Jagannathan; Jarrell, Mark
2013-03-01
The Edwards Anderson model is a typical example of random frustrated system. It has been a long standing problem in computational physics due to its long relaxation time. Some important properties of the low temperature spin glass phase are still poorly understood after decades of study. The recent advances of GPU computing provide a new opportunity to substantially improve the simulations. We developed an MPI-CUDA hybrid code with multi-spin coding for parallel tempering Monte Carlo simulation of Edwards Anderson model. Since the system size is relatively small, and a large number of parallel replicas and Monte Carlo moves are required, the problem suits well for modern GPUs with CUDA architecture. We use the code to perform an extensive simulation on the three-dimensional Edwards Anderson model with an external field. This work is funded by the NSF EPSCoR LA-SiGMA project under award number EPS-1003897. This work is partly done on the machines of Ohio Supercomputer Center.
Hayes, J C; Norman, M
1999-10-28
This report details an investigation into the efficacy of two approaches to solving the radiation diffusion equation within a radiation hydrodynamic simulation. Because leading-edge scientific computing platforms have evolved from large single-node vector processors to parallel aggregates containing tens to thousands of individual CPU's, the ability of an algorithm to maintain high compute efficiency when distributed over a large array of nodes is critically important. The viability of an algorithm thus hinges upon the tripartite question of numerical accuracy, total time to solution, and parallel efficiency.
NASA Astrophysics Data System (ADS)
Misawa, Takeharu; Yoshida, Hiroyuki; Akimoto, Hajime
In Japan Atomic Energy Agency (JAEA), the Innovative Water Reactor for Flexible Fuel Cycle (FLWR) has been developed. For thermal design of FLWR, it is necessary to develop analytical method to predict boiling transition of FLWR. Japan Atomic Energy Agency (JAEA) has been developing three-dimensional two-fluid model analysis code ACE-3D, which adopts boundary fitted coordinate system to simulate complex shape channel flow. In this paper, as a part of development of ACE-3D to apply to rod bundle analysis, introduction of parallelization to ACE-3D and assessments of ACE-3D are shown. In analysis of large-scale domain such as a rod bundle, even two-fluid model requires large number of computational cost, which exceeds upper limit of memory amount of 1 CPU. Therefore, parallelization was introduced to ACE-3D to divide data amount for analysis of large-scale domain among large number of CPUs, and it is confirmed that analysis of large-scale domain such as a rod bundle can be performed by parallel computation with keeping parallel computation performance even using large number of CPUs. ACE-3D adopts two-phase flow models, some of which are dependent upon channel geometry. Therefore, analyses in the domains, which simulate individual subchannel and 37 rod bundle, are performed, and compared with experiments. It is confirmed that the results obtained by both analyses using ACE-3D show agreement with past experimental result qualitatively.
Tian, Wei-Wei; Gao, Yue-Dong; Guo, Yan; Huang, Jing-Fei; Xiao, Chang; Li, Zuo-Sheng; Zhang, Hua-Tang
2012-02-01
The tree shrews, as an ideal animal model receiving extensive attentions to human disease research, demands essential research tools, in particular cellular markers and monoclonal antibodies for immunological studies. In this paper, a 1 365 bp of the full-length CD4 cDNA encoding sequence was cloned from total RNA in peripheral blood of tree shrews, the sequence completes two unknown fragment gaps of tree shrews predicted CD4 cDNA in the GenBank database, and its molecular characteristics were analyzed compared with other mammals by using biology software such as Clustal W2.0 and so forth. The results showed that the extracellular and intracellular domains of tree shrews CD4 amino acid sequence are conserved. The tree shrews CD4 amino acid sequence showed a close genetic relationship with Homo sapiens and Macaca mulatta. Most regions of the tree shrews CD4 molecule surface showed positive charges as humans. However, compared with CD4 extracellular domain D1 of human, CD4 D1 surface of tree shrews showed more negative charges, and more two N-glycosylation sites, which may affect antibody binding. This study provides a theoretical basis for the preparation and functional studies of CD4 monoclonal antibody.
NASA Astrophysics Data System (ADS)
Sijoy, C. D.; Chaturvedi, S.
2016-06-01
Higher-order cell-centered multi-material hydrodynamics (HD) and parallel node-centered radiation transport (RT) schemes are combined self-consistently in three-temperature (3T) radiation hydrodynamics (RHD) code TRHD (Sijoy and Chaturvedi, 2015) developed for the simulation of intense thermal radiation or high-power laser driven RHD. For RT, a node-centered gray model implemented in a popular RHD code MULTI2D (Ramis et al., 2009) is used. This scheme, in principle, can handle RT in both optically thick and thin materials. The RT module has been parallelized using message passing interface (MPI) for parallel computation. Presently, for multi-material HD, we have used a simple and robust closure model in which common strain rates to all materials in a mixed cell is assumed. The closure model has been further generalized to allow different temperatures for the electrons and ions. In addition to this, electron and radiation temperatures are assumed to be in non-equilibrium. Therefore, the thermal relaxation between the electrons and ions and the coupling between the radiation and matter energies are required to be computed self-consistently. This has been achieved by using a node-centered symmetric-semi-implicit (SSI) integration scheme. The electron thermal conduction is calculated using a cell-centered, monotonic, non-linear finite volume scheme (NLFV) suitable for unstructured meshes. In this paper, we have described the details of the 2D, 3T, non-equilibrium, multi-material RHD code developed with a special attention to the coupling of various cell-centered and node-centered formulations along with a suite of validation test problems to demonstrate the accuracy and performance of the algorithms. We also report the parallel performance of RT module. Finally, in order to demonstrate the full capability of the code implementation, we have presented the simulation of laser driven shock propagation in a layered thin foil. The simulation results are found to be in good
NASA Astrophysics Data System (ADS)
Meléndez, A.; Korenaga, J.; Sallarès, V.; Miniussi, A.; Ranero, C. R.
2015-10-01
We present a new 3-D traveltime tomography code (TOMO3D) for the modelling of active-source seismic data that uses the arrival times of both refracted and reflected seismic phases to derive the velocity distribution and the geometry of reflecting boundaries in the subsurface. This code is based on its popular 2-D version TOMO2D from which it inherited the methods to solve the forward and inverse problems. The traveltime calculations are done using a hybrid ray-tracing technique combining the graph and bending methods. The LSQR algorithm is used to perform the iterative regularized inversion to improve the initial velocity and depth models. In order to cope with an increased computational demand due to the incorporation of the third dimension, the forward problem solver, which takes most of the run time (˜90 per cent in the test presented here), has been parallelized with a combination of multi-processing and message passing interface standards. This parallelization distributes the ray-tracing and traveltime calculations among available computational resources. The code's performance is illustrated with a realistic synthetic example, including a checkerboard anomaly and two reflectors, which simulates the geometry of a subduction zone. The code is designed to invert for a single reflector at a time. A data-driven layer-stripping strategy is proposed for cases involving multiple reflectors, and it is tested for the successive inversion of the two reflectors. Layers are bound by consecutive reflectors, and an initial velocity model for each inversion step incorporates the results from previous steps. This strategy poses simpler inversion problems at each step, allowing the recovery of strong velocity discontinuities that would otherwise be smoothened.
Salko, Robert K.; Schmidt, Rodney C.; Avramova, Maria N.
2014-11-23
This study describes major improvements to the computational infrastructure of the CTF subchannel code so that full-core, pincell-resolved (i.e., one computational subchannel per real bundle flow channel) simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department Of Energy Consortium for Advanced Simulation of Light Water Reactors (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis.
NASA Astrophysics Data System (ADS)
Zaghi, S.
2014-07-01
OFF, an open source (free software) code for performing fluid dynamics simulations, is presented. The aim of OFF is to solve, numerically, the unsteady (and steady) compressible Navier-Stokes equations of fluid dynamics by means of finite volume techniques: the research background is mainly focused on high-order (WENO) schemes for multi-fluids, multi-phase flows over complex geometries. To this purpose a highly modular, object-oriented application program interface (API) has been developed. In particular, the concepts of data encapsulation and inheritance available within Fortran language (from standard 2003) have been stressed in order to represent each fluid dynamics "entity" (e.g. the conservative variables of a finite volume, its geometry, etc…) by a single object so that a large variety of computational libraries can be easily (and efficiently) developed upon these objects. The main features of OFF can be summarized as follows: Programming LanguageOFF is written in standard (compliant) Fortran 2003; its design is highly modular in order to enhance simplicity of use and maintenance without compromising the efficiency; Parallel Frameworks Supported the development of OFF has been also targeted to maximize the computational efficiency: the code is designed to run on shared-memory multi-cores workstations and distributed-memory clusters of shared-memory nodes (supercomputers); the code's parallelization is based on Open Multiprocessing (OpenMP) and Message Passing Interface (MPI) paradigms; Usability, Maintenance and Enhancement in order to improve the usability, maintenance and enhancement of the code also the documentation has been carefully taken into account; the documentation is built upon comprehensive comments placed directly into the source files (no external documentation files needed): these comments are parsed by means of doxygen free software producing high quality html and latex documentation pages; the distributed versioning system referred as git
Icarus: A 2D direct simulation Monte Carlo (DSMC) code for parallel computers. User`s manual - V.3.0
Bartel, T.; Plimpton, S.; Johannes, J.; Payne, J.
1996-10-01
Icarus is a 2D Direct Simulation Monte Carlo (DSMC) code which has been optimized for the parallel computing environment. The code is based on the DSMC method of Bird and models from free-molecular to continuum flowfields in either cartesian (x, y) or axisymmetric (z, r) coordinates. Computational particles, representing a given number of molecules or atoms, are tracked as they have collisions with other particles or surfaces. Multiple species, internal energy modes (rotation and vibration), chemistry, and ion transport are modelled. A new trace species methodology for collisions and chemistry is used to obtain statistics for small species concentrations. Gas phase chemistry is modelled using steric factors derived from Arrhenius reaction rates. Surface chemistry is modelled with surface reaction probabilities. The electron number density is either a fixed external generated field or determined using a local charge neutrality assumption. Ion chemistry is modelled with electron impact chemistry rates and charge exchange reactions. Coulomb collision cross-sections are used instead of Variable Hard Sphere values for ion-ion interactions. The electrostatic fields can either be externally input or internally generated using a Langmuir-Tonks model. The Icarus software package includes the grid generation, parallel processor decomposition, postprocessing, and restart software. The commercial graphics package, Tecplot, is used for graphics display. The majority of the software packages are written in standard Fortran.
Large Scale Earth's Bow Shock with Northern IMF as Simulated by PIC Code in Parallel with MHD Model
NASA Astrophysics Data System (ADS)
Baraka, Suleiman
2016-06-01
In this paper, we propose a 3D kinetic model (particle-in-cell, PIC) for the description of the large scale Earth's bow shock. The proposed version is stable and does not require huge or extensive computer resources. Because PIC simulations work with scaled plasma and field parameters, we also propose to validate our code by comparing its results with the available MHD simulations under same scaled solar wind (SW) and (IMF) conditions. We report new results from the two models. In both codes the Earth's bow shock position is found to be ≈14.8 R E along the Sun-Earth line, and ≈29 R E on the dusk side. Those findings are consistent with past in situ observations. Both simulations reproduce the theoretical jump conditions at the shock. However, the PIC code density and temperature distributions are inflated and slightly shifted sunward when compared to the MHD results. Kinetic electron motions and reflected ions upstream may cause this sunward shift. Species distributions in the foreshock region are depicted within the transition of the shock (measured ≈2 c/ ω pi for Θ Bn = 90° and M MS = 4.7) and in the downstream. The size of the foot jump in the magnetic field at the shock is measured to be (1.7 c/ ω pi ). In the foreshocked region, the thermal velocity is found equal to 213 km s-1 at 15 R E and is equal to 63 km s -1 at 12 R E (magnetosheath region). Despite the large cell size of the current version of the PIC code, it is powerful to retain macrostructure of planets magnetospheres in very short time, thus it can be used for pedagogical test purposes. It is also likely complementary with MHD to deepen our understanding of the large scale magnetosphere.
Mischerikow, Nikolai; van Nierop, Pim; Li, Ka Wan; Bernstein, Hans-Gert; Smit, August B; Heck, Albert J R; Altelaar, A F Maarten
2010-10-01
Isobaric stable isotope labeling of peptides using iTRAQ is an important method for MS based quantitative proteomics. Traditionally, quantitative analysis of iTRAQ labeled peptides has been confined to beam-type instruments because of the weak detection capabilities of ion traps for low mass ions. Recent technical advances in fragmentation techniques on linear ion traps and the hybrid linear ion trap-orbitrap allow circumventing this limitation. Namely, PQD and HCD facilitate iTRAQ analysis on these instrument types. Here we report a method for iTRAQ-based relative quantification on the ETD enabled LTQ Orbitrap XL, which is based on parallel peptide quantification and peptide identification. iTRAQ reporter ion generation is performed by HCD, while CID and ETD provide peptide identification data in parallel in the LTQ ion trap. This approach circumvents problems accompanying iTRAQ reporter ion generation with ETD and allows quantitative, decision tree-based CID/ETD experiments. Furthermore, the use of HCD solely for iTRAQ reporter ion read out significantly reduces the number of ions needed to obtain informative spectra, which significantly reduces the analysis time. Finally, we show that integration of this method, both with existing CID and ETD methods as well as with existing iTRAQ data analysis workflows, is simple to realize. By applying our approach to the analysis of the synapse proteome from human brain biopsies, we demonstrate that it outperforms a latest generation MALDI TOF/TOF instrument, with improvements in both peptide and protein identification and quantification. Conclusively, our work shows how HCD, CID and ETD can be beneficially combined to enable iTRAQ-based quantification on an ETD-enabled LTQ Orbitrap XL.
A parallel PCG solver for MODFLOW.
Dong, Yanhui; Li, Guomin
2009-01-01
In order to simulate large-scale ground water flow problems more efficiently with MODFLOW, the OpenMP programming paradigm was used to parallelize the preconditioned conjugate-gradient (PCG) solver with in this study. Incremental parallelization, the significant advantage supported by OpenMP on a shared-memory computer, made the solver transit to a parallel program smoothly one block of code at a time. The parallel PCG solver, suitable for both MODFLOW-2000 and MODFLOW-2005, is verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. Based on the timing results, execution times using the parallel PCG solver are typically about 1.40 to 5.31 times faster than those using the serial one. In addition, the simulation results are the exact same as the original PCG solver, because the majority of serial codes were not changed. It is worth noting that this parallelizing approach reduces cost in terms of software maintenance because only a single source PCG solver code needs to be maintained in the MODFLOW source tree.
GOTPM: a parallel hybrid particle-mesh treecode
NASA Astrophysics Data System (ADS)
Dubinski, John; Kim, Juhan; Park, Changbom; Humble, Robin
2004-02-01
We describe a parallel, cosmological N-body code based on a hybrid scheme using the particle-mesh (PM) and Barnes-Hut (BH) oct-tree algorithm. We call the algorithm GOTPM for Grid-of-Oct-Trees-Particle-Mesh. The code is parallelized using the Message Passing Interface (MPI) library and is optimized to run on Beowulf clusters as well as symmetric multi-processors. The gravitational potential is determined on a mesh using a standard PM method with particle forces determined through interpolation. The softened PM force is corrected for short range interactions using a grid of localized BH trees throughout the entire simulation volume in a completely analogous way to P3M methods. This method makes no assumptions about the local density for short range force corrections and so is consistent with the results of the P3M method in the limit that the treecode opening angle parameter, θ→0. The PM method is parallelized using one-dimensional slice domain decomposition. Particles are distributed in slices of equal width to allow mass assignment onto mesh points. The Fourier transforms in the PM method are done in parallel using the MPI implementation of the FFTW package. Parallelization for the tree force corrections is achieved again using one-dimensional slices but the width of each slice is allowed to vary according to the amount of computational work required by the particles within each slice to achieve load balance. The tree force corrections dominate the computational load and so imbalances in the PM density assignment step do not impact the overall load balance and performance significantly. The code performance scales well to 128 processors and is significantly better than competing methods. We present preliminary results from simulations run on different platforms containing up to N=1 G particles to verify the code.
NASA Astrophysics Data System (ADS)
Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor
2014-06-01
TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model, describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottle-necks were identified and reduced in the most feasible way, leading for instance to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum-, energy- and mass-coupling delivered good performance together with a high scalability on shared memory architectures. In contrast to that, the approach for distributed memory systems was to solve sub-problems resulting by the decomposition of the initial Cartesian geometry. Thread communication for the sub-problem boundary updates was accomplished by the Message Passing Interface (MPI) standard.
Shumaker, Dana E.; Steefel, Carl I.
2016-06-21
The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL, (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusake (Steefel Yabsaki 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.
NASA Astrophysics Data System (ADS)
Bass, Graham; Thomas, Russell; Pearce, Julia
2009-04-01
The most recent electron dosimetry code of practice for radiotherapy written by the Institute of Physics and Engineering in Medicine was published in 2003 and is based on the NPL electron absorbed dose to water calibration service. NPL has calibrated many Scanditronix type NACP-02 and PTW Roos type 34001 parallel plate ionization chambers in terms of absorbed dose to water, for use with the code of practice. The results of the calibrations of these chamber types summarized here include the absorbed dose to water sensitivity, where the mean calibration factor standard deviations are 5.8% for NACP-02 chambers and 1.1% for PTW Roos chambers. The correction for the polarity effect is shown to be small (less than 0.2% for all beam qualities) but with a discernible beam quality dependence. The correction for recombination is shown to be consistent and reproducible, and an analysis of these results suggests that the plate separation of the NACP-02 chambers is more variable from chamber to chamber than with the PTW Roos chambers. The calibration of these chambers is shown to be repeatable within ±0.2% over 2-3 years. It is also shown that check source measurements can be repeated within ±0.3% over several years. The results justify the use of NACP-02 and PTW 34001 chambers as secondary standards, but also indicate that the PTW 34001 chambers show less variation from chamber to chamber.
Bass, Graham; Thomas, Russell; Pearce, Julia
2009-04-21
The most recent electron dosimetry code of practice for radiotherapy written by the Institute of Physics and Engineering in Medicine was published in 2003 and is based on the NPL electron absorbed dose to water calibration service. NPL has calibrated many Scanditronix type NACP-02 and PTW Roos type 34001 parallel plate ionization chambers in terms of absorbed dose to water, for use with the code of practice. The results of the calibrations of these chamber types summarized here include the absorbed dose to water sensitivity, where the mean calibration factor standard deviations are 5.8% for NACP-02 chambers and 1.1% for PTW Roos chambers. The correction for the polarity effect is shown to be small (less than 0.2% for all beam qualities) but with a discernible beam quality dependence. The correction for recombination is shown to be consistent and reproducible, and an analysis of these results suggests that the plate separation of the NACP-02 chambers is more variable from chamber to chamber than with the PTW Roos chambers. The calibration of these chambers is shown to be repeatable within +/-0.2% over 2-3 years. It is also shown that check source measurements can be repeated within +/-0.3% over several years. The results justify the use of NACP-02 and PTW 34001 chambers as secondary standards, but also indicate that the PTW 34001 chambers show less variation from chamber to chamber.
Morozov, Dmitriy; Weber, Gunther H.
2014-03-31
Topological techniques provide robust tools for data analysis. They are used, for example, for feature extraction, for data de-noising, and for comparison of data sets. This chapter concerns contour trees, a topological descriptor that records the connectivity of the isosurfaces of scalar functions. These trees are fundamental to analysis and visualization of physical phenomena modeled by real-valued measurements. We study the parallel analysis of contour trees. After describing a particular representation of a contour tree, called local{global representation, we illustrate how di erent problems that rely on contour trees can be solved in parallel with minimal communication.
ERIC Educational Resources Information Center
Walter, Pierre
2012-01-01
This study examines how cultural codes in environmental adult education can be used to "frame" collective identity, develop counterhegemonic ideologies, and catalyse "educative-activism" within social movements. Three diverse examples are discussed, spanning environmental movements in urban Victoria, British Columbia, Canada,…
A parallelized binary search tree
Technology Transfer Automated Retrieval System (TEKTRAN)
PTTRNFNDR is an unsupervised statistical learning algorithm that detects patterns in DNA sequences, protein sequences, or any natural language texts that can be decomposed into letters of a finite alphabet. PTTRNFNDR performs complex mathematical computations and its processing time increases when i...
Rubio, L; Ortiz, M C; Sarabia, L A
2014-04-11
A non-separative, fast and inexpensive spectrofluorimetric method based on the second order calibration of excitation-emission fluorescence matrices (EEMs) was proposed for the determination of carbaryl, carbendazim and 1-naphthol in dried lime tree flowers. The trilinearity property of three-way data was used to handle the intrinsic fluorescence of lime flowers and the difference in the fluorescence intensity of each analyte. It also made possible to identify unequivocally each analyte. Trilinearity of the data tensor guarantees the uniqueness of the solution obtained through parallel factor analysis (PARAFAC), so the factors of the decomposition match up with the analytes. In addition, an experimental procedure was proposed to identify, with three-way data, the quenching effect produced by the fluorophores of the lime flowers. This procedure also enabled the selection of the adequate dilution of the lime flowers extract to minimize the quenching effect so the three analytes can be quantified. Finally, the analytes were determined using the standard addition method for a calibration whose standards were chosen with a D-optimal design. The three analytes were unequivocally identified by the correlation between the pure spectra and the PARAFAC excitation and emission spectral loadings. The trueness was established by the accuracy line "calculated concentration versus added concentration" in all cases. Better decision limit values (CCα), in x0=0 with the probability of false positive fixed at 0.05, were obtained for the calibration performed in pure solvent: 2.97 μg L(-1) for 1-naphthol, 3.74 μg L(-1) for carbaryl and 23.25 μg L(-1) for carbendazim. The CCα values for the second calibration carried out in matrix were 1.61, 4.34 and 51.75 μg L(-1) respectively; while the values obtained considering only the pure samples as calibration set were: 2.65, 8.61 and 28.7 μg L(-1), respectively.
NASA Astrophysics Data System (ADS)
Štěpán, Jiří; Trujillo Bueno, Javier
2013-09-01
The interpretation of the intensity and polarization of the spectral line radiation produced in the atmosphere of the Sun and of other stars requires solving a radiative transfer problem that can be very complex, especially when the main interest lies in modeling the spectral line polarization produced by scattering processes and the Hanle and Zeeman effects. One of the difficulties is that the plasma of a stellar atmosphere can be highly inhomogeneous and dynamic, which implies the need to solve the non-equilibrium problem of the generation and transfer of polarized radiation in realistic three-dimensional (3D) stellar atmospheric models. Here we present PORTA, an efficient multilevel radiative transfer code we have developed for the simulation of the spectral line polarization caused by scattering processes and the Hanle and Zeeman effects in 3D models of stellar atmospheres. The numerical method of solution is based on the non-linear multigrid iterative method and on a novel short-characteristics formal solver of the Stokes-vector transfer equation which uses monotonic Bézier interpolation. Therefore, with PORTA the computing time needed to obtain at each spatial grid point the self-consistent values of the atomic density matrix (which quantifies the excitation state of the atomic system) scales linearly with the total number of grid points. Another crucial feature of PORTA is its parallelization strategy, which allows us to speed up the numerical solution of complicated 3D problems by several orders of magnitude with respect to sequential radiative transfer approaches, given its excellent linear scaling with the number of available processors. The PORTA code can also be conveniently applied to solve the simpler 3D radiative transfer problem of unpolarized radiation in multilevel systems.
van Beugen, Boeke J.; Gao, Zhenyu; Boele, Henk-Jan; Hoebeek, Freek; De Zeeuw, Chris I.
2013-01-01
Cerebellar granule cells (GrCs) convey information from mossy fibers (MFs) to Purkinje cells (PCs) via their parallel fibers (PFs). MF to GrC signaling allows transmission of frequencies up to 1 kHz and GrCs themselves can also fire bursts of action potentials with instantaneous frequencies up to 1 kHz. So far, in the scientific literature no evidence has been shown that these high-frequency bursts also exist in awake, behaving animals. More so, it remains to be shown whether such high-frequency bursts can transmit temporally coded information from MFs to PCs and/or whether these patterns of activity contribute to the spatiotemporal filtering properties of the GrC layer. Here, we show that, upon sensory stimulation in both un-anesthetized rabbits and mice, GrCs can show bursts that consist of tens of spikes at instantaneous frequencies over 800 Hz. In vitro recordings from individual GrC-PC pairs following high-frequency stimulation revealed an overall low initial release probability of ~0.17. Nevertheless, high-frequency burst activity induced a short-lived facilitation to ensure signaling within the first few spikes, which was rapidly followed by a reduction in transmitter release. The facilitation rate among individual GrC-PC pairs was heterogeneously distributed and could be classified as either “reluctant” or “responsive” according to their release characteristics. Despite the variety of efficacy at individual connections, grouped activity in GrCs resulted in a linear relationship between PC response and PF burst duration at frequencies up to 300 Hz allowing rate coding to persist at the network level. Together, these findings support the hypothesis that the cerebellar granular layer acts as a spatiotemporal filter between MF input and PC output (D’Angelo and De Zeeuw, 2009). PMID:23734102
NASA Astrophysics Data System (ADS)
Kumar, J.; Mills, R. T.; Lichtner, P. C.; Hammond, G. E.
2010-12-01
Fracture dominated flows occur in numerous subsurface geochemical processes and at many different scales in rock pore structures, micro-fractures, fracture networks and faults. Fractured porous media can be modeled as multiple interacting continua which are connected to each other through transfer terms that capture the flow of mass and energy in response to pressure, temperature and concentration gradients. However, the analysis of large-scale transient problems using the multiple interacting continuum approach presents an algorithmic and computational challenge for problems with very large numbers of degrees of freedom. A generalized dual porosity model based on the Dual Continuum Disconnected Matrix approach has been implemented within a massively parallel multiphysics-multicomponent-multiphase subsurface reactive flow and transport code PFLOTRAN. Developed as part of the Department of Energy's SciDAC-2 program, PFLOTRAN provides subsurface simulation capabilities that can scale from laptops to ultrascale supercomputers, and utilizes the PETSc framework to solve the large, sparse algebraic systems that arises in complex subsurface reactive flow and transport problems. It has been successfully applied to the solution of problems composed of more than two billions degrees of freedom, utilizing up to 131,072 processor cores on Jaguar, the Cray XT5 system at Oak Ridge National Laboratory that is the world’s fastest supercomputer. Building upon the capabilities and computational efficiency of PFLOTRAN, we will present an implementation of the multiple interacting continua formulation for fractured porous media along with an application case study.
NASA Astrophysics Data System (ADS)
Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul
2013-10-01
The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers, one at NERSC in the US built by Cray named Edison and the other at the Barcelona Supercomputer Center (BSC) in Spain built by IBM named MareNostrum III (MNIII) just happen to share the same Intel ``Sandy Bridge'' processors. The successful port of UCAN2 to MNIII which came online first has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.
NASA Astrophysics Data System (ADS)
Engel, D.; Klews, M.; Wunner, G.
2009-02-01
We have developed a new method for the fast computation of wavelengths and oscillator strengths for medium-Z atoms and ions, up to iron, at neutron star magnetic field strengths. The method is a parallelized Hartree-Fock approach in adiabatic approximation based on finite-element and B-spline techniques. It turns out that typically 15-20 finite elements are sufficient to calculate energies to within a relative accuracy of 10-5 in 4 or 5 iteration steps using B-splines of 6th order, with parallelization speed-ups of 20 on a 26-processor machine. Results have been obtained for the energies of the ground states and excited levels and for the transition strengths of astrophysically relevant atoms and ions in the range Z=2…26 in different ionization stages. Catalogue identifier: AECC_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECC_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3845 No. of bytes in distributed program, including test data, etc.: 27 989 Distribution format: tar.gz Programming language: MPI/Fortran 95 and Python Computer: Cluster of 1-26 HP Compaq dc5750 Operating system: Fedora 7 Has the code been vectorised or parallelized?: Yes RAM: 1 GByte Classification: 2.1 External routines: MPI/GFortran, LAPACK, PyLab/Matplotlib Nature of problem: Calculations of synthetic spectra [1] of strongly magnetized neutron stars are bedevilled by the lack of data for atoms in intense magnetic fields. While the behaviour of hydrogen and helium has been investigated in detail (see, e.g., [2]), complete and reliable data for heavier elements, in particular iron, are still missing. Since neutron stars are formed by the collapse of the iron cores of massive stars, it may be assumed that their atmospheres contain an iron plasma. Our objective is to fill the gap
Poirot, Jordan; De Luna, Paolo; Rainer, Gregor
2016-04-01
We comprehensively characterize spiking and visual evoked potential (VEP) activity in tree shrew V1 and V2 using Cartesian, hyperbolic, and polar gratings. Neural selectivity to structure of Cartesian gratings was higher than other grating classes in both visual areas. From V1 to V2, structure selectivity of spiking activity increased, whereas corresponding VEP values tended to decrease, suggesting that single-neuron coding of Cartesian grating attributes improved while the cortical columnar organization of these neurons became less precise from V1 to V2. We observed that neurons in V2 generally exhibited similar selectivity for polar and Cartesian gratings, suggesting that structure of polar-like stimuli might be encoded as early as in V2. This hypothesis is supported by the preference shift from V1 to V2 toward polar gratings of higher spatial frequency, consistent with the notion that V2 neurons encode visual scene borders and contours. Neural sensitivity to modulations of polarity of hyperbolic gratings was highest among all grating classes and closely related to the visual receptive field (RF) organization of ON- and OFF-dominated subregions. We show that spatial RF reconstructions depend strongly on grating class, suggesting that intracortical contributions to RF structure are strongest for Cartesian and polar gratings. Hyperbolic gratings tend to recruit least cortical elaboration such that the RF maps are similar to those generated by sparse noise, which most closely approximate feedforward inputs. Our findings complement previous literature in primates, rodents, and carnivores and highlight novel aspects of shape representation and coding occurring in mammalian early visual cortex.
Categorizing ideas about trees: a tree of trees.
Fisler, Marie; Lecointre, Guillaume
2013-01-01
The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a "tree of trees." Then, we categorize schools of tree-representations. Classical schools like "cladists" and "pheneticists" are recovered but others are not: "gradists" are separated into two blocks, one of them being called here "grade theoreticians." We propose new interesting categories like the "buffonian school," the "metaphoricians," and those using "strictly genealogical classifications." We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization.
DIANE multiparticle transport code
NASA Astrophysics Data System (ADS)
Caillaud, M.; Lemaire, S.; Ménard, S.; Rathouit, P.; Ribes, J. C.; Riz, D.
2014-06-01
DIANE is the general Monte Carlo code developed at CEA-DAM. DIANE is a 3D multiparticle multigroup code. DIANE includes automated biasing techniques and is optimized for massive parallel calculations.
Scioto: A Framework for Global-ViewTask Parallelism
Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy
2008-09-09
We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that is offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.
Eriksson, Maria E; Hoffman, Daniel; Kaduk, Mateusz; Mauriat, Mélanie; Moritz, Thomas
2015-02-01
Bioactive gibberellins (GAs) have been implicated in short day (SD)-induced growth cessation in Populus, because exogenous applications of bioactive GAs to hybrid aspens (Populus tremula × tremuloides) under SD conditions delay growth cessation. However, this effect diminishes with time, suggesting that plants may cease growth following exposure to SDs due to a reduction in sensitivity to GAs. In order to validate and further explore the role of GAs in growth cessation, we perturbed GA biosynthesis or signalling in hybrid aspen plants by overexpressing AtGA20ox1, AtGA2ox2 and PttGID1.3 (encoding GA biosynthesis enzymes and a GA receptor). We found trees with elevated concentrations of bioactive GA, due to overexpression of AtGA20ox1, continued to grow in SD conditions and were insensitive to the level of FLOWERING LOCUS T2 (FT2) expression. As transgenic plants overexpressing the PttGID1.3 GA receptor responded in a wild-type (WT) manner to SD conditions, this insensitivity did not result from limited receptor availability. As high concentrations of bioactive GA during SD conditions were sufficient to sustain shoot elongation growth in hybrid aspen trees, independent of FT2 expression levels, we conclude elongation growth in trees is regulated by both GA- and long day-responsive pathways, similar to the regulation of flowering in Arabidopsis thaliana.
Parallelized direct execution simulation of message-passing parallel programs
NASA Technical Reports Server (NTRS)
Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.
1994-01-01
As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
A systolic array parallelizing compiler
Tseng, P.S. )
1990-01-01
This book presents a completely new approach to the problem of systolic array parallelizing compiler. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler which can generate efficient parallel code for complete LINPACK routines. This book begins by analyzing the architectural strength of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code are given to clarify the overall picture of the compiler. The book concludes that systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.
King, J. R.; Pankin, A. Y.; Kruger, S. E.; Snyder, P. B.
2016-06-24
The extended-MHD NIMROD code [C. R. Sovinec and J. R. King, J. Comput. Phys. 229, 5803 (2010)] is verified against the ideal-MHD ELITE code [H. R. Wilson et al., Phys. Plasmas 9, 1277 (2002)] on a diverted tokamak discharge. When the NIMROD model complexity is increased incrementally, resistive and first-order finite-Larmour radius effects are destabilizing and stabilizing, respectively. Lastly, the full result is compared to local analytic calculations which are found to overpredict both the resistive destabilization and drift stabilization in comparison to the NIMROD computations.
NASA Astrophysics Data System (ADS)
King, J. R.; Pankin, A. Y.; Kruger, S. E.; Snyder, P. B.
2016-06-01
The extended-MHD NIMROD code [C. R. Sovinec and J. R. King, J. Comput. Phys. 229, 5803 (2010)] is verified against the ideal-MHD ELITE code [H. R. Wilson et al., Phys. Plasmas 9, 1277 (2002)] on a diverted tokamak discharge. When the NIMROD model complexity is increased incrementally, resistive and first-order finite-Larmour radius effects are destabilizing and stabilizing, respectively. The full result is compared to local analytic calculations which are found to overpredict both the resistive destabilization and drift stabilization in comparison to the NIMROD computations.
Categorizing Ideas about Trees: A Tree of Trees
Fisler, Marie; Lecointre, Guillaume
2013-01-01
The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a “tree of trees.” Then, we categorize schools of tree-representations. Classical schools like “cladists” and “pheneticists” are recovered but others are not: “gradists” are separated into two blocks, one of them being called here “grade theoreticians.” We propose new interesting categories like the “buffonian school,” the “metaphoricians,” and those using “strictly genealogical classifications.” We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization. PMID:23950877
Trees of trees: an approach to comparing multiple alternative phylogenies.
Nye, Tom M W
2008-10-01
Phylogenetic analysis very commonly produces several alternative trees for a given fixed set of taxa. For example, different sets of orthologous genes may be analyzed, or the analysis may sample from a distribution of probable trees. This article describes an approach to comparing and visualizing multiple alternative phylogenies via the idea of a "tree of trees" or "meta-tree." A meta-tree clusters phylogenies with similar topologies together in the same way that a phylogeny clusters species with similar DNA sequences. Leaf nodes on a meta-tree correspond to the original set of phylogenies given by some analysis, whereas interior nodes correspond to certain consensus topologies. The construction of meta-trees is motivated by analogy with construction of a most parsimonious tree for DNA data, but instead of using DNA letters, in a meta-tree the characters are partitions or splits of the set of taxa. An efficient algorithm for meta-tree construction is described that makes use of a known relationship between the majority consensus and parsimony in terms of gain and loss of splits. To illustrate these ideas meta-trees are constructed for two datasets: a set of gene trees for species of yeast and trees from a bootstrap analysis of a set of gene trees in ray-finned fish. A software tool for constructing meta-trees and comparing alternative phylogenies is available online, and the source code can be obtained from the author.
GSHR-Tree: a spatial index tree based on dynamic spatial slot and hash table in grid environments
NASA Astrophysics Data System (ADS)
Chen, Zhanlong; Wu, Xin-cai; Wu, Liang
2008-12-01
distributed operation, reduplication operation transfer operation of spatial index in the grid environment. The design of GSHR-Tree has ensured the performance of the load balance in the parallel computation. This tree structure is fit for the parallel process of the spatial information in the distributed network environments. Instead of spatial object's recursive comparison where original R tree has been used, the algorithm builds the spatial index by applying binary code operation in which computer runs more efficiently, and extended dynamic hash code for bit comparison. In GSHR-Tree, a new server is assigned to the network whenever a split of a full node is required. We describe a more flexible allocation protocol which copes with a temporary shortage of storage resources. It uses a distributed balanced binary spatial tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. The application manipulates the GSHR-Tree structure from a node in the grid environment. The node addresses the tree through its image that the splits can make outdated. This may generate addressing errors, solved by the forwarding among the servers. In this paper, a spatial index data distribution algorithm that limits the number of servers has been proposed. We improve the storage utilization at the cost of additional messages. The structure of GSHR-Tree is believed that the scheme of this grid spatial index should fit the needs of new applications using endlessly larger sets of spatial data. Our proposal constitutes a flexible storage allocation method for a distributed spatial index. The insertion policy can be tuned dynamically to cope with periods of storage shortage. In such cases storage balancing should be favored for better space utilization, at the price of extra message exchanges between servers. This structure makes a compromise in the updating of the duplicated index and the transformation of the spatial index data. Meeting the
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.
NASA Technical Reports Server (NTRS)
Hribar, Michelle R.; Frumkin, Michael; Jin, Haoqiang; Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
Over the past decade, high performance computing has evolved rapidly; systems based on commodity microprocessors have been introduced in quick succession from at least seven vendors/families. Porting codes to every new architecture is a difficult problem; in particular, here at NASA, there are many large CFD applications that are very costly to port to new machines by hand. The LCM ("Legacy Code Modernization") Project is the development of an integrated parallelization environment (IPE) which performs the automated mapping of legacy CFD (Fortran) applications to state-of-the-art high performance computers. While most projects to port codes focus on the parallelization of the code, we consider porting to be an iterative process consisting of several steps: 1) code cleanup, 2) serial optimization,3) parallelization, 4) performance monitoring and visualization, 5) intelligent tools for automated tuning using performance prediction and 6) machine specific optimization. The approach for building this parallelization environment is to build the components for each of the steps simultaneously and then integrate them together. The demonstration will exhibit our latest research in building this environment: 1. Parallelizing tools and compiler evaluation. 2. Code cleanup and serial optimization using automated scripts 3. Development of a code generator for performance prediction 4. Automated partitioning 5. Automated insertion of directives. These demonstrations will exhibit the effectiveness of an automated approach for all the steps involved with porting and tuning a legacy code application for a new architecture.
Force user's manual: A portable, parallel FORTRAN
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.
1990-01-01
The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed.
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the code necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE
NASA Astrophysics Data System (ADS)
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the code necessary to obtain forces efficiently from special purpose "GRAPE" hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE
Status of TRANSP Parallel Services
NASA Astrophysics Data System (ADS)
Indireshkumar, K.; Andre, Robert; McCune, Douglas; Randerson, Lewis
2006-10-01
The PPPL TRANSP code suite has been used successfully over many years to carry out time dependent simulations of tokamak plasmas. However, accurately modeling certain phenomena such as RF heating and fast ion behavior using TRANSP requires extensive computational power and will benefit from parallelization. Parallelizing all of TRANSP is not required and parts will run sequentially while other parts run parallelized. To efficiently use a site's parallel services, the parallelized TRANSP modules are deployed to a shared ``parallel service'' on a separate cluster. The PPPL Monte Carlo fast ion module NUBEAM and the MIT RF module TORIC are the first TRANSP modules to be so deployed. This poster will show the performance scaling of these modules within the parallel server. Communications between the serial client and the parallel server will be described in detail, and measurements of startup and communications overhead will be shown. Physics modeling benefits for TRANSP users will be assessed.
Chen, J.; Alpan, F. A.; Fischer, G.A.; Fero, A.H.
2011-07-01
Traditional two-dimensional (2D)/one-dimensional (1D) SYNTHESIS methodology has been widely used to calculate fast neutron (>1.0 MeV) fluence exposure to reactor pressure vessel in the belt-line region. However, it is expected that this methodology cannot provide accurate fast neutron fluence calculation at elevations far above or below the active core region. A three-dimensional (3D) parallel discrete ordinates calculation for ex-vessel neutron dosimetry on a Westinghouse 4-Loop XL Pressurized Water Reactor has been done. It shows good agreement between the calculated results and measured results. Furthermore, the results show very different fast neutron flux values at some of the former plate locations and elevations above and below an active core than those calculated by a 2D/1D SYNTHESIS method. This indicates that for certain irregular reactor internal structures, where the fast neutron flux has a very strong local effect, it is required to use a 3D transport method to calculate accurate fast neutron exposure. (authors)
Ghosh, J.; Harrison, C.G.
1990-01-01
The present conference discusses topics in the fields of VLSI-based and real-time image-processing systems, parallel architectures for image processing, image-processing algorithms, and image processing on the basis of artificial neural networks. Attention is given to a fixed-point VLSI architecture for high-speed image reconstruction, an orthogonal multiprocessor for image processing with neural networks, massively parallel processors in real-time applications, the use of the adiabatic approximation as a tool in image estimation, parallel algorithms for contour-extraction and coding, and a parallel architecture for multidimensional image processing. Also discussed are concurrent image-processing on hypercube multicomputers, neural-network simulation on a reduced-mesh-of-trees organization, and a goal-seeking neural net for recall and recognition.
Coset Codes Viewed as Terminated Convolutional Codes
NASA Technical Reports Server (NTRS)
Fossorier, Marc P. C.; Lin, Shu
1996-01-01
In this paper, coset codes are considered as terminated convolutional codes. Based on this approach, three new general results are presented. First, it is shown that the iterative squaring construction can equivalently be defined from a convolutional code whose trellis terminates. This convolutional code determines a simple encoder for the coset code considered, and the state and branch labelings of the associated trellis diagram become straightforward. Also, from the generator matrix of the code in its convolutional code form, much information about the trade-off between the state connectivity and complexity at each section, and the parallel structure of the trellis, is directly available. Based on this generator matrix, it is shown that the parallel branches in the trellis diagram of the convolutional code represent the same coset code C(sub 1), of smaller dimension and shorter length. Utilizing this fact, a two-stage optimum trellis decoding method is devised. The first stage decodes C(sub 1), while the second stage decodes the associated convolutional code, using the branch metrics delivered by stage 1. Finally, a bidirectional decoding of each received block starting at both ends is presented. If about the same number of computations is required, this approach remains very attractive from a practical point of view as it roughly doubles the decoding speed. This fact is particularly interesting whenever the second half of the trellis is the mirror image of the first half, since the same decoder can be implemented for both parts.
Electrical Circuit Simulation Code
Wix, Steven D.; Waters, Arlon J.; Shirley, David
2001-08-09
Massively-Parallel Electrical Circuit Simulation Code. CHILESPICE is a massively-arallel distributed-memory electrical circuit simulation tool that contains many enhanced radiation, time-based, and thermal features and models. Large scale electronic circuit simulation. Shared memory, parallel processing, enhance convergence. Sandia specific device models.
Foster, I.; Tuecke, S.
1991-12-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. In includes both tutorial and reference material. It also presents the basic concepts that underly PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (c.f. Appendix A).
Foster, I.; Tuecke, S.
1991-09-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, a set of tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory at info.mcs.anl.gov.
Start/Pat; A parallel-programming toolkit
Appelbe, B.; Smith, K. ); McDowell, C. )
1989-07-01
How can you make Fortran code parallel without isolating the programmer from learning to understand and exploit parallelism effectively. With an interactive toolkit that automates parallelization as it educates. This paper discusses the Start/Pat toolkit.
Simple, parallel virtual machines for extreme computations
NASA Astrophysics Data System (ADS)
Chokoufe Nejad, Bijan; Ohl, Thorsten; Reuter, Jürgen
2015-11-01
We introduce a virtual machine (VM) written in a numerically fast language like Fortran or C for evaluating very large expressions. We discuss the general concept of how to perform computations in terms of a VM and present specifically a VM that is able to compute tree-level cross sections for any number of external legs, given the corresponding byte-code from the optimal matrix element generator, O'MEGA. Furthermore, this approach allows to formulate the parallel computation of a single phase space point in a simple and obvious way. We analyze hereby the scaling behavior with multiple threads as well as the benefits and drawbacks that are introduced with this method. Our implementation of a VM can run faster than the corresponding native, compiled code for certain processes and compilers, especially for very high multiplicities, and has in general runtimes in the same order of magnitude. By avoiding the tedious compile and link steps, which may fail for source code files of gigabyte sizes, new processes or complex higher order corrections that are currently out of reach could be evaluated with a VM given enough computing power.
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
Multigrid on massively parallel architectures
Falgout, R D; Jones, J E
1999-09-17
The scalable implementation of multigrid methods for machines with several thousands of processors is investigated. Parallel performance models are presented for three different structured-grid multigrid algorithms, and a description is given of how these models can be used to guide implementation. Potential pitfalls are illustrated when moving from moderate-sized parallelism to large-scale parallelism, and results are given from existing multigrid codes to support the discussion. Finally, the use of mixed programming models is investigated for multigrid codes on clusters of SMPs.
The language parallel Pascal and other aspects of the massively parallel processor
NASA Technical Reports Server (NTRS)
Reeves, A. P.; Bruner, J. D.
1982-01-01
A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
Parallel Symmetric Eigenvalue Problem Solvers
2015-05-01
graduate school. Debugging somebody else’s MPI code is an immensely frustrating experience, but he would regularly stay late at the oce to assist me...cessfully. In addition, I will describe the parallel kernels required by my code . 5 The next sections will describe my Fortran-based implementations of...Sandia’s publicly available Trace- Min code . Each of the methods has its own unique advantages and disadvantages, summarized in table 3.1. In short, I
Parallel Processing of a Groundwater Contaminant Code
Arnett, Ronald Chester; Greenwade, Lance Eric
2000-05-01
The U. S. Department of Energy’s Idaho National Engineering and Environmental Laboratory (INEEL) is conducting a field test of experimental enhanced bioremediation of trichoroethylene (TCE) contaminated groundwater. TCE is a chlorinated organic substance that was used as a solvent in the early years of the INEEL and disposed in some cases to the aquifer. There is an effort underway to enhance the natural bioremediation of TCE by adding a non-toxic substance that serves as a feed material for the bacteria that can biologically degrade the TCE.
On finding minimum-diameter clique trees
Blair, J.R.S. . Dept. of Computer Science); Peyton, B.W. )
1991-08-01
It is well-known that any chordal graph can be represented as a clique tree (acyclic hypergraph, join tree). Since some chordal graphs have many distinct clique tree representations, it is interesting to consider which one is most desirable under various circumstances. A clique tree of minimum diameter (or height) is sometimes a natural candidate when choosing clique trees to be processed in a parallel computing environment. This paper introduces a linear time algorithm for computing a minimum-diameter clique tree. The new algorithm is an analogue of the natural greedy algorithm for rooting an ordinary tree in order to minimize its height. It has potential application in the development of parallel algorithms for both knowledge-based systems and the solution of sparse linear systems of equations. 31 refs., 7 figs.
Parallel machines: Parallel machine languages
Iannucci, R.A. )
1990-01-01
This book presents a framework for understanding the tradeoffs between the conventional view and the dataflow view with the objective of discovering the critical hardware structures which must be present in any scalable, general-purpose parallel computer to effectively tolerate latency and synchronization costs. The author presents an approach to scalable general purpose parallel computation. Linguistic Concerns, Compiling Issues, Intermediate Language Issues, and hardware/technological constraints are presented as a combined approach to architectural Develoement. This book presents the notion of a parallel machine language.
NASA Technical Reports Server (NTRS)
Martensen, Anna L.; Butler, Ricky W.
1987-01-01
The Fault Tree Compiler Program is a new reliability tool used to predict the top event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N gates. The high level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precise (within the limits of double precision floating point arithmetic) to the five digits in the answer. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Corporation VAX with the VMS operation system.
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.
Joseph, D.D.; Bai, R.; Liao, T.Y.; Huang, A.; Hu, H.H.
1995-09-01
In this paper the authors introduce the idea of parallel pipelining for water lubricated transportation of oil (or other viscous material). A parallel system can have major advantages over a single pipe with respect to the cost of maintenance and continuous operation of the system, to the pressure gradients required to restart a stopped system and to the reduction and even elimination of the fouling of pipe walls in continuous operation. The authors show that the action of capillarity in small pipes is more favorable for restart than in large pipes. In a parallel pipeline system, they estimate the number of small pipes needed to deliver the same oil flux as in one larger pipe as N = (R/r){sup {alpha}}, where r and R are the radii of the small and large pipes, respectively, and {alpha} = 4 or 19/7 when the lubricating water flow is laminar or turbulent.
Morozov, Dmitriy; Weber, Gunther
2013-01-08
Improved simulations and sensors are producing datasets whose increasing complexity exhausts our ability to visualize and comprehend them directly. To cope with this problem, we can detect and extract significant features in the data and use them as the basis for subsequent analysis. Topological methods are valuable in this context because they provide robust and general feature definitions. As the growth of serial computational power has stalled, data analysis is becoming increasingly dependent on massively parallel machines. To satisfy the computational demand created by complex datasets, algorithms need to effectively utilize these computer architectures. The main strength of topological methods, their emphasis on global information, turns into an obstacle during parallelization. We present two approaches to alleviate this problem. We develop a distributed representation of the merge tree that avoids computing the global tree on a single processor and lets us parallelize subsequent queries. To account for the increasing number of cores per processor, we develop a new data structure that lets us take advantage of multiple shared-memory cores to parallelize the work on a single node. Finally, we present experiments that illustrate the strengths of our approach as well as help identify future challenges.
Fault trees and sequence dependencies
NASA Technical Reports Server (NTRS)
Dugan, Joanne Bechta; Boyd, Mark A.; Bavuso, Salvatore J.
1990-01-01
One of the frequently cited shortcomings of fault-tree models, their inability to model so-called sequence dependencies, is discussed. Several sources of such sequence dependencies are discussed, and new fault-tree gates to capture this behavior are defined. These complex behaviors can be included in present fault-tree models because they utilize a Markov solution. The utility of the new gates is demonstrated by presenting several models of the fault-tolerant parallel processor, which include both hot and cold spares.
Parallel Power Grid Simulation Toolkit
Smith, Steve; Kelley, Brian; Banks, Lawrence; Top, Philip; Woodward, Carol
2015-09-14
ParGrid is a 'wrapper' that integrates a coupled Power Grid Simulation toolkit consisting of a library to manage the synchronization and communication of independent simulations. The included library code in ParGid, named FSKIT, is intended to support the coupling multiple continuous and discrete even parallel simulations. The code is designed using modern object oriented C++ methods utilizing C++11 and current Boost libraries to ensure compatibility with multiple operating systems and environments.
Parallelized modelling and solution scheme for hierarchically scaled simulations
NASA Technical Reports Server (NTRS)
Padovan, Joe
1995-01-01
This two-part paper presents the results of a benchmarked analytical-numerical investigation into the operational characteristics of a unified parallel processing strategy for implicit fluid mechanics formulations. This hierarchical poly tree (HPT) strategy is based on multilevel substructural decomposition. The Tree morphology is chosen to minimize memory, communications and computational effort. The methodology is general enough to apply to existing finite difference (FD), finite element (FEM), finite volume (FV) or spectral element (SE) based computer programs without an extensive rewrite of code. In addition to finding large reductions in memory, communications, and computational effort associated with a parallel computing environment, substantial reductions are generated in the sequential mode of application. Such improvements grow with increasing problem size. Along with a theoretical development of general 2-D and 3-D HPT, several techniques for expanding the problem size that the current generation of computers are capable of solving, are presented and discussed. Among these techniques are several interpolative reduction methods. It was found that by combining several of these techniques that a relatively small interpolative reduction resulted in substantial performance gains. Several other unique features/benefits are discussed in this paper. Along with Part 1's theoretical development, Part 2 presents a numerical approach to the HPT along with four prototype CFD applications. These demonstrate the potential of the HPT strategy.
Parallel contingency statistics with Titan.
Thompson, David C.; Pebay, Philippe Pierre
2009-09-01
This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09] which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevent this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and study performance with up to 200 processors.
2014-12-01
density parity check (LDPC) code, a Reed–Solomon code, and three convolutional codes. iii CONTENTS EXECUTIVE SUMMARY...the most common. Many civilian systems use low density parity check (LDPC) FEC codes, and the Navy is planning to use LDPC for some future systems...other forward error correction methods: a turbo code, a low density parity check (LDPC) code, a Reed–Solomon code, and three convolutional codes
Foster, I.; Tuecke, S.
1993-01-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and Cthat allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs. ani.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.
ERIC Educational Resources Information Center
Tolman, Marvin
2005-01-01
Students love outdoor activities and will love them even more when they build confidence in their tree identification and measurement skills. Through these activities, students will learn to identify the major characteristics of trees and discover how the pace--a nonstandard measuring unit--can be used to estimate not only distances but also the…
Parallel adaptive wavelet collocation method for PDEs
Nejadmalayeri, Alireza; Vezolainen, Alexei; Brown-Dymkoski, Eric; Vasilyev, Oleg V.
2015-10-01
A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allows fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.
Parallel adaptive wavelet collocation method for PDEs
NASA Astrophysics Data System (ADS)
Nejadmalayeri, Alireza; Vezolainen, Alexei; Brown-Dymkoski, Eric; Vasilyev, Oleg V.
2015-10-01
A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allows fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 20483 using as many as 2048 CPU cores.
DeHart, Mark D; Williams, Mark L; Bowman, Stephen M
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
Clinical coding. Code breakers.
Mathieson, Steve
2005-02-24
--The advent of payment by results has seen the role of the clinical coder pushed to the fore in England. --Examinations for a clinical coding qualification began in 1999. In 2004, approximately 200 people took the qualification. --Trusts are attracting people to the role by offering training from scratch or through modern apprenticeships.
Treating a User-Defined Parallel Library as a Domain-Specific Language
Quinlan, D; Miller, B; Schordan, M; Philip, B
2001-11-19
An important purpose of a programming language is to insulate the programmer from low level details and provide a high enough level of abstraction to be productive and develop reasonably portable application codes. For these reasons scientific programming is longer done using assembly language. But high performance of scientific applications often requires that critical sections of code be expressed at a particularly low level to avoid inefficiencies introduced by the comiler (function call overhead, poor cache use, etc.). The use of high-level abstractions exaserbates this problem since the compiler is often unable to generate the equivalent low-level code required for good performance. The result is often significantly degraded performance. Libraries provide a way for domain specific knowledge to be developed for large numbers of users. Libraries thus simplify the development of many application codes and the work spent building libraries can be amortized across large numbers of applications and application developers. Such a hierarchy puts languages and compilers at the root of tree of abstractsions developed within numerous libraries at one level and numerous applications at a second level. Libraries provide a way to define high-level abstractions. We have developed specific libraries to simplify the development of serial and parallel scientific applications. The A++/P++ library provide an essential array abstraction for C++ scientific applications. The effect is to provide a single array abstraction that permits the development of serial code (using A++). The serial application code using the array abstractions need only be recompiled (using P++) to run on parallel distributed memory machines. The resulting abstractions are simple and powerful since it simplifies serial application code and even completely hides parallel details. But since it operates as a library the compiler is oblivious to its semantics and likewise the library is oblivious to the context
Computer-Aided Parallelizer and Optimizer
NASA Technical Reports Server (NTRS)
Jin, Haoqiang
2011-01-01
The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
REBOUND: an open-source multi-purpose N-body code for collisional dynamics
NASA Astrophysics Data System (ADS)
Rein, H.; Liu, S.-F.
2012-01-01
REBOUND is a new multi-purpose N-body code which is freely available under an open-source license. It was designed for collisional dynamics such as planetary rings but can also solve the classical N-body problem. It is highly modular and can be customized easily to work on a wide variety of different problems in astrophysics and beyond. REBOUND comes with three symplectic integrators: leap-frog, the symplectic epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open, periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut tree to calculate both self-gravity and collisions. These modules are fully parallelized with MPI as well as OpenMP. The former makes use of a static domain decomposition and a distributed essential tree. Two new collision detection modules based on a plane-sweep algorithm are also implemented. The performance of the plane-sweep algorithm is superior to a tree code for simulations in which one dimension is much longer than the other two and in simulations which are quasi-two dimensional with less than one million particles. In this work, we discuss the different algorithms implemented in REBOUND, the philosophy behind the code's structure as well as implementation specific details of the different modules. We present results of accuracy and scaling tests which show that the code can run efficiently on both desktop machines and large computing clusters.
Parallel evolution of KCNQ4 in echolocating bats.
Liu, Zhen; Li, Shude; Wang, Wei; Xu, Dongming; Murphy, Robert W; Shi, Peng
2011-01-01
High-frequency hearing is required for echolocating bats to locate, range and identify objects, yet little is known about its molecular basis. The discovery of a high-frequency hearing-related gene, KCNQ4, provides an opportunity to address this question. Here, we obtain the coding regions of KCNQ4 from 15 species of bats, including echolocating bats that have higher frequency hearing and non-echolocating bats that have the same ability as most other species of mammals. The strongly supported protein-tree resolves a monophyletic group containing all bats with higher frequency hearing and this arrangement conflicts with the phylogeny of bats in which these species are paraphyletic. We identify five parallel evolved sites in echolocating bats belonging to both suborders. The evolutionary trajectories of the parallel sites suggest the independent gain of higher frequency hearing ability in echolocating bats. This study highlights the usefulness of convergent or parallel evolutionary studies for finding phenotype-related genes and contributing to the resolution of evolutionary problems.
Efficiency of parallel direct optimization
NASA Technical Reports Server (NTRS)
Janies, D. A.; Wheeler, W. C.
2001-01-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.
Efficiency of parallel direct optimization.
Janies, D A; Wheeler, W C
2001-03-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size.
Parallel programming of industrial applications
Heroux, M; Koniges, A; Simon, H
1998-07-21
In the introductory material, we overview the typical MPP environment for real application computing and the special tools available such as parallel debuggers and performance analyzers. Next, we draw from a series of real applications codes and discuss the specific challenges and problems that are encountered in parallelizing these individual applications. The application areas drawn from include biomedical sciences, materials processing and design, plasma and fluid dynamics, and others. We show how it was possible to get a particular application to run efficiently and what steps were necessary. Finally we end with a summary of the lessons learned from these applications and predictions for the future of industrial parallel computing. This tutorial is based on material from a forthcoming book entitled: "Industrial Strength Parallel Computing" to be published by Morgan Kaufmann Publishers (ISBN l-55860-54).
Hybrid parallel programming with MPI and Unified Parallel C.
Dinan, J.; Balaji, P.; Lusk, E.; Sadayappan, P.; Thakur, R.; Mathematics and Computer Science; The Ohio State Univ.
2010-01-01
The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33 and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.
Automatic Multilevel Parallelization Using OpenMP
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)
2002-01-01
In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.
Parallel multiscale simulations of a brain aneurysm
NASA Astrophysics Data System (ADS)
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver NɛκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NɛκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future
Parallel multiscale simulations of a brain aneurysm
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Parallel multiscale simulations of a brain aneurysm.
Grinberg, Leopold; Fedosov, Dmitry A; Karniadakis, George Em
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multi-scale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier-Stokes solver εκαr . The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers ( εκαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in future
Phonological coding during reading
Leinenger, Mallorie
2014-01-01
The exact role that phonological coding (the recoding of written, orthographic information into a sound based code) plays during silent reading has been extensively studied for more than a century. Despite the large body of research surrounding the topic, varying theories as to the time course and function of this recoding still exist. The present review synthesizes this body of research, addressing the topics of time course and function in tandem. The varying theories surrounding the function of phonological coding (e.g., that phonological codes aid lexical access, that phonological codes aid comprehension and bolster short-term memory, or that phonological codes are largely epiphenomenal in skilled readers) are first outlined, and the time courses that each maps onto (e.g., that phonological codes come online early (pre-lexical) or that phonological codes come online late (post-lexical)) are discussed. Next the research relevant to each of these proposed functions is reviewed, discussing the varying methodologies that have been used to investigate phonological coding (e.g., response time methods, reading while eyetracking or recording EEG and MEG, concurrent articulation) and highlighting the advantages and limitations of each with respect to the study of phonological coding. In response to the view that phonological coding is largely epiphenomenal in skilled readers, research on the use of phonological codes in prelingually, profoundly deaf readers is reviewed. Finally, implications for current models of word identification (activation-verification model (Van Order, 1987), dual-route model (e.g., Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001), parallel distributed processing model (Seidenberg & McClelland, 1989)) are discussed. PMID:25150679
Parallel NPARC: Implementation and Performance
NASA Technical Reports Server (NTRS)
Townsend, S. E.
1996-01-01
Version 3 of the NPARC Navier-Stokes code includes support for large-grain (block level) parallelism using explicit message passing between a heterogeneous collection of computers. This capability has the potential for significant performance gains, depending upon the block data distribution. The parallel implementation uses a master/worker arrangement of processes. The master process assigns blocks to workers, controls worker actions, and provides remote file access for the workers. The processes communicate via explicit message passing using an interface library which provides portability to a number of message passing libraries, such as PVM (Parallel Virtual Machine). A Bourne shell script is used to simplify the task of selecting hosts, starting processes, retrieving remote files, and terminating a computation. This script also provides a simple form of fault tolerance. An analysis of the computational performance of NPARC is presented, using data sets from an F/A-18 inlet study and a Rocket Based Combined Cycle Engine analysis. Parallel speedup and overall computational efficiency were obtained for various NPARC run parameters on a cluster of IBM RS6000 workstations. The data show that although NPARC performance compares favorably with the estimated potential parallelism, typical data sets used with previous versions of NPARC will often need to be reblocked for optimum parallel performance. In one of the cases studied, reblocking increased peak parallel speedup from 3.2 to 11.8.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1997-01-01
The multiblock reacting Navier-Stokes flow solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Parallelization of the Implicit RPLUS Algorithm
NASA Technical Reports Server (NTRS)
Orkwis, Paul D.
1994-01-01
The multiblock reacting Navier-Stokes flow-solver RPLUS2D was modified for parallel implementation. Results for non-reacting flow calculations of this code indicate parallelization efficiencies greater than 84% are possible for a typical test problem. Results tend to improve as the size of the problem increases. The convergence rate of the scheme is degraded slightly when additional artificial block boundaries are included for the purpose of parallelization. However, this degradation virtually disappears if the solution is converged near to machine zero. Recommendations are made for further code improvements to increase efficiency, correct bugs in the original version, and study decomposition effectiveness.
Embedded foveation image coding.
Wang, Z; Bovik, A C
2001-01-01
The human visual system (HVS) is highly space-variant in sampling, coding, processing, and understanding. The spatial resolution of the HVS is highest around the point of fixation (foveation point) and decreases rapidly with increasing eccentricity. By taking advantage of this fact, it is possible to remove considerable high-frequency information redundancy from the peripheral regions and still reconstruct a perceptually good quality image. Great success has been obtained previously by a class of embedded wavelet image coding algorithms, such as the embedded zerotree wavelet (EZW) and the set partitioning in hierarchical trees (SPIHT) algorithms. Embedded wavelet coding not only provides very good compression performance, but also has the property that the bitstream can be truncated at any point and still be decoded to recreate a reasonably good quality image. In this paper, we propose an embedded foveation image coding (EFIC) algorithm, which orders the encoded bitstream to optimize foveated visual quality at arbitrary bit-rates. A foveation-based image quality metric, namely, foveated wavelet image quality index (FWQI), plays an important role in the EFIC system. We also developed a modified SPIHT algorithm to improve the coding efficiency. Experiments show that EFIC integrates foveation filtering with foveated image coding and demonstrates very good coding performance and scalability in terms of foveated image quality measurement.
2011-01-01
Background There are two components to the clinical efficacy of pediculicides: (i) efficacy against the crawling-stages (lousicidal efficacy); and (ii) efficacy against the eggs (ovicidal efficacy). Lousicidal efficacy and ovicidal efficacy are confounded in clinical trials. Here we report on a trial that was specially designed to rank the clinical ovicidal efficacy of pediculicides. Eggs were collected, pre-treatment and post-treatment, from subjects with different types of hair, different coloured hair and hair of different length. Method Subjects with at least 20 live eggs of Pediculus capitis (head lice) were randomised to one of three treatment-groups: a melaleuca oil (commonly called tea tree oil) and lavender oil pediculicide (TTO/LO); a eucalyptus oil and lemon tea tree oil pediculicide (EO/LTTO); or a "suffocation" pediculicide. Pre-treatment: 10 to 22 live eggs were taken from the head by cutting the single hair with the live egg attached, before the treatment (total of 1,062 eggs). Treatment: The subjects then received a single treatment of one of the three pediculicides, according to the manufacturers' instructions. Post-treatment: 10 to 41 treated live eggs were taken from the head by cutting the single hair with the egg attached (total of 1,183 eggs). Eggs were incubated for 14 days. The proportion of eggs that had hatched after 14 days in the pre-treatment group was compared with the proportion of eggs that hatched in the post-treatment group. The primary outcome measure was % ovicidal efficacy for each of the three pediculicides. Results 722 subjects were examined for the presence of eggs of head lice. 92 of these subjects were recruited and randomly assigned to: the "suffocation" pediculicide (n = 31); the melaleuca oil and lavender oil pediculicide (n = 31); and the eucalyptus oil and lemon tea tree oil pediculicide (n = 30 subjects). The group treated with eucalyptus oil and lemon tea tree oil had an ovicidal efficacy of 3.3% (SD 16%) whereas the
Springer, Mark S; Gatesy, John
2016-01-01
Higher-level relationships among placental mammals are mostly resolved, but several polytomies remain contentious. Song et al. (2012) claimed to have resolved three of these using shortcut coalescence methods (MP-EST, STAR) and further concluded that these methods, which assume no within-locus recombination, are required to unravel deep-level phylogenetic problems that have stymied concatenation. Here, we reanalyze Song et al.'s (2012) data and leverage these re-analyses to explore key issues in systematics including the recombination ratchet, gene tree stoichiometry, the proportion of gene tree incongruence that results from deep coalescence versus other factors, and simulations that compare the performance of coalescence and concatenation methods in species tree estimation. Song et al. (2012) reported an average locus length of 3.1 kb for the 447 protein-coding genes in their phylogenomic dataset, but the true mean length of these loci (start codon to stop codon) is 139.6 kb. Empirical estimates of recombination breakpoints in primates, coupled with consideration of the recombination ratchet, suggest that individual coalescence genes (c-genes) approach ∼12 bp or less for Song et al.'s (2012) dataset, three to four orders of magnitude shorter than the c-genes reported by these authors. This result has general implications for the application of coalescence methods in species tree estimation. We contend that it is illogical to apply coalescence methods to complete protein-coding sequences. Such analyses amalgamate c-genes with different evolutionary histories (i.e., exons separated by >100,000 bp), distort true gene tree stoichiometry that is required for accurate species tree inference, and contradict the central rationale for applying coalescence methods to difficult phylogenetic problems. In addition, Song et al.'s (2012) dataset of 447 genes includes 21 loci with switched taxonomic names, eight duplicated loci, 26 loci with non-homologous sequences that are
Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study
Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...
2015-01-01
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were donemore » using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.« less
Code Disentanglement: Initial Plan
Wohlbier, John Greaton; Kelley, Timothy M.; Rockefeller, Gabriel M.; Calef, Matthew Thomas
2015-01-27
The first step to making more ambitious changes in the EAP code base is to disentangle the code into a set of independent, levelized packages. We define a package as a collection of code, most often across a set of files, that provides a defined set of functionality; a package a) can be built and tested as an entity and b) fits within an overall levelization design. Each package contributes one or more libraries, or an application that uses the other libraries. A package set is levelized if the relationships between packages form a directed, acyclic graph and each package uses only packages at lower levels of the diagram (in Fortran this relationship is often describable by the use relationship between modules). Independent packages permit independent- and therefore parallel|development. The packages form separable units for the purposes of development and testing. This is a proven path for enabling finer-grained changes to a complex code.
ERIC Educational Resources Information Center
National Audubon Society, New York, NY.
Included are an illustrated student reader, "The Story of Trees," a leaders' guide, and a large tree chart with 37 colored pictures. The student reader reviews several aspects of trees: a definition of a tree; where and how trees grow; flowers, pollination and seed production; how trees make their food; how to recognize trees; seasonal changes;…
Parallel Implicit Algorithms for CFD
NASA Technical Reports Server (NTRS)
Keyes, David E.
1998-01-01
The main goal of this project was efficient distributed parallel and workstation cluster implementations of Newton-Krylov-Schwarz (NKS) solvers for implicit Computational Fluid Dynamics (CFD.) "Newton" refers to a quadratically convergent nonlinear iteration using gradient information based on the true residual, "Krylov" to an inner linear iteration that accesses the Jacobian matrix only through highly parallelizable sparse matrix-vector products, and "Schwarz" to a domain decomposition form of preconditioning the inner Krylov iterations with primarily neighbor-only exchange of data between the processors. Prior experience has established that Newton-Krylov methods are competitive solvers in the CFD context and that Krylov-Schwarz methods port well to distributed memory computers. The combination of the techniques into Newton-Krylov-Schwarz was implemented on 2D and 3D unstructured Euler codes on the parallel testbeds that used to be at LaRC and on several other parallel computers operated by other agencies or made available by the vendors. Early implementations were made directly in Massively Parallel Integration (MPI) with parallel solvers we adapted from legacy NASA codes and enhanced for full NKS functionality. Later implementations were made in the framework of the PETSC library from Argonne National Laboratory, which now includes pseudo-transient continuation Newton-Krylov-Schwarz solver capability (as a result of demands we made upon PETSC during our early porting experiences). A secondary project pursued with funding from this contract was parallel implicit solvers in acoustics, specifically in the Helmholtz formulation. A 2D acoustic inverse problem has been solved in parallel within the PETSC framework.
Badger, P.C.
1995-12-31
Short rotation intensive culture tree plantations have been a major part of biomass energy concepts since the beginning. One aspect receiving less attention than it deserves is harvesting. This article describes an method of harvesting somewhere between agricultural mowing machines and huge feller-bunchers of the pulpwood and lumber industries.
An interactive programme for weighted Steiner trees
NASA Astrophysics Data System (ADS)
Zanchetta do Nascimento, Marcelo; Ramos Batista, Valério; Raffa Coimbra, Wendhel
2015-01-01
We introduce a fully written programmed code with a supervised method for generating weighted Steiner trees. Our choice of the programming language, and the use of well- known theorems from Geometry and Complex Analysis, allowed this method to be implemented with only 764 lines of effective source code. This eases the understanding and the handling of this beta version for future developments.
GAMER: A GRAPHIC PROCESSING UNIT ACCELERATED ADAPTIVE-MESH-REFINEMENT CODE FOR ASTROPHYSICS
Schive, H.-Y.; Tsai, Y.-C.; Chiueh Tzihong
2010-02-01
We present the newly developed code, GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), which adopts a novel approach in improving the performance of adaptive-mesh-refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing total variation diminishing scheme for the hydrodynamic solver and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between the CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is diminished by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using one GPU with 4096{sup 3} effective resolution and 16 GPUs with 8192{sup 3} effective resolution, respectively.
Low Density Parity Check Codes: Bandwidth Efficient Channel Coding
NASA Technical Reports Server (NTRS)
Fong, Wai; Lin, Shu; Maki, Gary; Yeh, Pen-Shu
2003-01-01
Low Density Parity Check (LDPC) Codes provide near-Shannon Capacity performance for NASA Missions. These codes have high coding rates R=0.82 and 0.875 with moderate code lengths, n=4096 and 8176. Their decoders have inherently parallel structures which allows for high-speed implementation. Two codes based on Euclidean Geometry (EG) were selected for flight ASIC implementation. These codes are cyclic and quasi-cyclic in nature and therefore have a simple encoder structure. This results in power and size benefits. These codes also have a large minimum distance as much as d,,, = 65 giving them powerful error correcting capabilities and error floors less than lo- BER. This paper will present development of the LDPC flight encoder and decoder, its applications and status.
Parallel auto-correlative statistics with VTK.
Pebay, Philippe Pierre; Bennett, Janine Camille
2013-08-01
This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the autocorrelative statistics engine.
Parallel pivoting combined with parallel reduction
NASA Technical Reports Server (NTRS)
Alaghband, Gita
1987-01-01
Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.
Locating hardware faults in a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-04-13
Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
Global tree network for computing structures enabling global processing operations
Blumrich; Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.
2010-01-19
A system and method for enabling high-speed, low-latency global tree network communications among processing nodes interconnected according to a tree network structure. The global tree network enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the tree via links to facilitate performance of low-latency global processing operations at nodes of the virtual tree and sub-tree structures. The global operations performed include one or more of: broadcast operations downstream from a root node to leaf nodes of a virtual tree, reduction operations upstream from leaf nodes to the root node in the virtual tree, and point-to-point message passing from any node to the root node. The global tree network is configurable to provide global barrier and interrupt functionality in asynchronous or synchronized manner, and, is physically and logically partitionable.
Resnik, Barry I
2009-01-01
It is ethical, legal, and proper for a dermatologist to maximize income through proper coding of patient encounters and procedures. The overzealous physician can misinterpret reimbursement requirements or receive bad advice from other physicians and cross the line from aggressive coding to coding fraud. Several of the more common problem areas are discussed.
The Xyce Parallel Electronic Simulator - An Overview
HUTCHINSON,SCOTT A.; KEITER,ERIC R.; HOEKSTRA,ROBERT J.; WATTS,HERMAN A.; WATERS,ARLON J.; SCHELLS,REGINA L.; WIX,STEVEN D.
2000-12-08
The Xyce{trademark} Parallel Electronic Simulator has been written to support the simulation needs of the Sandia National Laboratories electrical designers. As such, the development has focused on providing the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). In addition, they are providing improved performance for numerical kernels using state-of-the-art algorithms, support for modeling circuit phenomena at a variety of abstraction levels and using object-oriented and modern coding-practices that ensure the code will be maintainable and extensible far into the future. The code is a parallel code in the most general sense of the phrase--a message passing parallel implementation--which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Furthermore, careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved even as the number of processors grows.
Functional Data Analysis of Tree Data Objects.
Shen, Dan; Shen, Haipeng; Bhamidi, Shankar; Maldonado, Yolanda Muñoz; Kim, Yongdai; Marron, J S
2014-01-01
Data analysis on non-Euclidean spaces, such as tree spaces, can be challenging. The main contribution of this paper is establishment of a connection between tree data spaces and the well developed area of Functional Data Analysis (FDA), where the data objects are curves. This connection comes through two tree representation approaches, the Dyck path representation and the branch length representation. These representations of trees in Euclidean spaces enable us to exploit the power of FDA to explore statistical properties of tree data objects. A major challenge in the analysis is the sparsity of tree branches in a sample of trees. We overcome this issue by using a tree pruning technique that focuses the analysis on important underlying population structures. This method parallels scale-space analysis in the sense that it reveals statistical properties of tree structured data over a range of scales. The effectiveness of these new approaches is demonstrated by some novel results obtained in the analysis of brain artery trees. The scale space analysis reveals a deeper relationship between structure and age. These methods are the first to find a statistically significant gender difference.
ERIC Educational Resources Information Center
Jenkins, Peter
Tree climbing offers a safe, inexpensive adventure sport that can be performed almost anywhere. Using standard procedures practiced in tree surgery or rock climbing, almost any tree can be climbed. Tree climbing provides challenge and adventure as well as a vigorous upper-body workout. Tree Climbers International classifies trees using a system…
Burda, Z; Erdmann, J; Petersson, B; Wattenberg, M
2003-02-01
We discuss the scaling properties of free branched polymers. The scaling behavior of the model is classified by the Hausdorff dimensions for the internal geometry, d(L) and d(H), and for the external one, D(L) and D(H). The dimensions d(H) and D(H) characterize the behavior for long distances, while d(L) and D(L) for short distances. We show that the internal Hausdorff dimension is d(L)=2 for generic and scale-free trees, contrary to d(H), which is known to be equal to 2 for generic trees and to vary between 2 and infinity for scale-free trees. We show that the external Hausdorff dimension D(H) is directly related to the internal one as D(H)=alphad(H), where alpha is the stability index of the embedding weights for the nearest-vertex interactions. The index is alpha=2 for weights from the Gaussian domain of attraction and 0
Special parallel processing workshop
1994-12-01
This report contains viewgraphs from the Special Parallel Processing Workshop. These viewgraphs deal with topics such as parallel processing performance, message passing, queue structure, and other basic concept detailing with parallel processing.
Utilizing GPUs to Accelerate Turbomachinery CFD Codes
NASA Technical Reports Server (NTRS)
MacCalla, Weylin; Kulkarni, Sameer
2016-01-01
GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. Rewriting and restructuring of the source code was avoided to limit the introduction of new bugs. The code was profiled and investigated for parallelization potential, then OpenACC directives were used to indicate parallel parts of the code. The use of OpenACC directives was not able to reduce the runtime of APNASA on either the NVIDIA Tesla discrete graphics card, or the AMD accelerated processing unit. Additionally, it was found that in order to justify the use of GPGPU, the amount of parallel work being done within a kernel would have to greatly exceed the work being done by any one portion of the APNASA code. It was determined that in order for an application like APNASA to be accelerated on the GPU, it should not be modular in nature, and the parallel portions of the code must contain a large portion of the code's computation time.
Parallel Eclipse Project Checkout
NASA Technical Reports Server (NTRS)
Crockett, Thomas M.; Joswig, Joseph C.; Shams, Khawaja S.; Powell, Mark W.; Bachmann, Andrew G.
2011-01-01
Parallel Eclipse Project Checkout (PEPC) is a program written to leverage parallelism and to automate the checkout process of plug-ins created in Eclipse RCP (Rich Client Platform). Eclipse plug-ins can be aggregated in a feature project. This innovation digests a feature description (xml file) and automatically checks out all of the plug-ins listed in the feature. This resolves the issue of manually checking out each plug-in required to work on the project. To minimize the amount of time necessary to checkout the plug-ins, this program makes the plug-in checkouts parallel. After parsing the feature, a request to checkout for each plug-in in the feature has been inserted. These requests are handled by a thread pool with a configurable number of threads. By checking out the plug-ins in parallel, the checkout process is streamlined before getting started on the project. For instance, projects that took 30 minutes to checkout now take less than 5 minutes. The effect is especially clear on a Mac, which has a network monitor displaying the bandwidth use. When running the client from a developer s home, the checkout process now saturates the bandwidth in order to get all the plug-ins checked out as fast as possible. For comparison, a checkout process that ranged from 8-200 Kbps from a developer s home is now able to saturate a pipe of 1.3 Mbps, resulting in significantly faster checkouts. Eclipse IDE (integrated development environment) tries to build a project as soon as it is downloaded. As part of another optimization, this innovation programmatically tells Eclipse to stop building while checkouts are happening, which dramatically reduces lock contention and enables plug-ins to continue downloading until all of them finish. Furthermore, the software re-enables automatic building, and forces Eclipse to do a clean build once it finishes checking out all of the plug-ins. This software is fully generic and does not contain any NASA-specific code. It can be applied to any
ERIC Educational Resources Information Center
NatureScope, 1986
1986-01-01
Provides: (1) background information on trees, focusing on the parts of trees and how they differ from other plants; (2) eight activities; and (3) ready-to-copy pages dealing with tree identification and tree rings. Activities include objective(s), recommended age level(s), subject area(s), list of materials needed, and procedures. (JN)
Integrated Task and Data Parallel Programming
NASA Technical Reports Server (NTRS)
Grimshaw, A. S.
1998-01-01
This research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities During the fall I collaborated
The dynamics of strangling among forest trees.
Okamoto, Kenichi W
2015-11-07
Strangler trees germinate and grow on other trees, eventually enveloping and potentially even girdling their hosts. This allows them to mitigate fitness costs otherwise incurred by germinating and competing with other trees on the forest floor, as well as minimize risks associated with host tree-fall. If stranglers can themselves host other strangler trees, they may not even seem to need non-stranglers to persist. Yet despite their high fitness potential, strangler trees neither dominate the communities in which they occur nor is the strategy particularly common outside of figs (genus Ficus). Here we analyze how dynamic interactions between strangling and non-strangling trees can shape the adaptive landscape for strangling mutants and mutant trees that have lost the ability to strangle. We find a threshold which strangler germination rates must exceed for selection to favor the evolution of strangling, regardless of how effectively hemiepiphytic stranglers may subsequently replace their hosts. This condition describes the magnitude of the phenotypic displacement in the ability to germinate on other trees necessary for invasion by a mutant tree that could potentially strangle its host following establishment as an epiphyte. We show how the relative abilities of strangling and non-strangling trees to occupy empty sites can govern whether strangling is an evolutionarily stable strategy, and obtain the conditions for strangler coexistence with non-stranglers. We then elucidate when the evolution of strangling can disrupt stable coexistence between commensal epiphytic ancestors and their non-strangling host trees. This allows us to highlight parallels between the invasion fitness of strangler trees arising from commensalist ancestors, and cases where strangling can arise in concert with the evolution of hemiepiphytism among free-standing ancestors. Finally, we discuss how our results can inform the evolutionary ecology of antagonistic interactions more generally.
An experimental APL compiler for a distributed memory parallel machine
Ching, W.M.; Katz, A.
1994-12-31
The authors developed an experimental APL compiler for the IBM SP1 distributed memory parallel machine. It accepts classical APL programs, without additional directives, and generates parallelized C code for execution on the SP1 machine. The compiler exploits data parallelism in APL programs based on parallel high level primitives. Program variables are either replicated or partitioned. They also present performance data for five moderate size programs running on the SP1.
Kubilius, Jonas
2014-01-01
Sharing code is becoming increasingly important in the wake of Open Science. In this review I describe and compare two popular code-sharing utilities, GitHub and Open Science Framework (OSF). GitHub is a mature, industry-standard tool but lacks focus towards researchers. In comparison, OSF offers a one-stop solution for researchers but a lot of functionality is still under development. I conclude by listing alternative lesser-known tools for code and materials sharing.
On implementing large binary tree architectures in VLSI and WSI
Youn, H.Y.; Singh, A.D.
1989-04-01
The complete binary tree is known to support the parallel execution of important algorithms, which has given rise to much interest in implementing such architectures in VLSI and WSI. For large trees, the classical H-tree layout approaches suffers from area inefficiency and long interconnects. Other proposed schemes are not well suited for the implementation of defect-tolerant designs. This paper presents an efficient scheme for the layout of large binary tree architectures by embedding the complete binary tree in a two-dimensional array of processing elements.
HOPSPACK: Hybrid Optimization Parallel Search Package.
Gray, Genetha Anne.; Kolda, Tamara G.; Griffin, Joshua; Taddy, Matt; Martinez-Canales, Monica L.
2008-12-01
In this paper, we describe the technical details of HOPSPACK (Hybrid Optimization Parallel SearchPackage), a new software platform which facilitates combining multiple optimization routines into asingle, tightly-coupled, hybrid algorithm that supports parallel function evaluations. The frameworkis designed such that existing optimization source code can be easily incorporated with minimalcode modification. By maintaining the integrity of each individual solver, the strengths and codesophistication of the original optimization package are retained and exploited.4
Computational electromagnetics and parallel dense matrix computations
Forsman, K.; Kettunen, L.; Gropp, W.; Levine, D.
1995-06-01
We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.
Computational electromagnetics and parallel dense matrix computations
Forsman, K.; Kettunen, L.; Gropp, W.
1995-12-01
We present computational results using CORAL, a parallel, three-dimensional, nonlinear magnetostatic code based on a volume integral equation formulation. A key feature of CORAL is the ability to solve, in parallel, the large, dense systems of linear equations that are inherent in the use of integral equation methods. Using the Chameleon and PSLES libraries ensures portability and access to the latest linear algebra solution technology.
Method of moment solutions to scattering problems in a parallel processing environment
NASA Technical Reports Server (NTRS)
Cwik, Tom; Partee, Jonathan; Patterson, Jean
1991-01-01
This paper describes the implementation of a parallelized method of moments (MOM) code into an interactive workstation environment. The workstation allows interactive solid body modeling and mesh generation, MOM analysis, and the graphical display of results. After describing the parallel computing environment, the implementation and results of parallelizing a general MOM code are presented in detail.
Interfacing Computer Aided Parallelization and Performance Analysis
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Biegel, Bryan A. (Technical Monitor)
2003-01-01
When porting sequential applications to parallel computer architectures, the program developer will typically go through several cycles of source code optimization and performance analysis. We have started a project to develop an environment where the user can jointly navigate through program structure and performance data information in order to make efficient optimization decisions. In a prototype implementation we have interfaced the CAPO computer aided parallelization tool with the Paraver performance analysis tool. We describe both tools and their interface and give an example for how the interface helps within the program development cycle of a benchmark code.
Parallel rendering techniques for massively parallel visualization
Hansen, C.; Krogh, M.; Painter, J.
1995-07-01
As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory. and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP`s abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper will describe recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, spheres, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume render use a MIMD approach. Implementations for these algorithms are presented for the Thinking Ma.chines Corporation CM-5 MPP.
PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit
NASA Technical Reports Server (NTRS)
MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles
1999-01-01
In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.
Xyce parallel electronic simulator : users' guide.
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.; Santarelli, Keith R.; Fixel, Deborah A.; Coffey, Todd Stirling; Russo, Thomas V.; Schiek, Richard Louis; Warrender, Christina E.; Keiter, Eric Richard; Pawlowski, Roger Patrick
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
ERIC Educational Resources Information Center
Smithyman, S. J.
This manual is designed to prepare students for entry-level positions as tree care professionals. Addressed in the individual chapters of the guide are the following topics: the tree service industry; clothing, eqiupment, and tools; tree workers; basic tree anatomy; techniques of pruning; procedures for climbing and working in the tree; aerial…
Status and Verification of Edge Plasma Turbulence Code BOUT
Umansky, M V; Xu, X Q; Dudson, B; LoDestro, L L; Myra, J R
2009-01-08
The BOUT code is a detailed numerical model of tokamak edge turbulence based on collisional plasma uid equations. BOUT solves for time evolution of plasma uid variables: plasma density N{sub i}, parallel ion velocity V{sub {parallel}i}, electron temperature T{sub e}, ion temperature T{sub i}, electric potential {phi}, parallel current j{sub {parallel}}, and parallel vector potential A{sub {parallel}}, in realistic 3D divertor tokamak geometry. The current status of the code, physics model, algorithms, and implementation is described. Results of verification testing are presented along with illustrative applications to tokamak edge turbulence.
On-Line Construction of Parameterized Suffix Trees
NASA Astrophysics Data System (ADS)
Lee, Taehyung; Na, Joong Chae; Park, Kunsoo
We consider on-line construction of a suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.
Performance issues for engineering analysis on MIMD parallel computers
Fang, H.E.; Vaughan, C.T.; Gardner, D.R.
1994-08-01
We discuss how engineering analysts can obtain greater computational resolution in a more timely manner from applications codes running on MIMD parallel computers. Both processor speed and memory capacity are important to achieving better performance than a serial vector supercomputer. To obtain good performance, a parallel applications code must be scalable. In addition, the aspect ratios of the subdomains in the decomposition of the simulation domain onto the parallel computer should be of order 1. We demonstrate these conclusions using simulations conducted with the PCTH shock wave physics code running on a Cray Y-MP, a 1024-node nCUBE 2, and an 1840-node Paragon.
Parallel-In-Time For Moving Meshes
Falgout, R. D.; Manteuffel, T. A.; Southworth, B.; Schroder, J. B.
2016-02-04
With steadily growing computational resources available, scientists must develop e ective ways to utilize the increased resources. High performance, highly parallel software has be- come a standard. However until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial di erential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing se- quential codes with only minor modi cations. In this work, a rezoning-type moving mesh is applied to a di usion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.
Parallelization of a Compositional Reservoir Simulator
NASA Astrophysics Data System (ADS)
Reme, Hilde; Åge Øye, Geir; Espedal, Magne S.; Fladmark, Gunnar E.
A finite volume dicretization has been used to solve compositional flow in porous media. Secondary migration in fractured rocks has been the main motivation for the work. Multipoint flux approximation has been implemented and adaptive local grid refinement, based on domain decomposition, is used at fractures and faults. The parallelization method, which is described in this paper, strongly promotes code reuse and gives a very high level of parallelization despite low implementation costs. The programming framework is also portable to other platforms or other applications. We have presented computer experiments to examine the parallel efficiency of the implemented parallel simulator with respect to scalability and speedup. Keywords: porous media, multipoint flux approximation, domain decomposition, parallelization
REBOUND: Multi-purpose N-body code for collisional dynamics
NASA Astrophysics Data System (ADS)
Rein, Hanno; Liu, Shang-Fei
2011-10-01
REBOUND is a multi-purpose N-body code which is freely available under an open-source license. It was designed for collisional dynamics such as planetary rings but can also solve the classical N-body problem. It is highly modular and can be customized easily to work on a wide variety of different problems in astrophysics and beyond. REBOUND comes with three symplectic integrators: leap-frog, the symplectic epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open, periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut tree to calculate both self-gravity and collisions. These modules are fully parallelized with MPI as well as OpenMP. The former makes use of a static domain decomposition and a distributed essential tree. Two new collision detection modules based on a plane-sweep algorithm are also implemented. The performance of the plane-sweep algorithm is superior to a tree code for simulations in which one dimension is much longer than the other two and in simulations which are quasi-two dimensional with less than one million particles.
ERIC Educational Resources Information Center
Sattath, Shmuel; Tversky, Amos
1977-01-01
Tree representations of similarity data are investigated. Hierarchical clustering is critically examined, and a more general procedure, called the additive tree, is presented. The additive tree representation is then compared to multidimensional scaling. (Author/JKS)
Parallel distributed computing using Python
NASA Astrophysics Data System (ADS)
Dalcin, Lisandro D.; Paz, Rodrigo R.; Kler, Pablo A.; Cosimo, Alejandro
2011-09-01
This work presents two software components aimed to relieve the costs of accessing high-performance parallel computing resources within a Python programming environment: MPI for Python and PETSc for Python. MPI for Python is a general-purpose Python package that provides bindings for the Message Passing Interface (MPI) standard using any back-end MPI implementation. Its facilities allow parallel Python programs to easily exploit multiple processors using the message passing paradigm. PETSc for Python provides access to the Portable, Extensible Toolkit for Scientific Computation (PETSc) libraries. Its facilities allow sequential and parallel Python applications to exploit state of the art algorithms and data structures readily available in PETSc for the solution of large-scale problems in science and engineering. MPI for Python and PETSc for Python are fully integrated to PETSc-FEM, an MPI and PETSc based parallel, multiphysics, finite elements code developed at CIMEC laboratory. This software infrastructure supports research activities related to simulation of fluid flows with applications ranging from the design of microfluidic devices for biochemical analysis to modeling of large-scale stream/aquifer interactions.
Aerodynamic simulation on massively parallel systems
NASA Technical Reports Server (NTRS)
Haeuser, Jochem; Simon, Horst D.
1992-01-01
This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative which is both cost effective and on the other hand can provide the necessary TeraFlops, needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary fitted grids and the fully unstructured tetrahedra grids. A fully structured grid best represents the flow physics, while the unstructured grid gives best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple datastream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two dimensional grid generation module of Grid, which is a general two dimensional and three dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good testcase for complex fluid dynamics codes, since the same datastructures are used. All systems provided good speedups, but
Parallel Computational Protein Design
Zhou, Yichao; Donald, Bruce R.; Zeng, Jianyang
2016-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that guarantees to find the global minimum energy solution (GMEC) is to combine both dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computation bottleneck of large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program that was previously developed in the Donald lab [1] to implement a GPU-based massively parallel A* algorithm for improving protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedups in large protein design cases with a small memory overhead comparing to the traditional A* search algorithm implementation, while still guaranteeing the optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle the problems in which the conformation space is too large and the global optimal solution cannot be computed previously. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with the state-of-the-art rotamer pruning algorithms such as iMinDEE [2] and DEEPer [3] to also consider continuous backbone and side-chain flexibility. PMID:27914056
2004-01-01
trees (similar to the role played by the finite- state acceptor FSA for strings). We describe the version (equivalent to TSG ( Schabes , 1990)) where...strictly contained in tree sets of tree adjoining gram- mars (Joshi and Schabes , 1997). 4 Extended-LHS Tree Transducers (xR) Section 1 informally described...changes without modifying the training procedure, as long as we stick to tree automata. 10 Related Work Tree substitution grammars or TSG ( Schabes , 1990
Kubilius, Jonas
2014-01-01
Sharing code is becoming increasingly important in the wake of Open Science. In this review I describe and compare two popular code-sharing utilities, GitHub and Open Science Framework (OSF). GitHub is a mature, industry-standard tool but lacks focus towards researchers. In comparison, OSF offers a one-stop solution for researchers but a lot of functionality is still under development. I conclude by listing alternative lesser-known tools for code and materials sharing. PMID:25165519
CodedStream: live media streaming with overlay coded multicast
NASA Astrophysics Data System (ADS)
Guo, Jiang; Zhu, Ying; Li, Baochun
2003-12-01
Multicasting is a natural paradigm for streaming live multimedia to multiple end receivers. Since IP multicast is not widely deployed, many application-layer multicast protocols have been proposed. However, all of these schemes focus on the construction of multicast trees, where a relatively small number of links carry the multicast streaming load, while the capacity of most of the other links in the overlay network remain unused. In this paper, we propose CodedStream, a high-bandwidth live media distribution system based on end-system overlay multicast. In CodedStream, we construct a k-redundant multicast graph (a directed acyclic graph) as the multicast topology, on which network coding is applied to work around bottlenecks. Simulation results have shown that the combination of k-redundant multicast graph and network coding may indeed bring significant benefits with respect to improving the quality of live media at the end receivers.
Parallel community climate model: Description and user`s guide
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H.
1996-07-15
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso.
Kamkar, Iman; Gupta, Sunil Kumar; Phung, Dinh; Venkatesh, Svetha
2015-02-01
Modern healthcare is getting reshaped by growing Electronic Medical Records (EMR). Recently, these records have been shown of great value towards building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnoses and procedures codes. These codes are usually represented in a tree form (e.g. ICD-10 tree) and the codes within a tree branch may be highly correlated. These codes can be used as features to build a prediction model and an appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test, etc.) consider each variable independently and usually end up having a long feature list. Recently, Lasso and related l1-penalty based feature selection methods have become popular due to their joint feature selection property. However, Lasso is known to have problems of selecting one feature of many correlated features randomly. This hinders the clinicians to arrive at a stable feature set, which is crucial for clinical decision making process. In this paper, we solve this problem by using a recently proposed Tree-Lasso model. Since, the stability behavior of Tree-Lasso is not well understood, we study the stability behavior of Tree-Lasso and compare it with other feature selection methods. Using a synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods e.g. Information Gain, ReliefF and T-test. We further show that, using different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than other methods. Our result has implications in identifying stable risk factors for many healthcare problems and therefore can
Payne, J.L.; Hassan, B.
1998-09-01
Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.
Massively-Parallel Dislocation Dynamics Simulations
Cai, W; Bulatov, V V; Pierce, T G; Hiratani, M; Rhee, M; Bartelt, M; Tang, M
2003-06-18
Prediction of the plastic strength of single crystals based on the collective dynamics of dislocations has been a challenge for computational materials science for a number of years. The difficulty lies in the inability of the existing dislocation dynamics (DD) codes to handle a sufficiently large number of dislocation lines, in order to be statistically representative and to reproduce experimentally observed microstructures. A new massively-parallel DD code is developed that is capable of modeling million-dislocation systems by employing thousands of processors. We discuss the general aspects of this code that make such large scale simulations possible, as well as a few initial simulation results.
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
NASA Technical Reports Server (NTRS)
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.
Code Pulse: Software Assurance (SWA) Visual Analytics for Dynamic Analysis of Code
2014-09-01
interface for the Code Pulse dynamic tracer ............. 13 Figure 5. Layout and visualization exploration mockup ...code coverage user interface ........................................................... 23 Figure 13. Mockups of package tree view to control what...involved, we designed the user experience that would satisfy both the use cases and our requirements. This came in the form of low fidelity mockups and
Parallel flow diffusion battery
Yeh, Hsu-Chi; Cheng, Yung-Sung
1984-08-07
A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.
Parallel flow diffusion battery
Yeh, H.C.; Cheng, Y.S.
1984-01-01
A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.
Fan, W.C.; Halbleib, J.A. Sr.
1996-09-01
This report provides a users` guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.
The EMCC / DARPA Massively Parallel Electromagnetic Scattering Project
NASA Technical Reports Server (NTRS)
Woo, Alex C.; Hill, Kueichien C.
1996-01-01
The Electromagnetic Code Consortium (EMCC) was sponsored by the Advanced Research Program Agency (ARPA) to demonstrate the effectiveness of massively parallel computing in large scale radar signature predictions. The EMCC/ARPA project consisted of three parts.
PALM: a Parallel Dynamic Coupler
NASA Astrophysics Data System (ADS)
Thevenin, A.; Morel, T.
2008-12-01
In order to efficiently represent complex systems, numerical modeling has to rely on many physical models at a time: an ocean model coupled with an atmospheric model is at the basis of climate modeling. The continuity of the solution is granted only if these models can constantly exchange information. PALM is a coupler allowing the concurrent execution and the intercommunication of programs not having been especially designed for that. With PALM, the dynamic coupling approach is introduced: a coupled component can be launched and can release computers' resources upon termination at any moment during the simulation. In order to exploit as much as possible computers' possibilities, the PALM coupler handles two levels of parallelism. The first level concerns the components themselves. While managing the resources, PALM allocates the number of processes which are necessary to any coupled component. These models can be parallel programs based on domain decomposition with MPI or applications multithreaded with OpenMP. The second level of parallelism is a task parallelism: one can define a coupling algorithm allowing two or more programs to be executed in parallel. PALM applications are implemented via a Graphical User Interface called PrePALM. In this GUI, the programmer initially defines the coupling algorithm then he describes the actual communications between the models. PALM offers a very high flexibility for testing different coupling techniques and for reaching the best load balance in a high performance computer. The transformation of computational independent code is almost straightforward. The other qualities of PALM are its easy set-up, its flexibility, its performances, the simple updates and evolutions of the coupled application and the many side services and functions that it offers.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
NASA Astrophysics Data System (ADS)
Vogt, Peter R.
2004-09-01
Nature often replicates her processes at different scales of space and time in differing media. Here a tree-trunk cross section I am preparing for a dendrochronological display at the Battle Creek Cypress Swamp Nature Sanctuary (Calvert County, Maryland) dried and cracked in a way that replicates practically all the planform features found along the Mid-Oceanic Ridge (see Figure 1). The left-lateral offset of saw marks, contrasting with the right-lateral ``rift'' offset, even illustrates the distinction between transcurrent (strike-slip) and transform faults, the latter only recognized as a geologic feature, by J. Tuzo Wilson, in 1965. However, wood cracking is but one of many examples of natural processes that replicate one or several elements of lithospheric plate tectonics. Many of these examples occur in everyday venues and thus make great teaching aids, ``teachable'' from primary school to university levels. Plate tectonics, the dominant process of Earth geology, also occurs in miniature on the surface of some lava lakes, and as ``ice plate tectonics'' on our frozen seas and lakes. Ice tectonics also happens at larger spatial and temporal scales on the Jovian moons Europa and perhaps Ganymede. Tabletop plate tectonics, in which a molten-paraffin ``asthenosphere'' is surfaced by a skin of congealing wax ``plates,'' first replicated Mid-Oceanic Ridge type seafloor spreading more than three decades ago. A seismologist (J. Brune, personal communication, 2004) discovered wax plate tectonics by casually and serendipitously pulling a stick across a container of molten wax his wife and daughters had used in making candles. Brune and his student D. Oldenburg followed up and mirabile dictu published the results in Science (178, 301-304).
Research in parallel computing
NASA Technical Reports Server (NTRS)
Ortega, James M.; Henderson, Charles
1994-01-01
This report summarizes work on parallel computations for NASA Grant NAG-1-1529 for the period 1 Jan. - 30 June 1994. Short summaries on highly parallel preconditioners, target-specific parallel reductions, and simulation of delta-cache protocols are provided.
NASA Technical Reports Server (NTRS)
Nicol, David; Fujimoto, Richard
1992-01-01
This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.
Ravishankar, C., Hughes Network Systems, Germantown, MD
1998-05-08
Speech is the predominant means of communication between human beings and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained to be the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of speech signal getting corrupted by noise, cross-talk and distortion Long haul transmissions which use repeaters to compensate for the loss in signal strength on transmission links also increase the associated noise and distortion. On the other hand digital transmission is relatively immune to noise, cross-talk and distortion primarily because of the capability to faithfully regenerate digital signal at each repeater purely based on a binary decision. Hence end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link Hence from a transmission point of view digital transmission has been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modem requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term Speech Coding is often referred to techniques that represent or code speech signals either directly as a waveform or as a set of parameters by analyzing the speech signal. In either case, the codes are transmitted to the distant end where speech is reconstructed or synthesized using the received set of codes. A more generic term that is applicable to these techniques that is often interchangeably used with speech coding is the term voice coding. This term is more generic in the sense that the
ERIC Educational Resources Information Center
Boyd, Amy E.; Cooper, Jim
2004-01-01
Tree rings can be used not only to look at plant growth, but also to make connections between plant growth and resource availability. In this lesson, students in 2nd-4th grades use role-play to become familiar with basic requirements of trees and how availability of those resources is related to tree ring sizes and tree growth. These concepts can…
A parallelized Python based Multi-Point Thomson Scattering analysis in NSTX-U
NASA Astrophysics Data System (ADS)
Miller, Jared; Diallo, Ahmed; Leblanc, Benoit
2014-10-01
Multi-Point Thomson Scattering (MPTS) is a reliable and accurate method of finding the temperature, density, and pressure of a magnetically confined plasma. Nd:YAG (1064 nm) lasers are fired into the plasma with a frequency of 60 Hz, and the light is Doppler shifted by Thomson scattering. Polychromators on the midplane of the tokamak pick up the light at various radii/scattering angles, and the avalanche photodiode's voltages are added to an MDSplus tree for later analysis. This project ports and optimizes the prior serial IDL MPTS code into a well-documented Python package that runs in parallel. Since there are 30 polychromators in the current NSTX setup (12 more will be added when NSTX-U is completed), using parallelism offers vast savings in performance. NumPy and SciPy further accelerate numerical calculations and matrix operations, Matplotlib and PyQt make an intuitive GUI with plots of the output, and Multiprocessing parallelizes the computationally intensive calculations. The Python package was designed with portability and flexibility in mind so it can be adapted for use in any polychromator-based MPTS system.
Parallel methods for the flight simulation model
Xiong, Wei Zhong; Swietlik, C.
1994-06-01
The Advanced Computer Applications Center (ACAC) has been involved in evaluating advanced parallel architecture computers and the applicability of these machines to computer simulation models. The advanced systems investigated include parallel machines with shared. memory and distributed architectures consisting of an eight processor Alliant FX/8, a twenty four processor sor Sequent Symmetry, Cray XMP, IBM RISC 6000 model 550, and the Intel Touchstone eight processor Gamma and 512 processor Delta machines. Since parallelizing a truly efficient application program for the parallel machine is a difficult task, the implementation for these machines in a realistic setting has been largely overlooked. The ACAC has developed considerable expertise in optimizing and parallelizing application models on a collection of advanced multiprocessor systems. One of aspect of such an application model is the Flight Simulation Model, which used a set of differential equations to describe the flight characteristics of a launched missile by means of a trajectory. The Flight Simulation Model was written in the FORTRAN language with approximately 29,000 lines of source code. Depending on the number of trajectories, the computation can require several hours to full day of CPU time on DEC/VAX 8650 system. There is an impetus to reduce the execution time and utilize the advanced parallel architecture computing environment available. ACAC researchers developed a parallel method that allows the Flight Simulation Model to be able to run in parallel on the multiprocessor system. For the benchmark data tested, the parallel Flight Simulation Model implemented on the Alliant FX/8 has achieved nearly linear speedup. In this paper, we describe a parallel method for the Flight Simulation Model. We believe the method presented in this paper provides a general concept for the design of parallel applications. This concept, in most cases, can be adapted to many other sequential application programs.
Parallel tempering Monte Carlo in LAMMPS.
Rintoul, Mark Daniel; Plimpton, Steven James; Sears, Mark P.
2003-11-01
We present here the details of the implementation of the parallel tempering Monte Carlo technique into a LAMMPS, a heavily used massively parallel molecular dynamics code at Sandia. This technique allows for many replicas of a system to be run at different simulation temperatures. At various points in the simulation, configurations can be swapped between different temperature environments and then continued. This allows for large regions of energy space to be sampled very quickly, and allows for minimum energy configurations to emerge in very complex systems, such as large biomolecular systems. By including this algorithm into an existing code, we immediately gain all of the previous work that had been put into LAMMPS, and allow this technique to quickly be available to the entire Sandia and international LAMMPS community. Finally, we present an example of this code applied to folding a small protein.
Parallel spin-orbit coupled configuration interaction
NASA Astrophysics Data System (ADS)
Tilson, J. L.; Ermler, W. C.; Pitzer, R. M.
2000-06-01
A parallel spin-orbit configuration interaction (SOCI) code has been developed. This code, named P-SOCI, is an extension of an existing sequential SOCI program and permits solution to heavy-element systems requiring both explicit spin-orbit (SO) effects and significant electron correlation. The relativistic procedure adopted here is an ab initio conventional configuration interaction (CI) method that constructs a Hamiltonian matrix in a double-group-adapted basis. P-SOCI enables solutions to problems far larger than possible with the original code by exploiting the resources of large massively parallel processing computers (MPP). This increase in capability permits not only the continued inclusion of explicit spin-orbit effects but now also a significant amount of non-dynamic and dynamic correlation as is necessary for a good description of heavy-element systems.
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers
NASA Technical Reports Server (NTRS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak
1996-01-01
This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.
TRACKING CODE DEVELOPMENT FOR BEAM DYNAMICS OPTIMIZATION
Yang, L.
2011-03-28
Dynamic aperture (DA) optimization with direct particle tracking is a straight forward approach when the computing power is permitted. It can have various realistic errors included and is more close than theoretical estimations. In this approach, a fast and parallel tracking code could be very helpful. In this presentation, we describe an implementation of storage ring particle tracking code TESLA for beam dynamics optimization. It supports MPI based parallel computing and is robust as DA calculation engine. This code has been used in the NSLS-II dynamics optimizations and obtained promising performance.
An object-oriented approach to nested data parallelism
NASA Technical Reports Server (NTRS)
Sheffler, Thomas J.; Chatterjee, Siddhartha
1994-01-01
This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, then there is the possibility for 'nested data parallelism.' Few current programming languages support nested data parallelism however. In an object-oriented framework, a collection is a single object. Its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.
Labeled trees and the efficient computation of derivations
NASA Technical Reports Server (NTRS)
Grossman, Robert; Larson, Richard G.
1989-01-01
The effective parallel symbolic computation of operators under composition is discussed. Examples include differential operators under composition and vector fields under the Lie bracket. Data structures consisting of formal linear combinations of rooted labeled trees are discussed. A multiplication on rooted labeled trees is defined, thereby making the set of these data structures into an associative algebra. An algebra homomorphism is defined from the original algebra of operators into this algebra of trees. An algebra homomorphism from the algebra of trees into the algebra of differential operators is then described. The cancellation which occurs when noncommuting operators are expressed in terms of commuting ones occurs naturally when the operators are represented using this data structure. This leads to an algorithm which, for operators which are derivations, speeds up the computation exponentially in the degree of the operator. It is shown that the algebra of trees leads naturally to a parallel version of the algorithm.
Xyce parallel electronic simulator design.
Thornquist, Heidi K.; Rankin, Eric Lamont; Mei, Ting; Schiek, Richard Louis; Keiter, Eric Richard; Russo, Thomas V.
2010-09-01
This document is the Xyce Circuit Simulator developer guide. Xyce has been designed from the 'ground up' to be a SPICE-compatible, distributed memory parallel circuit simulator. While it is in many respects a research code, Xyce is intended to be a production simulator. As such, having software quality engineering (SQE) procedures in place to insure a high level of code quality and robustness are essential. Version control, issue tracking customer support, C++ style guildlines and the Xyce release process are all described. The Xyce Parallel Electronic Simulator has been under development at Sandia since 1999. Historically, Xyce has mostly been funded by ASC, the original focus of Xyce development has primarily been related to circuits for nuclear weapons. However, this has not been the only focus and it is expected that the project will diversify. Like many ASC projects, Xyce is a group development effort, which involves a number of researchers, engineers, scientists, mathmaticians and computer scientists. In addition to diversity of background, it is to be expected on long term projects for there to be a certain amount of staff turnover, as people move on to different projects. As a result, it is very important that the project maintain high software quality standards. The point of this document is to formally document a number of the software quality practices followed by the Xyce team in one place. Also, it is hoped that this document will be a good source of information for new developers.
Optimal parallel solution of sparse triangular systems
NASA Technical Reports Server (NTRS)
Alvarado, Fernando L.; Schreiber, Robert
1990-01-01
A method for the parallel solution of triangular sets of equations is described that is appropriate when there are many right-handed sides. By preprocessing, the method can reduce the number of parallel steps required to solve Lx = b compared to parallel forward or backsolve. Applications are to iterative solvers with triangular preconditioners, to structural analysis, or to power systems applications, where there may be many right-handed sides (not all available a priori). The inverse of L is represented as a product of sparse triangular factors. The problem is to find a factored representation of this inverse of L with the smallest number of factors (or partitions), subject to the requirement that no new nonzero elements be created in the formation of these inverse factors. A method from an earlier reference is shown to solve this problem. This method is improved upon by constructing a permutation of the rows and columns of L that preserves triangularity and allow for the best possible such partition. A number of practical examples and algorithmic details are presented. The parallelism attainable is illustrated by means of elimination trees and clique trees.
Hartford, Orville; Zug, Kathryn A
2005-09-01
Tea tree oil is a popular ingredient in many over-the-counter healthcare and cosmetic products. With the explosion of the natural and alternative medicine industry, more and more people are using products containing tea tree oil. This article reviews basic information about tea tree oil and contact allergy, including sources of tea tree oil, chemical composition, potential cross reactions, reported cases of allergic contact dermatitis, allergenic compounds in tea tree oil, practical patch testing information, and preventive measures.
Parallel Atomistic Simulations
HEFFELFINGER,GRANT S.
2000-01-18
Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.
Locating hardware faults in a data communications network of a parallel computer
Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.
2010-01-12
Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.
NAS Parallel Benchmarks, Multi-Zone Versions
NASA Technical Reports Server (NTRS)
vanderWijngaart, Rob F.; Haopiang, Jin
2003-01-01
We describe an extension of the NAS Parallel Benchmarks (NPB) suite that involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy, which is common among structured-mesh production flow solver codes in use at NASA Ames and elsewhere, provides relatively easily exploitable coarse-grain parallelism between meshes. Since the individual application benchmarks also allow fine-grain parallelism themselves, this NPB extension, named NPB Multi-Zone (NPB-MZ), is a good candidate for testing hybrid and multi-level parallelization tools and strategies.
Parallel programming with PCN. Revision 1
Foster, I.; Tuecke, S.
1991-12-01
PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. In includes both tutorial and reference material. It also presents the basic concepts that underly PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (c.f. Appendix A).
Parallel Implementation of the Discontinuous Galerkin Method
NASA Technical Reports Server (NTRS)
Baggag, Abdalkader; Atkins, Harold; Keyes, David
1999-01-01
This paper describes a parallel implementation of the discontinuous Galerkin method. Discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in all object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects.
Highly parallel sparse Cholesky factorization
NASA Technical Reports Server (NTRS)
Gilbert, John R.; Schreiber, Robert
1990-01-01
Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.
ERIC Educational Resources Information Center
Lai, Hsin-Chih; Chang, Chun-Yen; Li, Wen-Shiane; Fan, Yu-Lin; Wu, Ying-Tien
2013-01-01
This study presents an m-learning method that incorporates Integrated Quick Response (QR) codes. This learning method not only achieves the objectives of outdoor education, but it also increases applications of Cognitive Theory of Multimedia Learning (CTML) (Mayer, 2001) in m-learning for practical use in a diverse range of outdoor locations. When…
Serial-Turbo-Trellis-Coded Modulation with Rate-1 Inner Code
NASA Technical Reports Server (NTRS)
Divsalar, Dariush; Dolinar, Sam; Pollara, Fabrizio
2004-01-01
Serially concatenated turbo codes have been proposed to satisfy requirements for low bit- and word-error rates and for low (in comparison with related previous codes) complexity of coding and decoding algorithms and thus low complexity of coding and decoding circuitry. These codes are applicable to such high-level modulations as octonary phase-shift keying (8PSK) and 16-state quadrature amplitude modulation (16QAM); the signal product obtained by applying one of these codes to one of these modulations is denoted, generally, as serially concatenated trellis-coded modulation (SCTCM). These codes could be particularly beneficial for communication systems that must be designed and operated subject to limitations on bandwidth and power. Some background information is prerequisite to a meaningful summary of this development. Trellis-coded modulation (TCM) is now a well-established technique in digital communications. A turbo code combines binary component codes (which typically include trellis codes) with interleaving. A turbo code of the type that has been studied prior to this development is composed of parallel concatenated convolutional codes (PCCCs) implemented by two or more constituent systematic encoders joined through one or more interleavers. The input information bits feed the first encoder and, after having been scrambled by the interleaver, enter the second encoder. A code word of a parallel concatenated code consists of the input bits to the first encoder followed by the parity check bits of both encoders. The suboptimal iterative decoding structure for such a code is modular, and consists of a set of concatenated decoding modules one for each constituent code connected through an interleaver identical to the one in the encoder side. Each decoder performs weighted soft decoding of the input sequence. PCCCs yield very large coding gains at the cost of a reduction in the data rate and/or an increase in bandwidth.
TreSpEx—Detection of Misleading Signal in Phylogenetic Reconstructions Based on Tree Information
Struck, Torsten H
2014-01-01
Phylogenies of species or genes are commonplace nowadays in many areas of comparative biological studies. However, for phylogenetic reconstructions one must refer to artificial signals such as paralogy, long-branch attraction, saturation, or conflict between different datasets. These signals might eventually mislead the reconstruction even in phylogenomic studies employing hundreds of genes. Unfortunately, there has been no program allowing the detection of such effects in combination with an implementation into automatic process pipelines. TreSpEx (Tree Space Explorer) now combines different approaches (including statistical tests), which utilize tree-based information like nodal support or patristic distances (PDs) to identify misleading signals. The program enables the parallel analysis of hundreds of trees and/or predefined gene partitions, and being command-line driven, it can be integrated into automatic process pipelines. TreSpEx is implemented in Perl and supported on Linux, Mac OS X, and MS Windows. Source code, binaries, and additional material are freely available at http://www.annelida.de/research/bioinformatics/software.html. PMID:24701118
Shimada, Makoto K; Nishida, Tsunetoshi
2017-04-01
Felsenstein's PHYLIP package of molecular phylogeny tools has been used globally since 1980. The programs are receiving renewed attention because of their character-based user interface, which has the advantage of being scriptable for use with large-scale data studies based on super-computers or massively parallel computing clusters. However, occasionally we found, the PHYLIP Consense program output text file displays two or more divided bootstrap values for the same cluster in its result table, and when this happens the output Newick tree file incorrectly assigns only the last value to that cluster that disturbs correct estimation of a consensus tree. We ascertained the cause of this aberrant behavior in the bootstrapping calculation. Our rewrite of the Consense program source code outputs bootstrap values, without redundancy, in its result table, and a Newick tree file with appropriate, corresponding bootstrap values. Furthermore, we developed an add-on program and shell script, add_bootstrap.pl and fasta2tre_bs.bsh, to generate a Newick tree containing the topology and branch lengths inferred from the original data along with valid bootstrap values, and to actualize the automated inference of a phylogenetic tree containing the originally inferred topology and branch lengths with bootstrap values, from multiple unaligned sequences, respectively. These programs can be downloaded at: https://github.com/ShimadaMK/PHYLIP_enhance/.
NASA Technical Reports Server (NTRS)
Saini, Subhash; Frumkin, Michael; Hribar, Michelle; Jin, Hao-Qiang; Waheed, Abdul; Yan, Jerry
1998-01-01
Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is extremely time consuming and costly, porting codes would ideally be automated by using some parallelization tools and compilers. In this paper, we compare the performance of the hand written NAB Parallel Benchmarks against three parallel versions generated with the help of tools and compilers: 1) CAPTools: an interactive computer aided parallelization too] that generates message passing code, 2) the Portland Group's HPF compiler and 3) using compiler directives with the native FORTAN77 compiler on the SGI Origin2000.
Driver Code for Adaptive Optics
NASA Technical Reports Server (NTRS)
Rao, Shanti
2007-01-01
A special-purpose computer code for a deformable-mirror adaptive-optics control system transmits pixel-registered control from (1) a personal computer running software that generates the control data to (2) a circuit board with 128 digital-to-analog converters (DACs) that generate voltages to drive the deformable-mirror actuators. This program reads control-voltage codes from a text file, then sends them, via the computer s parallel port, to a circuit board with four AD5535 (or equivalent) chips. Whereas a similar prior computer program was capable of transmitting data to only one chip at a time, this program can send data to four chips simultaneously. This program is in the form of C-language code that can be compiled and linked into an adaptive-optics software system. The program as supplied includes source code for integration into the adaptive-optics software, documentation, and a component that provides a demonstration of loading DAC codes from a text file. On a standard Windows desktop computer, the software can update 128 channels in 10 ms. On Real-Time Linux with a digital I/O card, the software can update 1024 channels (8 boards in parallel) every 8 ms.
Parallel digital forensics infrastructure.
Liebrock, Lorie M.; Duggan, David Patrick
2009-10-01
This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexico Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Boerschlein, David P.
1993-01-01
Fault-Tree Compiler (FTC) program, is software tool used to calculate probability of top event in fault tree. Gates of five different types allowed in fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N. High-level input language easy to understand and use. In addition, program supports hierarchical fault-tree definition feature, which simplifies tree-description process and reduces execution time. Set of programs created forming basis for reliability-analysis workstation: SURE, ASSIST, PAWS/STEM, and FTC fault-tree tool (LAR-14586). Written in PASCAL, ANSI-compliant C language, and FORTRAN 77. Other versions available upon request.
Parallel Strategies for Crash and Impact Simulations
Attaway, S.; Brown, K.; Hendrickson, B.; Plimpton, S.
1998-12-07
We describe a general strategy we have found effective for parallelizing solid mechanics simula- tions. Such simulations often have several computationally intensive parts, including finite element integration, detection of material contacts, and particle interaction if smoothed particle hydrody- namics is used to model highly deforming materials. The need to balance all of these computations simultaneously is a difficult challenge that has kept many commercial and government codes from being used effectively on parallel supercomputers with hundreds or thousands of processors. Our strategy is to load-balance each of the significant computations independently with whatever bal- ancing technique is most appropriate. The chief benefit is that each computation can be scalably paraIlelized. The drawback is the data exchange between processors and extra coding that must be written to maintain multiple decompositions in a single code. We discuss these trade-offs and give performance results showing this strategy has led to a parallel implementation of a widely-used solid mechanics code that can now be run efficiently on thousands of processors of the Pentium-based Sandia/Intel TFLOPS machine. We illustrate with several examples the kinds of high-resolution, million-element models that can now be simulated routinely. We also look to the future and dis- cuss what possibilities this new capabUity promises, as well as the new set of challenges it poses in material models, computational techniques, and computing infrastructure.
Introduction to Parallel Computing
1992-05-01
Topology C, Ada, C++, Data-parallel FORTRAN, 2D mesh of node boards, each node FORTRAN-90 (late 1992) board has 1 application processor Devopment Tools ...parallel machines become the wave of the present, tools are increasingly needed to assist programmers in creating parallel tasks and coordinating...their activities. Linda was designed to be such a tool . Linda was designed with three important goals in mind: to be portable, efficient, and easy to use
Parallel Wolff Cluster Algorithms
NASA Astrophysics Data System (ADS)
Bae, S.; Ko, S. H.; Coddington, P. D.
The Wolff single-cluster algorithm is the most efficient method known for Monte Carlo simulation of many spin models. Due to the irregular size, shape and position of the Wolff clusters, this method does not easily lend itself to efficient parallel implementation, so that simulations using this method have thus far been confined to workstations and vector machines. Here we present two parallel implementations of this algorithm, and show that one gives fairly good performance on a MIMD parallel computer.
NASA Astrophysics Data System (ADS)
Mighell, Kenneth John
2011-11-01
The development of parallel-processing image-analysis codes is generally a challenging task that requires complicated choreography of interprocessor communications. If, however, the image-analysis algorithm is embarrassingly parallel, then the development of a parallel-processing implementation of that algorithm can be a much easier task to accomplish because, by definition, there is little need for communication between the compute processes. I describe the design, implementation, and performance of a parallel-processing image-analysis application, called CRBLASTER, which does cosmic-ray rejection of CCD (charge-coupled device) images using the embarrassingly-parallel L.A.COSMIC algorithm. CRBLASTER is written in C using the high-performance computing industry standard Message Passing Interface (MPI) library. The code has been designed to be used by research scientists who are familiar with C as a parallel-processing computational framework that enables the easy development of parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. The CRBLASTER source code is freely available at the official application website at the National Optical Astronomy Observatory. Removing cosmic rays from a single 800x800 pixel Hubble Space Telescope WFPC2 image takes 44 seconds with the IRAF script lacos_im.cl running on a single core of an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. CRBLASTER is 7.4 times faster processing the same image on a single core on the same machine. Processing the same image with CRBLASTER simultaneously on all 8 cores of the same machine takes 0.875 seconds -- which is a speedup factor of 50.3 times faster than the IRAF script. A detailed analysis is presented of the performance of CRBLASTER using between 1 and 57 processors on a low-power Tilera 700-MHz 64-core TILE64 processor.
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan
1994-01-01
A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also be used to run several benchmark parallel knowledge bases such as one to set up a cafeteria. Results show from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear in some cases, are possible.
Application Portable Parallel Library
NASA Technical Reports Server (NTRS)
Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott
1995-01-01
Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.
Parallel Algorithms and Patterns
Robey, Robert W.
2016-06-16
This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
Xyce parallel electronic simulator release notes.
Keiter, Eric R; Hoekstra, Robert John; Mei, Ting; Russo, Thomas V.; Schiek, Richard Louis; Thornquist, Heidi K.; Rankin, Eric Lamont; Coffey, Todd S; Pawlowski, Roger P; Santarelli, Keith R.
2010-05-01
The Xyce Parallel Electronic Simulator has been written to support, in a rigorous manner, the simulation needs of the Sandia National Laboratories electrical designers. Specific requirements include, among others, the ability to solve extremely large circuit problems by supporting large-scale parallel computing platforms, improved numerical performance and object-oriented code design and implementation. The Xyce release notes describe: Hardware and software requirements New features and enhancements Any defects fixed since the last release Current known defects and defect workarounds For up-to-date information not available at the time these notes were produced, please visit the Xyce web page at http://www.cs.sandia.gov/xyce.
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.; Vishwanath, Venkatram; Kumaran, Kalyan
2015-01-01
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
User's Guide for ENSAERO_FE Parallel Finite Element Solver
NASA Technical Reports Server (NTRS)
Eldred, Lloyd B.; Guruswamy, Guru P.
1999-01-01
A high fidelity parallel static structural analysis capability is created and interfaced to the multidisciplinary analysis package ENSAERO-MPI of Ames Research Center. This new module replaces ENSAERO's lower fidelity simple finite element and modal modules. Full aircraft structures may be more accurately modeled using the new finite element capability. Parallel computation is performed by breaking the full structure into multiple substructures. This approach is conceptually similar to ENSAERO's multizonal fluid analysis capability. The new substructure code is used to solve the structural finite element equations for each substructure in parallel. NASTRANKOSMIC is utilized as a front end for this code. Its full library of elements can be used to create an accurate and realistic aircraft model. It is used to create the stiffness matrices for each substructure. The new parallel code then uses an iterative preconditioned conjugate gradient method to solve the global structural equations for the substructure boundary nodes.
Raven, John A; Andrews, Mitchell
2010-09-01
Using a broad definition of trees, the evolutionary origins of trees in a nutritional context is considered using data from the fossil record and molecular phylogeny. Trees are first known from the Late Devonian about 380 million years ago, originated polyphyletically at the pteridophyte grade of organization; the earliest gymnosperms were trees, and trees are polyphyletic in the angiosperms. Nutrient transporters, assimilatory pathways, homoiohydry (cuticle, intercellular gas spaces, stomata, endohydric water transport systems including xylem and phloem-like tissue) and arbuscular mycorrhizas preceded the origin of trees. Nutritional innovations that began uniquely in trees were the seed habit and, certainly (but not necessarily uniquely) in trees, ectomycorrhizas, cyanobacterial, actinorhizal and rhizobial (Parasponia, some legumes) diazotrophic symbioses and cluster roots.
NASA Technical Reports Server (NTRS)
Buntine, Wray
1993-01-01
This paper introduces the IND Tree Package to prospective users. IND does supervised learning using classification trees. This learning task is a basic tool used in the development of diagnosis, monitoring and expert systems. The IND Tree Package was developed as part of a NASA project to semi-automate the development of data analysis and modelling algorithms using artificial intelligence techniques. The IND Tree Package integrates features from CART and C4 with newer Bayesian and minimum encoding methods for growing classification trees and graphs. The IND Tree Package also provides an experimental control suite on top. The newer features give improved probability estimates often required in diagnostic and screening tasks. The package comes with a manual, Unix 'man' entries, and a guide to tree methods and research. The IND Tree Package is implemented in C under Unix and was beta-tested at university and commercial research laboratories in the United States.
Weening, J.S.
1988-05-01
CSIM is a simulator for parallel Lisp, based on a continuation passing interpreter. It models a shared-memory multiprocessor executing programs written in Common Lisp, extended with several primitives for creating and controlling processes. This paper describes the structure of the simulator, measures its performance, and gives an example of its use with a parallel Lisp program.
Parallel and Distributed Computing.
1986-12-12
program was devoted to parallel and distributed computing . Support for this part of the program was obtained from the present Army contract and a...Umesh Vazirani. A workshop on parallel and distributed computing was held from May 19 to May 23, 1986 and drew 141 participants. Keywords: Mathematical programming; Protocols; Randomized algorithms. (Author)
Max, N
2002-08-19
This paper is a survey of the author's work on illumination and shadows under trees, including the effects of sky illumination, sun penumbras, scattering in a misty atmosphere below the trees, and multiple scattering and transmission between leaves. It also describes a hierarchical image-based rendering method for trees.
ERIC Educational Resources Information Center
Brooks, Sarah DeWitt
2010-01-01
This article describes the author's experience in implementing a Wish Tree project in her school in an effort to bring the school community together with a positive art-making experience during a potentially stressful time. The concept of a wish tree is simple: plant a tree; provide tags and pencils for writing wishes; and encourage everyone to…
ERIC Educational Resources Information Center
Srulowitz, Frances
1992-01-01
Describes an activity to develop students' skills of observation and recordkeeping by studying the growth of a tree's leaves during the spring. Children monitor the growth of 11 tress over a 2-month period, draw pictures of the tree at different stages of growth, and write diaries of the tree's growth. (MDH)
Minnesota's Forest Trees. Revised.
ERIC Educational Resources Information Center
Miles, William R.; Fuller, Bruce L.
This bulletin describes 46 of the more common trees found in Minnesota's forests and windbreaks. The bulletin contains two tree keys, a summer key and a winter key, to help the reader identify these trees. Besides the two keys, the bulletin includes an introduction, instructions for key use, illustrations of leaf characteristics and twig…
ERIC Educational Resources Information Center
Sweeney, Debra; Rounds, Judy
2011-01-01
Trees are great inspiration for artists. Many art teachers find themselves inspired and maybe somewhat obsessed with the natural beauty and elegance of the lofty tree, and how it changes through the seasons. One such tree that grows in several regions and always looks magnificent, regardless of the time of year, is the birch. In this article, the…
SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution
Beckman, P.; Crotinger, J.; Karmesin, S.; Malony, A.; Oldehoeft, R.; Shende, S.; Smith, S.; Vajracharya, S.
1999-01-04
In the solution of large-scale numerical prob- lems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely com- plex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of manag- ing parallelism and data locality from the user. We present innovative algorithms, based on the macro -dataflow model, for detecting data parallelism and efficiently executing data- parallel statements on shared-memory multiprocessors. We also desclibe how these algorithms can be implemented on clusters of SMPS.
Massively parallel mathematical sieves
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
Totally parallel multilevel algorithms
NASA Technical Reports Server (NTRS)
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Not Available
1991-10-23
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Parallel beam dynamics simulation of linear accelerators
Qiang, Ji; Ryne, Robert D.
2002-01-31
In this paper we describe parallel particle-in-cell methods for the large scale simulation of beam dynamics in linear accelerators. These techniques have been implemented in the IMPACT (Integrated Map and Particle Accelerator Tracking) code. IMPACT is being used to study the behavior of intense charged particle beams and as a tool for the design of next-generation linear accelerators. As examples, we present applications of the code to the study of emittance exchange in high intensity beams and to the study of beam transport in a proposed accelerator for the development of accelerator-driven waste transmutation technologies.
Lober, R.R.; Tautges, T.J.; Vaughan, C.T.
1997-03-01
Paving is an automated mesh generation algorithm which produces all-quadrilateral elements. It can additionally generate these elements in varying sizes such that the resulting mesh adapts to a function distribution, such as an error function. While powerful, conventional paving is a very serial algorithm in its operation. Parallel paving is the extension of serial paving into parallel environments to perform the same meshing functions as conventional paving only on distributed, discretized models. This extension allows large, adaptive, parallel finite element simulations to take advantage of paving`s meshing capabilities for h-remap remeshing. A significantly modified version of the CUBIT mesh generation code has been developed to host the parallel paving algorithm and demonstrate its capabilities on both two dimensional and three dimensional surface geometries and compare the resulting parallel produced meshes to conventionally paved meshes for mesh quality and algorithm performance. Sandia`s {open_quotes}tiling{close_quotes} dynamic load balancing code has also been extended to work with the paving algorithm to retain parallel efficiency as subdomains undergo iterative mesh refinement.
The stellar atmosphere simulation code Bifrost. Code description and validation
NASA Astrophysics Data System (ADS)
Gudiksen, B. V.; Carlsson, M.; Hansteen, V. H.; Hayek, W.; Leenaarts, J.; Martínez-Sykora, J.
2011-07-01
Context. Numerical simulations of stellar convection and photospheres have been developed to the point where detailed shapes of observed spectral lines can be explained. Stellar atmospheres are very complex, and very different physical regimes are present in the convection zone, photosphere, chromosphere, transition region and corona. To understand the details of the atmosphere it is necessary to simulate the whole atmosphere since the different layers interact strongly. These physical regimes are very diverse and it takes a highly efficient massively parallel numerical code to solve the associated equations. Aims: The design, implementation and validation of the massively parallel numerical code Bifrost for simulating stellar atmospheres from the convection zone to the corona. Methods: The code is subjected to a number of validation tests, among them the Sod shock tube test, the Orzag-Tang colliding shock test, boundary condition tests and tests of how the code treats magnetic field advection, chromospheric radiation, radiative transfer in an isothermal scattering atmosphere, hydrogen ionization and thermal conduction. Results.Bifrost completes the tests with good results and shows near linear efficiency scaling to thousands of computing cores.
Code of accounts, management overview volume: Richland environmental restoration. Revision 4
Hajner, R.S.
2000-01-19
This document contains the code of accounts volume for the Richland Environmental Restoration Project. Contents include: Total ERC work category, Work location listing, Standard work activity, Work activity definitions, Code of Account trees, the Code of Accounts, Netscape instructions, Setup of charge codes, and Distribution.
Parallel molecular dynamics: Communication requirements for massively parallel machines
NASA Astrophysics Data System (ADS)
Taylor, Valerie E.; Stevens, Rick L.; Arnold, Kathryn E.
1995-05-01
Molecular mechanics and dynamics are becoming widely used to perform simulations of molecular systems from large-scale computations of materials to the design and modeling of drug compounds. In this paper we address two major issues: a good decomposition method that can take advantage of future massively parallel processing systems for modest-sized problems in the range of 50,000 atoms and the communication requirements needed to achieve 30 to 40% efficiency on MPPs. We analyzed a scalable benchmark molecular dynamics program executing on the Intel Touchstone Deleta parallelized with an interaction decomposition method. Using a validated analytical performance model of the code, we determined that for an MPP with a four-dimensional mesh topology and 400 MHz processors the communication startup time must be at most 30 clock cycles and the network bandwidth must be at least 2.3 GB/s. This configuration results in 30 to 40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.
CFD Optimization on Network-Based Parallel Computer System
NASA Technical Reports Server (NTRS)
Cheung, Samson H.; Holst, Terry L. (Technical Monitor)
1994-01-01
Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advance computational fluid dynamics codes, which is computationally expensive in mainframe supercomputer. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computer on a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package has been applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.
Parallel CFD design on network-based computer
NASA Technical Reports Server (NTRS)
Cheung, Samson
1995-01-01
Combining multiple engineering workstations into a network-based heterogeneous parallel computer allows application of aerodynamic optimization with advanced computational fluid dynamics codes, which can be computationally expensive on mainframe supercomputers. This paper introduces a nonlinear quasi-Newton optimizer designed for this network-based heterogeneous parallel computing environment utilizing a software called Parallel Virtual Machine. This paper will introduce the methodology behind coupling a Parabolized Navier-Stokes flow solver to the nonlinear optimizer. This parallel optimization package is applied to reduce the wave drag of a body of revolution and a wing/body configuration with results of 5% to 6% drag reduction.
Plasma simulation using the massively parallel processor
NASA Technical Reports Server (NTRS)
Lin, C. S.; Thring, A. L.; Koga, J.; Janetzke, R. W.
1987-01-01
Two dimensional electrostatic simulation codes using the particle-in-cell model are developed on the Massively Parallel Processor (MPP). The conventional plasma simulation procedure that computes electric fields at particle positions by means of a gridded system is found inefficient on the MPP. The MPP simulation code is thus based on the gridless system in which particles are assigned to processing elements and electric fields are computed directly via Discrete Fourier Transform. Currently, the gridless model on the MPP in two dimensions is about nine times slower that the gridded system on the CRAY X-MP without considering I/O time. However, the gridless system on the MPP can be improved by incorporating a faster I/O between the staging memory and Array Unit and a more efficient procedure for taking floating point sums over processing elements. The initial results suggest that the parallel processors have the potential for performing large scale plasma simulations.
Automatically Parallelizing Legacy Binary Code for Multicore Architectures
2009-08-01
had a lasting impact on the respective works of each researcher. 1 1 Executive Summary The industrial adoption of multicore computing has...The rise of commodity multicore computing has forced many organizations to re-evaluate current programming paradigms and software execution models...advantage of the power provided by now- commodity multicore hardware --- a power for which the extensive legacy computing base seems ill-matched, since
CUDT: a CUDA based decision tree algorithm.
Lo, Win-Tsung; Chang, Yue-Shan; Sheu, Ruey-Kai; Chiu, Chun-Chieh; Yuan, Shyan-Ming
2014-01-01
Decision tree is one of the famous classification methods in data mining. Many researches have been proposed, which were focusing on improving the performance of decision tree. However, those algorithms are developed and run on traditional distributed systems. Obviously the latency could not be improved while processing huge data generated by ubiquitous sensing node in the era without new technology help. In order to improve data processing latency in huge data mining, in this paper, we design and implement a new parallelized decision tree algorithm on a CUDA (compute unified device architecture), which is a GPGPU solution provided by NVIDIA. In the proposed system, CPU is responsible for flow control while the GPU is responsible for computation. We have conducted many experiments to evaluate system performance of CUDT and made a comparison with traditional CPU version. The results show that CUDT is 5 ∼ 55 times faster than Weka-j48 and is 18 times speedup than SPRINT for large data set.
ERIC Educational Resources Information Center
Rollinson, Susan Wells
2012-01-01
The growth of a pine tree is examined by preparing "tree cookies" (cross-sectional disks) between whorls of branches. The use of Christmas trees allows the tree cookies to be obtained with inexpensive, commonly available tools. Students use the tree cookies to investigate the annual growth of the tree and how it corresponds to the number of whorls…
Parallel computing techniques for rotorcraft aerodynamics
NASA Astrophysics Data System (ADS)
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
A parallel, portable and versatile treecode
Warren, M.S.; Salmon, J.K. |
1994-10-01
Portability and versatility are important characteristics of a computer program which is meant to be generally useful. We describe how we have developed a parallel N-body treecode to meet these goals. A variety of applications to which the code can be applied are mentioned. Performance of the program is also measured on several machines. A 512 processor Intel Paragon can solve for the forces on 10 million gravitationally interacting particles to 0.5% rms accuracy in 28.6 seconds.
Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.
2000-01-01
Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining
NASA Astrophysics Data System (ADS)
Lee, Mike M.; Cho, Byung Lok
2001-11-01
In this paper, we proposed a new First Partial product Addition (FPA) architecture with new compressor (or parallel counter) to CSA tree built in the process of adding partial product for improving speed in the fast parallel multiplier to improve the speed of calculating partial product by about 20% compared with existing parallel counter using full Adder. The new circuit reduces the CLA bit finding final sum by N/2 using the novel FPA architecture. A 5.14ns of multiplication speed of the 16X16 multiplier is obtained using 0.25um CMOS technology. The architecture of the multiplier is easily opted for pipeline design and demonstrates high speed performance.
The Fault Tree Compiler (FTC): Program and mathematics
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Martensen, Anna L.
1989-01-01
The Fault Tree Compiler Program is a new reliability tool used to predict the top-event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, AND m OF n gates. The high-level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precisely (within the limits of double precision floating point arithmetic) within a user specified number of digits accuracy. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Equipment Corporation (DEC) VAX computer with the VMS operation system.
Scalable still image coding based on wavelet
NASA Astrophysics Data System (ADS)
Yan, Yang; Zhang, Zhengbing
2005-02-01
The scalable image coding is an important objective of the future image coding technologies. In this paper, we present a kind of scalable image coding scheme based on wavelet transform. This method uses the famous EZW (Embedded Zero tree Wavelet) algorithm; we give a high-quality encoding to the ROI (region of interest) of the original image and a rough encoding to the rest. This method is applied well in limited memory space condition, and we encode the region of background according to the memory capacity. In this way, we can store the encoded image in limited memory space easily without losing its main information. Simulation results show it is effective.
NASA Technical Reports Server (NTRS)
Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)
1993-01-01
A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.
Parallelization of the Lagrangian Particle Dispersion Model
Buckley, R.L.; O`Steen, B.L.
1997-08-01
An advanced stochastic Lagrangian Particle Dispersion Model (LPDM) is used by the Atmospheric Technologies Group (ATG) to simulate contaminant transport. The model uses time-dependent three-dimensional fields of wind and turbulence to determine the location of individual particles released into the atmosphere. This report describes modifications to LPDM using the Message Passing Interface (MPI) which allows for execution in a parallel configuration on the Cray Supercomputer facility at the SRS. Use of a parallel version allows for many more particles to be released in a given simulation, with little or no increase in computational time. This significantly lowers (greater than an order of magnitude) the minimum resolvable concentration levels without ad hoc averaging schemes and/or without reducing spatial resolution. The general changes made to LPDM are discussed and a series of tests are performed comparing the serial (single processor) and parallel versions of the code.
Extending HPF for advanced data parallel applications
NASA Technical Reports Server (NTRS)
Chapman, Barbara; Mehrotra, Piyush; Zima, Hans
1994-01-01
The stated goal of High Performance Fortran (HPF) was to 'address the problems of writing data parallel programs where the distribution of data affects performance'. After examining the current version of the language we are led to the conclusion that HPF has not fully achieved this goal. While the basic distribution functions offered by the language - regular block, cyclic, and block cyclic distributions - can support regular numerical algorithms, advanced applications such as particle-in-cell codes or unstructured mesh solvers cannot be expressed adequately. We believe that this is a major weakness of HPF, significantly reducing its chances of becoming accepted in the numeric community. The paper discusses the data distribution and alignment issues in detail, points out some flaws in the basic language, and outlines possible future paths of development. Furthermore, we briefly deal with the issue of task parallelism and its integration with the data parallel paradigm of HPF.
Partitioning problems in parallel, pipelined and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, S.
1985-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
Partitioning problems in parallel, pipelined, and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, Shahid H.
1988-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A sum-bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single-tree structured parallel programs. In addition, the problem of partitioning chain-structured parallel programs across chain-connected systems is solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple-computer architectures for a wide range of problems of practical interest.
ANTLR Tree Grammar Generator and Extensions
NASA Technical Reports Server (NTRS)
Craymer, Loring
2005-01-01
A computer program implements two extensions of ANTLR (Another Tool for Language Recognition), which is a set of software tools for translating source codes between different computing languages. ANTLR supports predicated- LL(k) lexer and parser grammars, a notation for annotating parser grammars to direct tree construction, and predicated tree grammars. [ LL(k) signifies left-right, leftmost derivation with k tokens of look-ahead, referring to certain characteristics of a grammar.] One of the extensions is a syntax for tree transformations. The other extension is the generation of tree grammars from annotated parser or input tree grammars. These extensions can simplify the process of generating source-to-source language translators and they make possible an approach, called "polyphase parsing," to translation between computing languages. The typical approach to translator development is to identify high-level semantic constructs such as "expressions," "declarations," and "definitions" as fundamental building blocks in the grammar specification used for language recognition. The polyphase approach is to lump ambiguous syntactic constructs during parsing and then disambiguate the alternatives in subsequent tree transformation passes. Polyphase parsing is believed to be useful for generating efficient recognizers for C++ and other languages that, like C++, have significant ambiguities.
TORUS: Radiation transport and hydrodynamics code
NASA Astrophysics Data System (ADS)
Harries, Tim
2014-04-01
TORUS is a flexible radiation transfer and radiation-hydrodynamics code. The code has a basic infrastructure that includes the AMR mesh scheme that is used by several physics modules including atomic line transfer in a moving medium, molecular line transfer, photoionization, radiation hydrodynamics and radiative equilibrium. TORUS is useful for a variety of problems, including magnetospheric accretion onto T Tauri stars, spiral nebulae around Wolf-Rayet stars, discs around Herbig AeBe stars, structured winds of O supergiants and Raman-scattered line formation in symbiotic binaries, and dust emission and molecular line formation in star forming clusters. The code is written in Fortran 2003 and is compiled using a standard Gnu makefile. The code is parallelized using both MPI and OMP, and can use these parallel sections either separately or in a hybrid mode.
ERIC Educational Resources Information Center
Rogers, Pat
1972-01-01
Criteria for a reasonable axiomatic system are discussed. A discussion of the historical attempts to prove the independence of Euclids parallel postulate introduces non-Euclidean geometries. Poincare's model for a non-Euclidean geometry is defined and analyzed. (LS)
Scalable parallel communications
NASA Technical Reports Server (NTRS)
Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.
1992-01-01
Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth
NASA Technical Reports Server (NTRS)
Reif, John H.
1987-01-01
A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless test compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
Revisiting and parallelizing SHAKE
NASA Astrophysics Data System (ADS)
Weinbach, Yael; Elber, Ron
2005-10-01
An algorithm is presented for running SHAKE in parallel. SHAKE is a widely used approach to compute molecular dynamics trajectories with constraints. An essential step in SHAKE is the solution of a sparse linear problem of the type Ax = b, where x is a vector of unknowns. Conjugate gradient minimization (that can be done in parallel) replaces the widely used iteration process that is inherently serial. Numerical examples present good load balancing and are limited only by communication time.
HERCULES: A Pattern Driven Code Transformation System
Kartsaklis, Christos; Hernandez, Oscar R; Hsu, Chung-Hsing; Ilsche, Thomas; Joubert, Wayne; Graham, Richard L
2012-01-01
New parallel computers are emerging, but developing efficient scientific code for them remains difficult. A scientist must manage not only the science-domain complexity but also the performance-optimization complexity. HERCULES is a code transformation system designed to help the scientist to separate the two concerns, which improves code maintenance, and facilitates performance optimization. The system combines three technologies, code patterns, transformation scripts and compiler plugins, to provide the scientist with an environment to quickly implement code transformations that suit his needs. Unlike existing code optimization tools, HERCULES is unique in its focus on user-level accessibility. In this paper we discuss the design, implementation and an initial evaluation of HERCULES.
On parallelization of the loop over elements in FEAP
NASA Astrophysics Data System (ADS)
Jarzebski, P.; Wisniewski, K.; Taylor, R. L.
2015-07-01
In this paper, we consider parallelization of the loop over elements using OpenMP in FEAP (Taylor, 2014), which is a research FE code, very popular at universities. Even for a serial version of FEAP (a cluster version also exists) such a parallelization is a non-trivial task due to the existing architecture of this code, which complicates efficient parallelization. First, we compare the serial version of FEAP to the parallel code Warp3D (Dodds et al., 2014), considering the usage of time and memory. As we found, Warp3D is much faster but uses more memory than FEAP. An analysis of Warp3D helps us to devise our method of parallelization of the loop over elements. Next, we describe several changes in FEAP, which were necessary to parallelize the loop over elements using OpenMP. In particular, the subroutine assembling elemental matrices is identified as crucial to good performance, and several directives for the mutual exclusion synchronization of OpenMP are implemented and tested. Finally, we demonstrate the performance of the parallelized FEAP, designated as ompFEAP, on numerical examples involving 3D and shell elements of FEAP as well as user's elements. We conclude that ompFEAP, using the directive ATOMIC for synchronization of the assembling, provides a very good speedup and efficiency.
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting; Russo, Thomas V.; Schiek, Richard L.; Sholander, Peter E.; Thornquist, Heidi K.; Verley, Jason C.
2016-06-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.
The BLAZE language - A parallel language for scientific programming
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Van Rosendale, John
1987-01-01
A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.
Parallel image computation in clusters with task-distributor.
Baun, Christian
2016-01-01
Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel for speeding up the image computation. Because ray tracing is a computational expensive and memory consuming task, ray tracing can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). Task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation, in contrast to various other projects. By analyzing the runtime of the sequential and parallel program parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.
The BLAZE language: A parallel language for scientific programming
NASA Technical Reports Server (NTRS)
Mehrotra, P.; Vanrosendale, J.
1985-01-01
A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.
NASA Technical Reports Server (NTRS)
Divsalar, D.; Pollara, F.
1995-01-01
In this article, we design new turbo codes that can achieve near-Shannon-limit performance. The design criterion for random interleavers is based on maximizing the effective free distance of the turbo code, i.e., the minimum output weight of codewords due to weight-2 input sequences. An upper bound on the effective free distance of a turbo code is derived. This upper bound can be achieved if the feedback connection of convolutional codes uses primitive polynomials. We review multiple turbo codes (parallel concatenation of q convolutional codes), which increase the so-called 'interleaving gain' as q and the interleaver size increase, and a suitable decoder structure derived from an approximation to the maximum a posteriori probability decision rule. We develop new rate 1/3, 2/3, 3/4, and 4/5 constituent codes to be used in the turbo encoder structure. These codes, for from 2 to 32 states, are designed by using primitive polynomials. The resulting turbo codes have rates b/n (b = 1, 2, 3, 4 and n = 2, 3, 4, 5, 6), and include random interleavers for better asymptotic performance. These codes are suitable for deep-space communications with low throughput and for near-Earth communications where high throughput is desirable. The performance of these codes is within 1 dB of the Shannon limit at a bit-error rate of 10(exp -6) for throughputs from 1/15 up to 4 bits/s/Hz.
Performance of a parallel algorithm for standard cell placement on the Intel Hypercube
NASA Technical Reports Server (NTRS)
Jones, Mark; Banerjee, Prithviraj
1987-01-01
A parallel simulated annealing algorithm for standard cell placement on the Intel Hypercube is presented. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than uniprocessor simulated annealing algorithms.
The Integrated TIGER Series Codes
Kensek, Ronald P.; Franke, Brian C.; Laub, Thomas W.
2006-01-15
ITS is a powerful and user-friendly software package permitting state-of-the-art Monte Carlo solution of linear time-independent coupled electron/photon radiation transport problems, with or without the presence of macroscopic electric and magnetic fields of arbitrary spatial dependence. Our goal has been to simultaneously maximize operational simplicity and physical accuracy. Through a set of preprocessor directives, the user selects one of the many ITS codes. The ease with which the makefile system is applied combines with an input scheme based on order-independent descriptive keywords that makes maximum use of defaults and intemal error checking to provide experimentalists and theorists alike with a method for the routine but rigorous solution of sophisticated radiation transport problems. Physical rigor is provided by employing accurate cross sections, sampling distributions, and physical models for describing the production and transport of the electron/photon cascade from 1.0 GeV down to 1.0 keV. The availability of source code permits the more sophisticated user to tailor the codes to specific applications and to extend the capabilities of the codes to more complex applications. Version 5.0, the latest version of ITS, contains (1) improvements to the ITS 3.0 continuous-energy codes, (2) multigroup codes with adjoint transport capabilities, (3) parallel implementations of all ITS codes, (4) a general purpose geometry engine for linking with CAD or other geometry formats, and (5) the Cholla facet geometry library. Moreover, the general user friendliness of the software has been enhanced through increased internal error checking and improved code portability.
Simplified Decoding of Convolutional Codes
NASA Technical Reports Server (NTRS)
Truong, T. K.; Reed, I. S.
1986-01-01
Some complicated intermediate steps shortened or eliminated. Decoding of convolutional error-correcting digital codes simplified by new errortrellis syndrome technique. In new technique, syndrome vector not computed. Instead, advantage taken of newly-derived mathematical identities simplify decision tree, folding it back on itself into form called "error trellis." This trellis graph of all path solutions of syndrome equations. Each path through trellis corresponds to specific set of decisions as to received digits. Existing decoding algorithms combined with new mathematical identities reduce number of combinations of errors considered and enable computation of correction vector directly from data and check bits as received.
Evolution of the genetic code.
Davis, B K
1999-01-01
Comparative path lengths in amino acid biosynthesis and other molecular indicators of the timing of codon assignment were examined to reconstruct the main stages of code evolution. The codon tree obtained was rooted in the 4 N-fixing amino acids (Asp, Glu, Asn, Gln) and 16 triplets of the NAN set. This small, locally phased (commaless) code evidently arose from ambiguous translation on a poly(A) collector strand, in a surface reaction network. Copolymerisation of these amino acids yields polyanionic peptide chains, which could anchor uncharged amide residues to a positively charged mineral surface. From RNA virus structure and replication in vitro, the first genes seemed to be RNA segments spliced into tRNA. Expansion of the code reduced the risk of mutation to an unreadable codon. This step was conditional on initiation at the 5'-codon of a translated sequence. Incorporation of increasingly hydrophobic amino acids accompanied expansion. As codons of the NUN set were assigned most slowly, they received the most nonpolar amino acids. The origin of ferredoxin and Gln synthetase was traced to mid-expansion phase. Surface metabolism ceased by the end of code expansion, as cells bounded by a proteo-phospholipid membrane, with a protoATPase, had emerged. Incorporation of positively charged and aromatic amino acids followed. They entered the post-expansion code by codon capture. Synthesis of efficient enzymes with acid-base catalysis was then possible. Both types of aminoacyl-tRNA synthetases were attributed to this stage. tRNA sequence diversity and error rates in RNA replication indicate the code evolved within 20 million yr in the preIsuan era. These findings on the genetic code provide empirical evidence, from a contemporaneous source, that a surface reaction network, centred on C-fixing autocatalytic cycles, rapidly led to cellular life on Earth.
NASA Technical Reports Server (NTRS)
Hinds, Erold W. (Principal Investigator)
1996-01-01
This report describes the progress made towards the completion of a specific task on error-correcting coding. The proposed research consisted of investigating the use of modulation block codes as the inner code of a concatenated coding system in order to improve the overall space link communications performance. The study proposed to identify and analyze candidate codes that will complement the performance of the overall coding system which uses the interleaved RS (255,223) code as the outer code.
Integrated Task And Data Parallel Programming: Language Design
NASA Technical Reports Server (NTRS)
Grimshaw, Andrew S.; West, Emily A.
1998-01-01
his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1997-05-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, the authors describe these different parallel algorithms and report on computational experiments that they have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. The authors focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but they also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.
IMPAIR: massively parallel deconvolution on the GPU
NASA Astrophysics Data System (ADS)
Sherry, Michael; Shearer, Andy
2013-02-01
The IMPAIR software is a high throughput image deconvolution tool for processing large out-of-core datasets of images, varying from large images with spatially varying PSFs to large numbers of images with spatially invariant PSFs. IMPAIR implements a parallel version of the tried and tested Richardson-Lucy deconvolution algorithm regularised via a custom wavelet thresholding library. It exploits the inherently parallel nature of the convolution operation to achieve quality results on consumer grade hardware: through the NVIDIA Tesla GPU implementation, the multi-core OpenMP implementation, and the cluster computing MPI implementation of the software. IMPAIR aims to address the problem of parallel processing in both top-down and bottom-up approaches: by managing the input data at the image level, and by managing the execution at the instruction level. These combined techniques will lead to a scalable solution with minimal resource consumption and maximal load balancing. IMPAIR is being developed as both a stand-alone tool for image processing, and as a library which can be embedded into non-parallel code to transparently provide parallel high throughput deconvolution.
Parallel algorithms for the spectral transform method
Foster, I.T.; Worley, P.H.
1994-04-01
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations or a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional FFTs and other parallel transforms.
Relative Debugging of Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)
2002-01-01
We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify, the program execution with out changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.
Support for Debugging Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)
2001-01-01
We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.
A parallel adaptive mesh refinement algorithm
NASA Technical Reports Server (NTRS)
Quirk, James J.; Hanebutte, Ulf R.
1993-01-01
Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.
Quantum decision tree classifier
NASA Astrophysics Data System (ADS)
Lu, Songfeng; Braunstein, Samuel L.
2013-11-01
We study the quantum version of a decision tree classifier to fill the gap between quantum computation and machine learning. The quantum entropy impurity criterion which is used to determine which node should be split is presented in the paper. By using the quantum fidelity measure between two quantum states, we cluster the training data into subclasses so that the quantum decision tree can manipulate quantum states. We also propose algorithms constructing the quantum decision tree and searching for a target class over the tree for a new quantum object.
NASA Astrophysics Data System (ADS)
Kalay, Ziya; Ben-Naim, Eli
2015-03-01
We investigate the fragmentation of a random recursive tree by repeated removal of nodes, resulting in a forest of disjoint trees. The initial tree is generated by sequentially attaching new nodes to randomly chosen existing nodes until the tree contains N nodes. As nodes are removed, one at a time, the tree dissolves into an ensemble of separate trees, namely a forest. We study the statistical properties of trees and nodes in this heterogeneous forest. In the limit N --> ∞ , we find that the system is characterized by a single parameter: the fraction of remaining nodes m. We obtain analytically the size density ϕs of trees of size s, which has a power-law tail ϕs ~s-α , with exponent α = 1 + 1 / m . Therefore, the tail becomes steeper as further nodes are removed, producing an unusual scaling exponent that increases continuously with time. Furthermore, we investigate the fragment size distribution in a growing tree, where nodes are added as well as removed, and find that the distribution for this case is much narrower.
Bavykin, Sergey; Alferov, Oleg
2006-08-01
The purpose of this program is to generate probes specific for the group of sequences that belong to a given phylogenetic node. For each node of the input tree, this program selects probes that are positive for all sequences that belong to this node and negative for all that doesn't. The program uses condensed tree for probe representation to save computer memory. As a result of calculation, the program prints lists for each node from the tree. Input file formats: FASTA for sequence database and ARB tree for phylogenetic organization of nodes. Output file format: text file.
A high-speed linear algebra library with automatic parallelism
NASA Technical Reports Server (NTRS)
Boucher, Michael L.
1994-01-01
Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.
ClustalW-MPI: ClustalW analysis using distributed and parallel computing.
Li, Kuo-Bin
2003-08-12
ClustalW is a tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three steps: pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. All three steps have been parallelized to reduce the execution time. The software uses a message-passing library called MPI (Message Passing Interface) and runs on distributed workstation clusters as well as on traditional parallel computers.
International assessment of PCA codes
Neymotin, L.; Lui, C.; Glynn, J.; Archarya, S.
1993-11-01
Over the past three years (1991-1993), an extensive international exercise for intercomparison of a group of six Probabilistic Consequence Assessment (PCA) codes was undertaken. The exercise was jointly sponsored by the Commission of European Communities (CEC) and OECD Nuclear Energy Agency. This exercise was a logical continuation of a similar effort undertaken by OECD/NEA/CSNI in 1979-1981. The PCA codes are currently used by different countries for predicting radiological health and economic consequences of severe accidents at nuclear power plants (and certain types of non-reactor nuclear facilities) resulting in releases of radioactive materials into the atmosphere. The codes participating in the exercise were: ARANO (Finland), CONDOR (UK), COSYMA (CEC), LENA (Sweden), MACCS (USA), and OSCAAR (Japan). In parallel with this inter-code comparison effort, two separate groups performed a similar set of calculations using two of the participating codes, MACCS and COSYMA. Results of the intercode and inter-MACCS comparisons are presented in this paper. The MACCS group included four participants: GREECE: Institute of Nuclear Technology and Radiation Protection, NCSR Demokritos; ITALY: ENEL, ENEA/DISP, and ENEA/NUC-RIN; SPAIN: Universidad Politecnica de Madrid (UPM) and Consejo de Seguridad Nuclear; USA: Brookhaven National Laboratory, US NRC and DOE.
Parallel architectures for vision
Maresca, M. ); Lavin, M.A. ); Li, H. )
1988-08-01
Vision computing involves the execution of a large number of operations on large sets of structured data. Sequential computers cannot achieve the speed required by most of the current applications and therefore parallel architectural solutions have to be explored. In this paper the authors examine the options that drive the design of a vision oriented computer, starting with the analysis of the basic vision computation and communication requirements. They briefly review the classical taxonomy for parallel computers, based on the multiplicity of the instruction and data stream, and apply a recently proposed criterion, the degree of autonomy of each processor, to further classify fine-grain SIMD massively parallel computers. They identify three types of processor autonomy, namely operation autonomy, addressing autonomy, and connection autonomy. For each type they give the basic definitions and show some examples. They focus on the concept of connection autonomy, which they believe is a key point in the development of massively parallel architectures for vision. They show two examples of parallel computers featuring different types of connection autonomy - the Connection Machine and the Polymorphic-Torus - and compare their cost and benefit.
Sublattice parallel replica dynamics
NASA Astrophysics Data System (ADS)
Martínez, Enrique; Uberuaga, Blas P.; Voter, Arthur F.
2014-06-01
Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998), 10.1103/PhysRevB.57.R13985] by combining it with the synchronous sublattice approach of Shim and Amar [Y. Shim and J. G. Amar, Phys. Rev. B 71, 125432 (2005), 10.1103/PhysRevB.71.125432], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.
The Particle Accelerator Simulation Code PyORBIT
Gorlov, Timofey V; Holmes, Jeffrey A; Cousineau, Sarah M; Shishlo, Andrei P
2015-01-01
The particle accelerator simulation code PyORBIT is presented. The structure, implementation, history, parallel and simulation capabilities, and future development of the code are discussed. The PyORBIT code is a new implementation and extension of algorithms of the original ORBIT code that was developed for the Spallation Neutron Source accelerator at the Oak Ridge National Laboratory. The PyORBIT code has a two level structure. The upper level uses the Python programming language to control the flow of intensive calculations performed by the lower level code implemented in the C++ language. The parallel capabilities are based on MPI communications. The PyORBIT is an open source code accessible to the public through the Google Open Source Projects Hosting service.
Solving tridiagonal linear systems on the Butterfly parallel computer
Kumar, S.P.
1989-01-01
A parallel block partitioning method to solve a tri-diagonal system of linear equations is adapted to the BBN Butterfly multiprocessor. A performance analysis of the programming experiments on the 32-node Butterfly is presented. An upper bound on the number of processors to achieve the best performance with this method is derived. The computational results verify the theoretical speedup and efficiency results of the parallel algorithm over its serial counterpart. Also included is a study comparing performance runs of the same code on the Butterfly processor with a hardware floating point unit and on one with a software floating point facility. The total parallel time of the given code is considerably reduced by making use of the hardware floating point facility whereas the speedup and efficiency of the parallel program considerably improve on the system with software floating point capability. The achieved results are shown to be within 82% to 90% of the predicted performance.
Rooting the ribosomal tree of life.
Fournier, Gregory P; Gogarten, J Peter
2010-08-01
The origin of the genetic code and the rooting of the tree of life (ToL) are two of the most challenging problems in the study of life's early evolution. Although both have been the focus of numerous investigations utilizing a variety of methods, until now, each problem has been addressed independently. Typically, attempts to root the ToL have relied on phylogenies of genes with ancient duplications, which are subject to artifacts of tree reconstruction and horizontal gene transfer, or specific physiological characters believed to be primitive, which are often based on subjective criteria. Here, we demonstrate a unique method for rooting based on the identification of amino acid usage biases comprising the residual signature of a more primitive genetic code. Using a phylogenetic tree of concatenated ribosomal proteins, our analysis of amino acid compositional bias detects a strong and unique signal associated with the early expansion of the genetic code, placing the root of the translation machinery along the bacterial branch.
Wetzstein, M.; Nelson, Andrew F.; Naab, T.; Burkert, A.
2009-10-01
We present a numerical code for simulating the evolution of astrophysical systems using particles to represent the underlying fluid flow. The code is written in Fortran 95 and is designed to be versatile, flexible, and extensible, with modular options that can be selected either at the time the code is compiled or at run time through a text input file. We include a number of general purpose modules describing a variety of physical processes commonly required in the astrophysical community and we expect that the effort required to integrate additional or alternate modules into the code will be small. In its simplest form the code can evolve the dynamical trajectories of a set of particles in two or three dimensions using a module which implements either a Leapfrog or Runge-Kutta-Fehlberg integrator, selected by the user at compile time. The user may choose to allow the integrator to evolve the system using individual time steps for each particle or with a single, global time step for all. Particles may interact gravitationally as N-body particles, and all or any subset may also interact hydrodynamically, using the smoothed particle hydrodynamic (SPH) method by selecting the SPH module. A third particle species can be included with a module to model massive point particles which may accrete nearby SPH or N-body particles. Such particles may be used to model, e.g., stars in a molecular cloud. Free boundary conditions are implemented by default, and a module may be selected to include periodic boundary conditions. We use a binary 'Press' tree to organize particles for rapid access in gravity and SPH calculations. Modules implementing an interface with special purpose 'GRAPE' hardware may also be selected to accelerate the gravity calculations. If available, forces obtained from the GRAPE coprocessors may be transparently substituted for those obtained from the tree, or both tree and GRAPE may be used as a combination GRAPE/tree code. The code may be run without
NASA Astrophysics Data System (ADS)
Wetzstein, M.; Nelson, Andrew F.; Naab, T.; Burkert, A.
2009-10-01
We present a numerical code for simulating the evolution of astrophysical systems using particles to represent the underlying fluid flow. The code is written in Fortran 95 and is designed to be versatile, flexible, and extensible, with modular options that can be selected either at the time the code is compiled or at run time through a text input file. We include a number of general purpose modules describing a variety of physical processes commonly required in the astrophysical community and we expect that the effort required to integrate additional or alternate modules into the code will be small. In its simplest form the code can evolve the dynamical trajectories of a set of particles in two or three dimensions using a module which implements either a Leapfrog or Runge-Kutta-Fehlberg integrator, selected by the user at compile time. The user may choose to allow the integrator to evolve the system using individual time steps for each particle or with a single, global time step for all. Particles may interact gravitationally as N-body particles, and all or any subset may also interact hydrodynamically, using the smoothed particle hydrodynamic (SPH) method by selecting the SPH module. A third particle species can be included with a module to model massive point particles which may accrete nearby SPH or N-body particles. Such particles may be used to model, e.g., stars in a molecular cloud. Free boundary conditions are implemented by default, and a module may be selected to include periodic boundary conditions. We use a binary "Press" tree to organize particles for rapid access in gravity and SPH calculations. Modules implementing an interface with special purpose "GRAPE" hardware may also be selected to accelerate the gravity calculations. If available, forces obtained from the GRAPE coprocessors may be transparently substituted for those obtained from the tree, or both tree and GRAPE may be used as a combination GRAPE/tree code. The code may be run without
Global Arrays Parallel Programming Toolkit
Nieplocha, Jaroslaw; Krishnan, Manoj Kumar; Palmer, Bruce J.; Tipparaju, Vinod; Harrison, Robert J.; Chavarría-Miranda, Daniel
2011-01-01
The two predominant classes of programming models for parallel computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing. However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC, Titanium, and, to a lesser extent, Co-Array Fortran. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF, ZPL, and Data Parallel C. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving
Parallel Monte Carlo simulation of multilattice thin film growth
NASA Astrophysics Data System (ADS)
Shu, J. W.; Lu, Qin; Wong, Wai-on; Huang, Han-chen
2001-07-01
This paper describe a new parallel algorithm for the multi-lattice Monte Carlo atomistic simulator for thin film deposition (ADEPT), implemented on parallel computer using the PVM (Parallel Virtual Machine) message passing library. This parallel algorithm is based on domain decomposition with overlapping and asynchronous communication. Multiple lattices are represented by a single reference lattice through one-to-one mappings, with resulting computational demands being comparable to those in the single-lattice Monte Carlo model. Asynchronous communication and domain overlapping techniques are used to reduce the waiting time and communication time among parallel processors. Results show that the algorithm is highly efficient with large number of processors. The algorithm was implemented on a parallel machine with 50 processors, and it is suitable for parallel Monte Carlo simulation of thin film growth with either a distributed memory parallel computer or a shared memory machine with message passing libraries. In this paper, the significant communication time in parallel MC simulation of thin film growth is effectively reduced by adopting domain decomposition with overlapping between sub-domains and asynchronous communication among processors. The overhead of communication does not increase evidently and speedup shows an ascending tendency when the number of processor increases. A near linear increase in computing speed was achieved with number of processors increases and there is no theoretical limit on the number of processors to be used. The techniques developed in this work are also suitable for the implementation of the Monte Carlo code on other parallel systems.
Tauke-Pedretti, Anna; Skogen, Erik J; Vawter, Gregory A
2014-05-20
An optical sampler includes a first and second 1.times.n optical beam splitters splitting an input optical sampling signal and an optical analog input signal into n parallel channels, respectively, a plurality of optical delay elements providing n parallel delayed input optical sampling signals, n photodiodes converting the n parallel optical analog input signals into n respective electrical output signals, and n optical modulators modulating the input optical sampling signal or the optical analog input signal by the respective electrical output signals, and providing n successive optical samples of the optical analog input signal. A plurality of output photodiodes and eADCs convert the n successive optical samples to n successive digital samples. The optical modulator may be a photodiode interconnected Mach-Zehnder Modulator. A method of sampling the optical analog input signal is disclosed.
Parallel VLSI architecture emulation and the organization of APSA/MPP
NASA Technical Reports Server (NTRS)
Odonnell, John T.
1987-01-01
The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.
Multiple Independent File Parallel I/O with HDF5
Miller, M. C.
2016-07-13
The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(10^{6}) parallel tasks.
Walking tree heuristics for biological string alignment, gene location, and phylogenies
NASA Astrophysics Data System (ADS)
Cull, P.; Holloway, J. L.; Cavener, J. D.
1999-03-01
Basic biological information is stored in strings of nucleic acids (DNA, RNA) or amino acids (proteins). Teasing out the meaning of these strings is a central problem of modern biology. Matching and aligning strings brings out their shared characteristics. Although string matching is well-understood in the edit-distance model, biological strings with transpositions and inversions violate this model's assumptions. We propose a family of heuristics called walking trees to align biologically reasonable strings. Both edit-distance and walking tree methods can locate specific genes within a large string when the genes' sequences are given. When we attempt to match whole strings, the walking tree matches most genes, while the edit-distance method fails. We also give examples in which the walking tree matches substrings even if they have been moved or inverted. The edit-distance method was not designed to handle these problems. We include an example in which the walking tree "discovered" a gene. Calculating scores for whole genome matches gives a method for approximating evolutionary distance. We show two evolutionary trees for the picornaviruses which were computed by the walking tree heuristic. Both of these trees show great similarity to previously constructed trees. The point of this demonstration is that WHOLE genomes can be matched and distances calculated. The first tree was created on a Sequent parallel computer and demonstrates that the walking tree heuristic can be efficiently parallelized. The second tree was created using a network of work stations and demonstrates that there is suffient parallelism in the phylogenetic tree calculation that the sequential walking tree can be used effectively on a network.
Bailey, David H.
2009-11-15
The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym 'NAS' originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains 'NAS.' The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, LeoDagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage
NASA Technical Reports Server (NTRS)
Denning, Peter J.; Tichy, Walter F.
1990-01-01
Among the highly parallel computing architectures required for advanced scientific computation, those designated 'MIMD' and 'SIMD' have yielded the best results to date. The present development status evaluation of such architectures shown neither to have attained a decisive advantage in most near-homogeneous problems' treatment; in the cases of problems involving numerous dissimilar parts, however, such currently speculative architectures as 'neural networks' or 'data flow' machines may be entailed. Data flow computers are the most practical form of MIMD fine-grained parallel computers yet conceived; they automatically solve the problem of assigning virtual processors to the real processors in the machine.
Adaptive parallel logic networks
NASA Technical Reports Server (NTRS)
Martinez, Tony R.; Vidal, Jacques J.
1988-01-01
Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.
Transferring ecosystem simulation codes to supercomputers
NASA Technical Reports Server (NTRS)
Skiles, J. W.; Schulbach, C. H.
1995-01-01
Many ecosystem simulation computer codes have been developed in the last twenty-five years. This development took place initially on main-frame computers, then mini-computers, and more recently, on micro-computers and workstations. Supercomputing platforms (both parallel and distributed systems) have been largely unused, however, because of the perceived difficulty in accessing and using the machines. Also, significant differences in the system architectures of sequential, scalar computers and parallel and/or vector supercomputers must be considered. We have transferred a grassland simulation model (developed on a VAX) to a Cray Y-MP/C90. We describe porting the model to the Cray and the changes we made to exploit the parallelism in the application and improve code execution. The Cray executed the model 30 times faster than the VAX and 10 times faster than a Unix workstation. We achieved an additional speedup of 30 percent by using the compiler's vectoring and 'in-line' capabilities. The code runs at only about 5 percent of the Cray's peak speed because it ineffectively uses the vector and parallel processing capabilities of the Cray. We expect that by restructuring the code, it could execute an additional six to ten times faster.
The same-source parallel MM{sub 5}.
Michalakes, J.
1999-08-23
The set of architectures available to users of the Penn State/NCAR MM5 has been expanded to included distributed-memory parallel computers, providing cost-effective scalable performance and memory capacity for large problem sizes. The same-source approach uses high-level parallel library and source-translation technology for adapting MM5, simplifying maintenance and allowing new physics modules to be incorporated without modification. The approach facilitates maintenance of the DM-parallel option to MM5 as an option within the official version, rather than as a separate stand-alone version. As a result, the DM-parallel option to MM5 (now at Version 3.1) has been a part of six subsequent model releases since MM5 Version 2.8 in March 1998. The same-source approach is applicable to other, similarly constructed codes when there is a need or desire to develop the code for distributed memory parallel machines without impacting the pre-existing source code. The approach is also compatible with pre-existing loop-level multithreading directives so that the code will run in distributed-memory/shared-memory mode on SMP clusters.
Parallel implementation of a Monte Carlo molecular stimulation program
Carvalho; Gomes; Cordeiro
2000-05-01
Molecular simulation methods such as molecular dynamics and Monte Carlo are fundamental for the theoretical calculation of macroscopic and microscopic properties of chemical and biochemical systems. These methods often rely on heavy computations, and one sometimes feels the need to run them in powerful massively parallel machines. For moderate problem sizes, however, a not so powerful and less expensive solution based on a network of workstations may be quite satisfactory. In the present work, the strategy adopted in the development of a parallel version is outlined, using the message passing model, of a molecular simulation code to be used in a network of workstations. This parallel code is the adaptation of an older sequential code using the Metropolis Monte Carlo method. In this case, the message passing interface was used as the interprocess communications library, although the code could be easily adapted for other message passing systems such as the parallel virtual machine. For simple systems it is shown that speedups of 2 can be achieved for four processes with this cheap solution. For bigger and more complex simulated systems, even better speedups might be obtained, which indicates that the presented approach is appropriate for the efficient use of a network of workstations in parallel processing.
Technology Transfer Automated Retrieval System (TEKTRAN)
The major tree nuts include almonds, Brazil nuts, cashew nuts, hazelnuts, macadamia nuts, pecans, pine nuts, pistachio nuts, and walnuts. Tree nut oils are appreciated in food applications because of their flavors and are generally more expensive than other gourmet oils. Research during the last de...
ERIC Educational Resources Information Center
Braus, Judy, Ed.
1992-01-01
Ranger Rick's NatureScope is a creative education series dedicated to inspiring in children an understanding and appreciation of the natural world while developing the skills they will need to make responsible decisions about the environment. Contents are organized into the following sections: (1) "What Makes a Tree a Tree?," including…
ERIC Educational Resources Information Center
Rubino, Darrin L.; Hanson, Deborah
2009-01-01
The circles and patterns in a tree's stem tell a story, but that story can be a mystery. Interpreting the story of tree rings provides a way to heighten the natural curiosity of students and help them gain insight into the interaction of elements in the environment. It also represents a wonderful opportunity to incorporate the nature of science.…
ERIC Educational Resources Information Center
Lewis, Richard
2004-01-01
Lewis's own experiences living in Indonesia are fertile ground for telling "a ripping good story," one found in "The Flame Tree." He hopes people will enjoy the tale and appreciate the differences of an unfamiliar culture. The excerpt from "The Flame Tree" will reel readers in quickly.
NASA Astrophysics Data System (ADS)
Kalay, Z.; Ben-Naim, E.
2015-01-01
We study fragmentation of a random recursive tree into a forest by repeated removal of nodes. The initial tree consists of N nodes and it is generated by sequential addition of nodes with each new node attaching to a randomly-selected existing node. As nodes are removed from the tree, one at a time, the tree dissolves into an ensemble of separate trees, namely, a forest. We study statistical properties of trees and nodes in this heterogeneous forest, and find that the fraction of remaining nodes m characterizes the system in the limit N\\to ∞ . We obtain analytically the size density {{φ }s} of trees of size s. The size density has power-law tail {{φ }s}˜ {{s}-α } with exponent α =1+\\frac{1}{m}. Therefore, the tail becomes steeper as further nodes are removed, and the fragmentation process is unusual in that exponent α increases continuously with time. We also extend our analysis to the case where nodes are added as well as removed, and obtain the asymptotic size density for growing trees.
ERIC Educational Resources Information Center
Greer, Sandy
1993-01-01
Describes Trees for Mother Earth, a program in which secondary students raise funds to buy fruit trees to plant during visits to the Navajo Reservation. Benefits include developing feelings of self-worth among participants, promoting cultural exchange and understanding, and encouraging self-sufficiency among the Navajo. (LP)
Kolar, C.A.; Ashby, W.C.
1982-07-01
A five-year research programme was started in 1978 in the Botany Department of Southern Illinois University to evaluate the effect of reclamation practices on tree survival and growth. The project was initiated as a direct result of reports from Illinois and Indiana of tree-planting failures on mined lands reclaimed to current regulation standards.
Structural Equation Model Trees
ERIC Educational Resources Information Center
Brandmaier, Andreas M.; von Oertzen, Timo; McArdle, John J.; Lindenberger, Ulman
2013-01-01
In the behavioral and social sciences, structural equation models (SEMs) have become widely accepted as a modeling tool for the relation between latent and observed variables. SEMs can be seen as a unification of several multivariate analysis techniques. SEM Trees combine the strengths of SEMs and the decision tree paradigm by building tree…
ERIC Educational Resources Information Center
NatureScope, 1986
1986-01-01
Provides: (1) background information on how trees have influenced human history and how trees affect people today; (2) four activities dealing with these topics; and (3) a ready-to-copy page related to paper and plastics. Activities include an objective, recommended age level(s), subject area(s), list of materials needed, and procedures. (JN)
Dependency Tree Annotation Software
2015-11-01
Features 5 Distribution List 12 iv List of Figures Fig. 1 Manually created dependency tree for the sentence, “The little cat ate the pie...with a dependency relation label. Fig. 1 Manually created dependency tree for the sentence, “The little cat ate the pie” The user can easily
A fast algorithm for reordering sparse matrices for parallel factorization
Lewis, J.G.; Peyton, B.W.; Pothen, A.
1989-01-01
Jess and Kees introduced a method for ordering a sparse symmetric matrix A for efficient parallel factorization. The parallel ordering is computed in two steps. First, the matrix A is ordered by some fill-reducing ordering. Second, a parallel ordering of A is computed from the filled graph that results from factoring A using the initial fill-reducing ordering. Among all orderings whose fill lies in the filled graph, this parallel ordering achieves the minimum number of parallel steps in the factorization of A. Jess and Kees did not specify the implementation details of an algorithm for either step of this scheme. Liu and Mirzaian (1987) designed an algorithm implementing the second step, but it has time and space requirements higher than the cost of computing common fill-reducing orderings. We present here a new fast algorithm that implements the parallel ordering step by exploiting the clique tree representation of a chordal graph. We succeed in reducing the cost of the parallel ordering step well below that of the fill-reducing step. Our algorithm has time and space complexity linear in the number of compressed subscripts of L, i.e., the sum of the sizes of the maximal cliques of the filled graph. Empirically we demonstrate running times nearly identical to Liu's heuristic Composite Rotations algorithm that approximates the minimum number of parallel steps. 21 refs., 3 figs., 4 tabs.