domain decomposition parallelization: Topics by Science.gov

Sample records for domain decomposition parallelization

A Domain Decomposition Parallelization of the Fast Marching Method

NASA Technical Reports Server (NTRS)

Herrmann, M.

2003-01-01

In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method is strongly dependent on load balancing separately the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G(sub 0)-based parallelization will be investigated.
Domain Decomposition By the Advancing-Partition Method for Parallel Unstructured Grid Generation

NASA Technical Reports Server (NTRS)

Pirzadeh, Shahyar Z.; Zagaris, George

2009-01-01

A new method of domain decomposition has been developed for generating unstructured grids in subdomains either sequentially or using multiple computers in parallel. Domain decomposition is a crucial and challenging step for parallel grid generation. Prior methods are generally based on auxiliary, complex, and computationally intensive operations for defining partition interfaces and usually produce grids of lower quality than those generated in single domains. The new technique, referred to as "Advancing Partition," is based on the Advancing-Front method, which partitions a domain as part of the volume mesh generation in a consistent and "natural" way. The benefits of this approach are: 1) the process of domain decomposition is highly automated, 2) partitioning of domain does not compromise the quality of the generated grids, and 3) the computational overhead for domain decomposition is minimal. The new method has been implemented in NASA's unstructured grid generation code VGRID.
Parallel CE/SE Computations via Domain Decomposition

NASA Technical Reports Server (NTRS)

Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

2000-01-01

This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.
Traffic Simulations on Parallel Computers Using Domain Decomposition Techniques

DOT National Transportation Integrated Search

1995-01-01

Large scale simulations of Intelligent Transportation Systems (ITS) can only be acheived by using the computing resources offered by parallel computing architectures. Domain decomposition techniques are proposed which allow the performance of traffic...
Parallelization of PANDA discrete ordinates code using spatial decomposition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humbert, P.

2006-07-01

We present the parallel method, based on spatial domain decomposition, implemented in the 2D and 3D versions of the discrete Ordinates code PANDA. The spatial mesh is orthogonal and the spatial domain decomposition is Cartesian. For 3D problems a 3D Cartesian domain topology is created and the parallel method is based on a domain diagonal plane ordered sweep algorithm. The parallel efficiency of the method is improved by directions and octants pipelining. The implementation of the algorithm is straightforward using MPI blocking point to point communications. The efficiency of the method is illustrated by an application to the 3D-Ext C5G7more » benchmark of the OECD/NEA. (authors)« less
Domain Decomposition By the Advancing-Partition Method

NASA Technical Reports Server (NTRS)

Pirzadeh, Shahyar Z.

2008-01-01

A new method of domain decomposition has been developed for generating unstructured grids in subdomains either sequentially or using multiple computers in parallel. Domain decomposition is a crucial and challenging step for parallel grid generation. Prior methods are generally based on auxiliary, complex, and computationally intensive operations for defining partition interfaces and usually produce grids of lower quality than those generated in single domains. The new technique, referred to as "Advancing Partition," is based on the Advancing-Front method, which partitions a domain as part of the volume mesh generation in a consistent and "natural" way. The benefits of this approach are: 1) the process of domain decomposition is highly automated, 2) partitioning of domain does not compromise the quality of the generated grids, and 3) the computational overhead for domain decomposition is minimal. The new method has been implemented in NASA's unstructured grid generation code VGRID.
Domain Decomposition: A Bridge between Nature and Parallel Computers

DTIC Science & Technology

1992-09-01

B., "Domain Decomposition Algorithms for Indefinite Elliptic Problems," S"IAM Journal of S; cientific and Statistical (’omputing, Vol. 13, 1992, pp...AD-A256 575 NASA Contractor Report 189709 ICASE Report No. 92-44 ICASE DOMAIN DECOMPOSITION: A BRIDGE BETWEEN NATURE AND PARALLEL COMPUTERS DTIC dE...effectively implemented on dis- tributed memory multiprocessors. In 1990 (as reported in Ref. 38 using the tile algo- rithm), a 103,201-unknown 2D elliptic
Performance evaluation of canny edge detection on a tiled multicore architecture

NASA Astrophysics Data System (ADS)

Brethorst, Andrew Z.; Desai, Nehal; Enright, Douglas P.; Scrofano, Ronald

2011-01-01

In the last few years, a variety of multicore architectures have been used to parallelize image processing applications. In this paper, we focus on assessing the parallel speed-ups of different Canny edge detection parallelization strategies on the Tile64, a tiled multicore architecture developed by the Tilera Corporation. Included in these strategies are different ways Canny edge detection can be parallelized, as well as differences in data management. The two parallelization strategies examined were loop-level parallelism and domain decomposition. Loop-level parallelism is achieved through the use of OpenMP,1 and it is capable of parallelization across the range of values over which a loop iterates. Domain decomposition is the process of breaking down an image into subimages, where each subimage is processed independently, in parallel. The results of the two strategies show that for the same number of threads, programmer implemented, domain decomposition exhibits higher speed-ups than the compiler managed, loop-level parallelism implemented with OpenMP.
Multitasking domain decomposition fast Poisson solvers on the Cray Y-MP

NASA Technical Reports Server (NTRS)

Chan, Tony F.; Fatoohi, Rod A.

1990-01-01

The results of multitasking implementation of a domain decomposition fast Poisson solver on eight processors of the Cray Y-MP are presented. The object of this research is to study the performance of domain decomposition methods on a Cray supercomputer and to analyze the performance of different multitasking techniques using highly parallel algorithms. Two implementations of multitasking are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). A conventional FFT-based fast Poisson solver is also multitasked. The results of different implementations are compared and analyzed. A speedup of over 7.4 on the Cray Y-MP running in a dedicated environment is achieved for all cases.
An asymptotic induced numerical method for the convection-diffusion-reaction equation

NASA Technical Reports Server (NTRS)

Scroggs, Jeffrey S.; Sorensen, Danny C.

1988-01-01

A parallel algorithm for the efficient solution of a time dependent reaction convection diffusion equation with small parameter on the diffusion term is presented. The method is based on a domain decomposition that is dictated by singular perturbation analysis. The analysis is used to determine regions where certain reduced equations may be solved in place of the full equation. Parallelism is evident at two levels. Domain decomposition provides parallelism at the highest level, and within each domain there is ample opportunity to exploit parallelism. Run time results demonstrate the viability of the method.
A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations

NASA Technical Reports Server (NTRS)

Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos

2009-01-01

A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building-blocks of the framework consist of: (1) the Advancing-Partition (AP) method used for domain decomposition and (2) the Advancing Front (AF) method used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results by this approach are presented and discussed in detail as well as future work and improvements.
Parallel computing of a climate model on the dawn 1000 by domain decomposition method

NASA Astrophysics Data System (ADS)

Bi, Xunqiang

1997-12-01

In this paper the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massive parallel computer made by National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. The potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputation are also discussed.
A Dual Super-Element Domain Decomposition Approach for Parallel Nonlinear Finite Element Analysis

NASA Astrophysics Data System (ADS)

Jokhio, G. A.; Izzuddin, B. A.

2015-05-01

This article presents a new domain decomposition method for nonlinear finite element analysis introducing the concept of dual partition super-elements. The method extends ideas from the displacement frame method and is ideally suited for parallel nonlinear static/dynamic analysis of structural systems. In the new method, domain decomposition is realized by replacing one or more subdomains in a "parent system," each with a placeholder super-element, where the subdomains are processed separately as "child partitions," each wrapped by a dual super-element along the partition boundary. The analysis of the overall system, including the satisfaction of equilibrium and compatibility at all partition boundaries, is realized through direct communication between all pairs of placeholder and dual super-elements. The proposed method has particular advantages for matrix solution methods based on the frontal scheme, and can be readily implemented for existing finite element analysis programs to achieve parallelization on distributed memory systems with minimal intervention, thus overcoming memory bottlenecks typically faced in the analysis of large-scale problems. Several examples are presented in this article which demonstrate the computational benefits of the proposed parallel domain decomposition approach and its applicability to the nonlinear structural analysis of realistic structural systems.
Using domain decomposition in the multigrid NAS parallel benchmark on the Fujitsu VPP500

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, J.C.H.; Lung, H.; Katsumata, Y.

1995-12-01

In this paper, we demonstrate how domain decomposition can be applied to the multigrid algorithm to convert the code for MPP architectures. We also discuss the performance and scalability of this implementation on the new product line of Fujitsu`s vector parallel computer, VPP500. This computer has Fujitsu`s well-known vector processor as the PE each rated at 1.6 C FLOPS. The high speed crossbar network rated at 800 MB/s provides the inter-PE communication. The results show that the physical domain decomposition is the best way to solve MG problems on VPP500.
A Parallel Algorithm for Contact in a Finite Element Hydrocode

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pierce, Timothy G.

A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies changes as collision and sliding occurs. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the samemore » processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests are provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm has been implemented as the contact capability of the ALE3D multiphysics code, and is currently in production use.« less
Scalable parallel elastic-plastic finite element analysis using a quasi-Newton method with a balancing domain decomposition preconditioner

NASA Astrophysics Data System (ADS)

Yusa, Yasunori; Okada, Hiroshi; Yamada, Tomonori; Yoshimura, Shinobu

2018-04-01

A domain decomposition method for large-scale elastic-plastic problems is proposed. The proposed method is based on a quasi-Newton method in conjunction with a balancing domain decomposition preconditioner. The use of a quasi-Newton method overcomes two problems associated with the conventional domain decomposition method based on the Newton-Raphson method: (1) avoidance of a double-loop iteration algorithm, which generally has large computational complexity, and (2) consideration of the local concentration of nonlinear deformation, which is observed in elastic-plastic problems with stress concentration. Moreover, the application of a balancing domain decomposition preconditioner ensures scalability. Using the conventional and proposed domain decomposition methods, several numerical tests, including weak scaling tests, were performed. The convergence performance of the proposed method is comparable to that of the conventional method. In particular, in elastic-plastic analysis, the proposed method exhibits better convergence performance than the conventional method.
Complexity of parallel implementation of domain decomposition techniques for elliptic partial differential equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gropp, W.D.; Keyes, D.E.

1988-03-01

The authors discuss the parallel implementation of preconditioned conjugate gradient (PCG)-based domain decomposition techniques for self-adjoint elliptic partial differential equations in two dimensions on several architectures. The complexity of these methods is described on a variety of message-passing parallel computers as a function of the size of the problem, number of processors and relative communication speeds of the processors. They show that communication startups are very important, and that even the small amount of global communication in these methods can significantly reduce the performance of many message-passing architectures.
Parallel Domain Decomposition Formulation and Software for Large-Scale Sparse Symmetrical/Unsymmetrical Aeroacoustic Applications

NASA Technical Reports Server (NTRS)

Nguyen, D. T.; Watson, Willie R. (Technical Monitor)

2005-01-01

The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)

NASA Technical Reports Server (NTRS)

Riley, Christopher J.; Cheatwood, F. McNeil

1997-01-01

The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
Convergence issues in domain decomposition parallel computation of hovering rotor

NASA Astrophysics Data System (ADS)

Xiao, Zhongyun; Liu, Gang; Mou, Bin; Jiang, Xiong

2018-05-01

Implicit LU-SGS time integration algorithm has been widely used in parallel computation in spite of its lack of information from adjacent domains. When applied to parallel computation of hovering rotor flows in a rotating frame, it brings about convergence issues. To remedy the problem, three LU factorization-based implicit schemes (consisting of LU-SGS, DP-LUR and HLU-SGS) are investigated comparatively. A test case of pure grid rotation is designed to verify these algorithms, which show that LU-SGS algorithm introduces errors on boundary cells. When partition boundaries are circumferential, errors arise in proportion to grid speed, accumulating along with the rotation, and leading to computational failure in the end. Meanwhile, DP-LUR and HLU-SGS methods show good convergence owing to boundary treatment which are desirable in domain decomposition parallel computations.

Information criteria for quantifying loss of reversibility in parallelized KMC

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gourgoulias, Konstantinos, E-mail: gourgoul@math.umass.edu; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Rey-Bellet, Luc, E-mail: luc@math.umass.edu

Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot bemore » computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.« less
Information criteria for quantifying loss of reversibility in parallelized KMC

NASA Astrophysics Data System (ADS)

Gourgoulias, Konstantinos; Katsoulakis, Markos A.; Rey-Bellet, Luc

2017-01-01

Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot be computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.
Domain decomposition methods in aerodynamics

NASA Technical Reports Server (NTRS)

Venkatakrishnan, V.; Saltz, Joel

1990-01-01

Compressible Euler equations are solved for two-dimensional problems by a preconditioned conjugate gradient-like technique. An approximate Riemann solver is used to compute the numerical fluxes to second order accuracy in space. Two ways to achieve parallelism are tested, one which makes use of parallelism inherent in triangular solves and the other which employs domain decomposition techniques. The vectorization/parallelism in triangular solves is realized by the use of a recording technique called wavefront ordering. This process involves the interpretation of the triangular matrix as a directed graph and the analysis of the data dependencies. It is noted that the factorization can also be done in parallel with the wave front ordering. The performances of two ways of partitioning the domain, strips and slabs, are compared. Results on Cray YMP are reported for an inviscid transonic test case. The performances of linear algebra kernels are also reported.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes

NASA Astrophysics Data System (ADS)

Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten

2017-11-01

Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. These two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
Spatiotemporal Domain Decomposition for Massive Parallel Computation of Space-Time Kernel Density

NASA Astrophysics Data System (ADS)

Hohl, A.; Delmelle, E. M.; Tang, W.

2015-07-01

Accelerated processing capabilities are deemed critical when conducting analysis on spatiotemporal datasets of increasing size, diversity and availability. High-performance parallel computing offers the capacity to solve computationally demanding problems in a limited timeframe, but likewise poses the challenge of preventing processing inefficiency due to workload imbalance between computing resources. Therefore, when designing new algorithms capable of implementing parallel strategies, careful spatiotemporal domain decomposition is necessary to account for heterogeneity in the data. In this study, we perform octtree-based adaptive decomposition of the spatiotemporal domain for parallel computation of space-time kernel density. In order to avoid edge effects near subdomain boundaries, we establish spatiotemporal buffers to include adjacent data-points that are within the spatial and temporal kernel bandwidths. Then, we quantify computational intensity of each subdomain to balance workloads among processors. We illustrate the benefits of our methodology using a space-time epidemiological dataset of Dengue fever, an infectious vector-borne disease that poses a severe threat to communities in tropical climates. Our parallel implementation of kernel density reaches substantial speedup compared to sequential processing, and achieves high levels of workload balance among processors due to great accuracy in quantifying computational intensity. Our approach is portable of other space-time analytical tests.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE PAGES

Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...

2017-09-21

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Desai, Ajit; Khalil, Mohammad; Pettit, Chris

Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Parallel deterministic transport sweeps of structured and unstructured meshes with overloaded mesh decompositions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pautz, Shawn D.; Bailey, Teresa S.

Here, the efficiency of discrete ordinates transport sweeps depends on the scheduling algorithm, the domain decomposition, the problem to be solved, and the computational platform. Sweep scheduling algorithms may be categorized by their approach to several issues. In this paper we examine the strategy of domain overloading for mesh partitioning as one of the components of such algorithms. In particular, we extend the domain overloading strategy, previously defined and analyzed for structured meshes, to the general case of unstructured meshes. We also present computational results for both the structured and unstructured domain overloading cases. We find that an appropriate amountmore » of domain overloading can greatly improve the efficiency of parallel sweeps for both structured and unstructured partitionings of the test problems examined on up to 10 5 processor cores.« less
Parallel deterministic transport sweeps of structured and unstructured meshes with overloaded mesh decompositions

DOE PAGES

Pautz, Shawn D.; Bailey, Teresa S.

2016-11-29

Here, the efficiency of discrete ordinates transport sweeps depends on the scheduling algorithm, the domain decomposition, the problem to be solved, and the computational platform. Sweep scheduling algorithms may be categorized by their approach to several issues. In this paper we examine the strategy of domain overloading for mesh partitioning as one of the components of such algorithms. In particular, we extend the domain overloading strategy, previously defined and analyzed for structured meshes, to the general case of unstructured meshes. We also present computational results for both the structured and unstructured domain overloading cases. We find that an appropriate amountmore » of domain overloading can greatly improve the efficiency of parallel sweeps for both structured and unstructured partitionings of the test problems examined on up to 10 5 processor cores.« less
Domain decomposition methods for the parallel computation of reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.

1988-01-01

Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt

Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes

DOE PAGES

Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; ...

2017-11-27

Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scalingmore » laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.« less
An MPI + $X$ implementation of contact global search using Kokkos

DOE PAGES

Hansen, Glen A.; Xavier, Patrick G.; Mish, Sam P.; ...

2015-10-05

This paper describes an approach that seeks to parallelize the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search associated with this operation on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI-X) available on advanced computer architectures (those with GPU coprocessors). Our goal ismore » to demonstrate the efficacy of the MPI-X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain in an attempt to efficiently distribute computational work. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute surface entities (nodes and faces) that are needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees and finding and distributing entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of different computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. Results here demonstrate a 47 % decrease in the time spent within the global search algorithm, comparing the reference ACME algorithm with the GPU implementation, on an 18M face problem using four MPI ranks. As a result, while further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.« less
Domain decomposition: A bridge between nature and parallel computers

NASA Technical Reports Server (NTRS)

Keyes, David E.

1992-01-01

Domain decomposition is an intuitive organizing principle for a partial differential equation (PDE) computation, both physically and architecturally. However, its significance extends beyond the readily apparent issues of geometry and discretization, on one hand, and of modular software and distributed hardware, on the other. Engineering and computer science aspects are bridged by an old but recently enriched mathematical theory that offers the subject not only unity, but also tools for analysis and generalization. Domain decomposition induces function-space and operator decompositions with valuable properties. Function-space bases and operator splittings that are not derived from domain decompositions generally lack one or more of these properties. The evolution of domain decomposition methods for elliptically dominated problems has linked two major algorithmic developments of the last 15 years: multilevel and Krylov methods. Domain decomposition methods may be considered descendants of both classes with an inheritance from each: they are nearly optimal and at the same time efficiently parallelizable. Many computationally driven application areas are ripe for these developments. A progression is made from a mathematically informal motivation for domain decomposition methods to a specific focus on fluid dynamics applications. To be introductory rather than comprehensive, simple examples are provided while convergence proofs and algorithmic details are left to the original references; however, an attempt is made to convey their most salient features, especially where this leads to algorithmic insight.
Dynamic load balancing algorithm for molecular dynamics based on Voronoi cells domain decompositions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fattebert, J.-L.; Richards, D.F.; Glosli, J.N.

2012-12-01

We present a new algorithm for automatic parallel load balancing in classical molecular dynamics. It assumes a spatial domain decomposition of particles into Voronoi cells. It is a gradient method which attempts to minimize a cost function by displacing Voronoi sites associated with each processor/sub-domain along steepest descent directions. Excellent load balance has been obtained for quasi-2D and 3D practical applications, with up to 440·10 6 particles on 65,536 MPI tasks.
A parallel domain decomposition-based implicit method for the Cahn–Hilliard–Cook phase-field equation in 3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zheng, Xiang; Yang, Chao; State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100190

2015-03-15

We present a numerical algorithm for simulating the spinodal decomposition described by the three dimensional Cahn–Hilliard–Cook (CHC) equation, which is a fourth-order stochastic partial differential equation with a noise term. The equation is discretized in space and time based on a fully implicit, cell-centered finite difference scheme, with an adaptive time-stepping strategy designed to accelerate the progress to equilibrium. At each time step, a parallel Newton–Krylov–Schwarz algorithm is used to solve the nonlinear system. We discuss various numerical and computational challenges associated with the method. The numerical scheme is validated by a comparison with an explicit scheme of high accuracymore » (and unreasonably high cost). We present steady state solutions of the CHC equation in two and three dimensions. The effect of the thermal fluctuation on the spinodal decomposition process is studied. We show that the existence of the thermal fluctuation accelerates the spinodal decomposition process and that the final steady morphology is sensitive to the stochastic noise. We also show the evolution of the energies and statistical moments. In terms of the parallel performance, it is found that the implicit domain decomposition approach scales well on supercomputers with a large number of processors.« less
A 3D staggered-grid finite difference scheme for poroelastic wave equation

NASA Astrophysics Data System (ADS)

Zhang, Yijie; Gao, Jinghuai

2014-10-01

Three dimensional numerical modeling has been a viable tool for understanding wave propagation in real media. The poroelastic media can better describe the phenomena of hydrocarbon reservoirs than acoustic and elastic media. However, the numerical modeling in 3D poroelastic media demands significantly more computational capacity, including both computational time and memory. In this paper, we present a 3D poroelastic staggered-grid finite difference (SFD) scheme. During the procedure, parallel computing is implemented to reduce the computational time. Parallelization is based on domain decomposition, and communication between processors is performed using message passing interface (MPI). Parallel analysis shows that the parallelized SFD scheme significantly improves the simulation efficiency and 3D decomposition in domain is the most efficient. We also analyze the numerical dispersion and stability condition of the 3D poroelastic SFD method. Numerical results show that the 3D numerical simulation can provide a real description of wave propagation.
A Linked-Cell Domain Decomposition Method for Molecular Dynamics Simulation on a Scalable Multiprocessor

DOE PAGES

Yang, L. H.; Brooks III, E. D.; Belak, J.

1992-01-01

A molecular dynamics algorithm for performing large-scale simulations using the Parallel C Preprocessor (PCP) programming paradigm on the BBN TC2000, a massively parallel computer, is discussed. The algorithm uses a linked-cell data structure to obtain the near neighbors of each atom as time evoles. Each processor is assigned to a geometric domain containing many subcells and the storage for that domain is private to the processor. Within this scheme, the interdomain (i.e., interprocessor) communication is minimized.
Algebraic multigrid domain and range decomposition (AMG-DD / AMG-RD)*

DOE PAGES

Bank, R.; Falgout, R. D.; Jones, T.; ...

2015-10-29

In modern large-scale supercomputing applications, algebraic multigrid (AMG) is a leading choice for solving matrix equations. However, the high cost of communication relative to that of computation is a concern for the scalability of traditional implementations of AMG on emerging architectures. This paper introduces two new algebraic multilevel algorithms, algebraic multigrid domain decomposition (AMG-DD) and algebraic multigrid range decomposition (AMG-RD), that replace traditional AMG V-cycles with a fully overlapping domain decomposition approach. While the methods introduced here are similar in spirit to the geometric methods developed by Brandt and Diskin [Multigrid solvers on decomposed domains, in Domain Decomposition Methods inmore » Science and Engineering, Contemp. Math. 157, AMS, Providence, RI, 1994, pp. 135--155], Mitchell [Electron. Trans. Numer. Anal., 6 (1997), pp. 224--233], and Bank and Holst [SIAM J. Sci. Comput., 22 (2000), pp. 1411--1443], they differ primarily in that they are purely algebraic: AMG-RD and AMG-DD trade communication for computation by forming global composite “grids” based only on the matrix, not the geometry. (As is the usual AMG convention, “grids” here should be taken only in the algebraic sense, regardless of whether or not it corresponds to any geometry.) Another important distinguishing feature of AMG-RD and AMG-DD is their novel residual communication process that enables effective parallel computation on composite grids, avoiding the all-to-all communication costs of the geometric methods. The main purpose of this paper is to study the potential of these two algebraic methods as possible alternatives to existing AMG approaches for future parallel machines. As a result, this paper develops some theoretical properties of these methods and reports on serial numerical tests of their convergence properties over a spectrum of problem parameters.« less
Coherency strain engineered decomposition of unstable multilayer alloys for improved thermal stability

NASA Astrophysics Data System (ADS)

Forsén, R.; Ghafoor, N.; Odén, M.

2013-12-01

A concept to improve hardness and thermal stability of unstable multilayer alloys is presented based on control of the coherency strain such that the driving force for decomposition is favorably altered. Cathodic arc evaporated cubic TiCrAlN/Ti1-xCrxN multilayer coatings are used as demonstrators. Upon annealing, the coatings undergo spinodal decomposition into nanometer-sized coherent Ti- and Al-rich cubic domains which is affected by the coherency strain. In addition, the growth of the domains is restricted by the surrounding TiCrN layer compared to a non-layered TiCrAlN coating which together results in an improved thermal stability of the cubic structure. A significant hardness increase is seen during decomposition for the case with high coherency strain while a low coherency strain results in a hardness decrease for high annealing temperatures. The metal diffusion paths during the domain coarsening are affected by strain which in turn is controlled by the Cr-content (x) in the Ti1-xCrxN layers. For x = 0 the diffusion occurs both parallel and perpendicular to the growth direction but for x > =0.9 the diffusion occurs predominantly parallel to the growth direction. Altogether this study shows a structural tool to alter and fine-tune high temperature properties of multicomponent materials.

Box schemes and their implementation on the iPSC/860

NASA Technical Reports Server (NTRS)

Chattot, J. J.; Merriam, M. L.

1991-01-01

Research on algoriths for efficiently solving fluid flow problems on massively parallel computers is continued in the present paper. Attention is given to the implementation of a box scheme on the iPSC/860, a massively parallel computer with a peak speed of 10 Gflops and a memory of 128 Mwords. A domain decomposition approach to parallelism is used.
FWT2D: A massively parallel program for frequency-domain full-waveform tomography of wide-aperture seismic data—Part 1: Algorithm

NASA Astrophysics Data System (ADS)

Sourbier, Florent; Operto, Stéphane; Virieux, Jean; Amestoy, Patrick; L'Excellent, Jean-Yves

2009-03-01

This is the first paper in a two-part series that describes a massively parallel code that performs 2D frequency-domain full-waveform inversion of wide-aperture seismic data for imaging complex structures. Full-waveform inversion methods, namely quantitative seismic imaging methods based on the resolution of the full wave equation, are computationally expensive. Therefore, designing efficient algorithms which take advantage of parallel computing facilities is critical for the appraisal of these approaches when applied to representative case studies and for further improvements. Full-waveform modelling requires the resolution of a large sparse system of linear equations which is performed with the massively parallel direct solver MUMPS for efficient multiple-shot simulations. Efficiency of the multiple-shot solution phase (forward/backward substitutions) is improved by using the BLAS3 library. The inverse problem relies on a classic local optimization approach implemented with a gradient method. The direct solver returns the multiple-shot wavefield solutions distributed over the processors according to a domain decomposition driven by the distribution of the LU factors. The domain decomposition of the wavefield solutions is used to compute in parallel the gradient of the objective function and the diagonal Hessian, this latter providing a suitable scaling of the gradient. The algorithm allows one to test different strategies for multiscale frequency inversion ranging from successive mono-frequency inversion to simultaneous multifrequency inversion. These different inversion strategies will be illustrated in the following companion paper. The parallel efficiency and the scalability of the code will also be quantified.
Low-Rank Correction Methods for Algebraic Domain Decomposition Preconditioners

DOE PAGES

Li, Ruipeng; Saad, Yousef

2017-08-01

This study presents a parallel preconditioning method for distributed sparse linear systems, based on an approximate inverse of the original matrix, that adopts a general framework of distributed sparse matrices and exploits domain decomposition (DD) and low-rank corrections. The DD approach decouples the matrix and, once inverted, a low-rank approximation is applied by exploiting the Sherman--Morrison--Woodbury formula, which yields two variants of the preconditioning methods. The low-rank expansion is computed by the Lanczos procedure with reorthogonalizations. Numerical experiments indicate that, when combined with Krylov subspace accelerators, this preconditioner can be efficient and robust for solving symmetric sparse linear systems. Comparisonsmore » with pARMS, a DD-based parallel incomplete LU (ILU) preconditioning method, are presented for solving Poisson's equation and linear elasticity problems.« less
Low-Rank Correction Methods for Algebraic Domain Decomposition Preconditioners

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Ruipeng; Saad, Yousef

This study presents a parallel preconditioning method for distributed sparse linear systems, based on an approximate inverse of the original matrix, that adopts a general framework of distributed sparse matrices and exploits domain decomposition (DD) and low-rank corrections. The DD approach decouples the matrix and, once inverted, a low-rank approximation is applied by exploiting the Sherman--Morrison--Woodbury formula, which yields two variants of the preconditioning methods. The low-rank expansion is computed by the Lanczos procedure with reorthogonalizations. Numerical experiments indicate that, when combined with Krylov subspace accelerators, this preconditioner can be efficient and robust for solving symmetric sparse linear systems. Comparisonsmore » with pARMS, a DD-based parallel incomplete LU (ILU) preconditioning method, are presented for solving Poisson's equation and linear elasticity problems.« less
Parallel Finite Element Domain Decomposition for Structural/Acoustic Analysis

NASA Technical Reports Server (NTRS)

Nguyen, Duc T.; Tungkahotara, Siroj; Watson, Willie R.; Rajan, Subramaniam D.

2005-01-01

A domain decomposition (DD) formulation for solving sparse linear systems of equations resulting from finite element analysis is presented. The formulation incorporates mixed direct and iterative equation solving strategics and other novel algorithmic ideas that are optimized to take advantage of sparsity and exploit modern computer architecture, such as memory and parallel computing. The most time consuming part of the formulation is identified and the critical roles of direct sparse and iterative solvers within the framework of the formulation are discussed. Experiments on several computer platforms using several complex test matrices are conducted using software based on the formulation. Small-scale structural examples are used to validate thc steps in the formulation and large-scale (l,000,000+ unknowns) duct acoustic examples are used to evaluate the ORIGIN 2000 processors, and a duster of 6 PCs (running under the Windows environment). Statistics show that the formulation is efficient in both sequential and parallel computing environmental and that the formulation is significantly faster and consumes less memory than that based on one of the best available commercialized parallel sparse solvers.
DOMAIN DECOMPOSITION METHOD APPLIED TO A FLOW PROBLEM Norberto C. Vera Guzmán Institute of Geophysics, UNAM

NASA Astrophysics Data System (ADS)

Vera, N. C.; GMMC

2013-05-01

In this paper we present the results of macrohybrid mixed Darcian flow in porous media in a general three-dimensional domain. The global problem is solved as a set of local subproblems which are posed using a domain decomposition method. Unknown fields of local problems, velocity and pressure are approximated using mixed finite elements. For this application, a general three-dimensional domain is considered which is discretized using tetrahedra. The discrete domain is decomposed into subdomains and reformulated the original problem as a set of subproblems, communicated through their interfaces. To solve this set of subproblems, we use finite element mixed and parallel computing. The parallelization of a problem using this methodology can, in principle, to fully exploit a computer equipment and also provides results in less time, two very important elements in modeling. Referencias G.Alduncin and N.Vera-Guzmán Parallel proximal-point algorithms for mixed _nite element models of _ow in the subsurface, Commun. Numer. Meth. Engng 2004; 20:83-104 (DOI: 10.1002/cnm.647) Z. Chen, G.Huan and Y. Ma Computational Methods for Multiphase Flows in Porous Media, SIAM, Society for Industrial and Applied Mathematics, Philadelphia, 2006. A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, Springer-Verlag, Berlin, 1994. Brezzi F, Fortin M. Mixed and Hybrid Finite Element Methods. Springer: New York, 1991.
Domain decomposition by the advancing-partition method for parallel unstructured grid generation

NASA Technical Reports Server (NTRS)

Banihashemi, legal representative, Soheila (Inventor); Pirzadeh, Shahyar Z. (Inventor)

2012-01-01

In a method for domain decomposition for generating unstructured grids, a surface mesh is generated for a spatial domain. A location of a partition plane dividing the domain into two sections is determined. Triangular faces on the surface mesh that intersect the partition plane are identified. A partition grid of tetrahedral cells, dividing the domain into two sub-domains, is generated using a marching process in which a front comprises only faces of new cells which intersect the partition plane. The partition grid is generated until no active faces remain on the front. Triangular faces on each side of the partition plane are collected into two separate subsets. Each subset of triangular faces is renumbered locally and a local/global mapping is created for each sub-domain. A volume grid is generated for each sub-domain. The partition grid and volume grids are then merged using the local-global mapping.
Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.; Nguyen, Duc T.; Baddourah, Majdi; Qin, Jiangning

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available.
The development of a scalable parallel 3-D CFD algorithm for turbomachinery. M.S. Thesis Final Report

NASA Technical Reports Server (NTRS)

Luke, Edward Allen

1993-01-01

Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Final Report, DE-FG01-06ER25718 Domain Decomposition and Parallel Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Widlund, Olof B.

2015-06-09

The goal of this project is to develop and improve domain decomposition algorithms for a variety of partial differential equations such as those of linear elasticity and electro-magnetics.These iterative methods are designed for massively parallel computing systems and allow the fast solution of the very large systems of algebraic equations that arise in large scale and complicated simulations. A special emphasis is placed on problems arising from Maxwell's equation. The approximate solvers, the preconditioners, are combined with the conjugate gradient method and must always include a solver of a coarse model in order to have a performance which is independentmore » of the number of processors used in the computer simulation. A recent development allows for an adaptive construction of this coarse component of the preconditioner.« less
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE PAGES

Wang, Bei; Ethier, Stephane; Tang, William; ...

2017-06-29

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability ofmore » the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.« less
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Bei; Ethier, Stephane; Tang, William

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability ofmore » the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.« less
Newton-Krylov-Schwarz: An implicit solver for CFD

NASA Technical Reports Server (NTRS)

Cai, Xiao-Chuan; Keyes, David E.; Venkatakrishnan, V.

1995-01-01

Newton-Krylov methods and Krylov-Schwarz (domain decomposition) methods have begun to become established in computational fluid dynamics (CFD) over the past decade. The former employ a Krylov method inside of Newton's method in a Jacobian-free manner, through directional differencing. The latter employ an overlapping Schwarz domain decomposition to derive a preconditioner for the Krylov accelerator that relies primarily on local information, for data-parallel concurrency. They may be composed as Newton-Krylov-Schwarz (NKS) methods, which seem particularly well suited for solving nonlinear elliptic systems in high-latency, distributed-memory environments. We give a brief description of this family of algorithms, with an emphasis on domain decomposition iterative aspects. We then describe numerical simulations with Newton-Krylov-Schwarz methods on aerodynamics applications emphasizing comparisons with a standard defect-correction approach, subdomain preconditioner consistency, subdomain preconditioner quality, and the effect of a coarse grid.
3-D modeling of ductile tearing using finite elements: Computational aspects and techniques

NASA Astrophysics Data System (ADS)

Gullerud, Arne Stewart

This research focuses on the development and application of computational tools to perform large-scale, 3-D modeling of ductile tearing in engineering components under quasi-static to mild loading rates. Two standard models for ductile tearing---the computational cell methodology and crack growth controlled by the crack tip opening angle (CTOA)---are described and their 3-D implementations are explored. For the computational cell methodology, quantification of the effects of several numerical issues---computational load step size, procedures for force release after cell deletion, and the porosity for cell deletion---enables construction of computational algorithms to remove the dependence of predicted crack growth on these issues. This work also describes two extensions of the CTOA approach into 3-D: a general 3-D method and a constant front technique. Analyses compare the characteristics of the extensions, and a validation study explores the ability of the constant front extension to predict crack growth in thin aluminum test specimens over a range of specimen geometries, absolutes sizes, and levels of out-of-plane constraint. To provide a computational framework suitable for the solution of these problems, this work also describes the parallel implementation of a nonlinear, implicit finite element code. The implementation employs an explicit message-passing approach using the MPI standard to maintain portability, a domain decomposition of element data to provide parallel execution, and a master-worker organization of the computational processes to enhance future extensibility. A linear preconditioned conjugate gradient (LPCG) solver serves as the core of the solution process. The parallel LPCG solver utilizes an element-by-element (EBE) structure of the computations to permit a dual-level decomposition of the element data: domain decomposition of the mesh provides efficient coarse-grain parallel execution, while decomposition of the domains into blocks of similar elements (same type, constitutive model, etc.) provides fine-grain parallel computation on each processor. A major focus of the LPCG solver is a new implementation of the Hughes-Winget element-by-element (HW) preconditioner. The implementation employs a weighted dependency graph combined with a new coloring algorithm to provide load-balanced scheduling for the preconditioner and overlapped communication/computation. This approach enables efficient parallel application of the HW preconditioner for arbitrary unstructured meshes.
GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations.

PubMed

Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji

2015-07-01

GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310-323. doi: 10.1002/wcms.1220.
AZTEC: A parallel iterative package for the solving linear systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

1996-12-31

We describe a parallel linear system package, AZTEC. The package incorporates a number of parallel iterative methods (e.g. GMRES, biCGSTAB, CGS, TFQMR) and preconditioners (e.g. Jacobi, Gauss-Seidel, polynomial, domain decomposition with LU or ILU within subdomains). Additionally, AZTEC allows for the reuse of previous preconditioning factorizations within Newton schemes for nonlinear methods. Currently, a number of different users are using this package to solve a variety of PDE applications.
Computational mechanics analysis tools for parallel-vector supercomputers

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Nguyen, D. T.; Baddourah, M. A.; Qin, J.

1993-01-01

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, are for the assembly of structural equations, 'out-of-core' strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization algorithm and domain decomposition. The source code for many of these algorithms is available from NASA Langley.
Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Lesoinne, Michel

1993-01-01

Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the 'divide and conquer' paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain decomposed solution procedure and parallel processing. The influence of the algorithmic aspects of a solution method (implicit/explicit computations), and the architectural specifics of a multiprocessor (SIMD/MIMD, startup/transmission time), on the design of a mesh partitioning algorithm are discussed. The impact of the partitioning strategy on load balancing, operation count, operator conditioning, rate of convergence and processor mapping is also addressed. Finally, the proposed mesh decomposition algorithms are demonstrated with realistic examples of finite element, finite volume, and finite difference meshes associated with the parallel solution of solid and fluid mechanics problems on the iPSC/2 and iPSC/860 multiprocessors.
Vectorial finite elements for solving the radiative transfer equation

NASA Astrophysics Data System (ADS)

Badri, M. A.; Jolivet, P.; Rousseau, B.; Le Corre, S.; Digonnet, H.; Favennec, Y.

2018-06-01

The discrete ordinate method coupled with the finite element method is often used for the spatio-angular discretization of the radiative transfer equation. In this paper we attempt to improve upon such a discretization technique. Instead of using standard finite elements, we reformulate the radiative transfer equation using vectorial finite elements. In comparison to standard finite elements, this reformulation yields faster timings for the linear system assemblies, as well as for the solution phase when using scattering media. The proposed vectorial finite element discretization for solving the radiative transfer equation is cross-validated against a benchmark problem available in literature. In addition, we have used the method of manufactured solutions to verify the order of accuracy for our discretization technique within different absorbing, scattering, and emitting media. For solving large problems of radiation on parallel computers, the vectorial finite element method is parallelized using domain decomposition. The proposed domain decomposition method scales on large number of processes, and its performance is unaffected by the changes in optical thickness of the medium. Our parallel solver is used to solve a large scale radiative transfer problem of the Kelvin-cell radiation.
A comparative study of serial and parallel aeroelastic computations of wings

NASA Technical Reports Server (NTRS)

Byun, Chansup; Guruswamy, Guru P.

1994-01-01

A procedure for computing the aeroelasticity of wings on parallel multiple-instruction, multiple-data (MIMD) computers is presented. In this procedure, fluids are modeled using Euler equations, and structures are modeled using modal or finite element equations. The procedure is designed in such a way that each discipline can be developed and maintained independently by using a domain decomposition approach. In the present parallel procedure, each computational domain is scalable. A parallel integration scheme is used to compute aeroelastic responses by solving fluid and structural equations concurrently. The computational efficiency issues of parallel integration of both fluid and structural equations are investigated in detail. This approach, which reduces the total computational time by a factor of almost 2, is demonstrated for a typical aeroelastic wing by using various numbers of processors on the Intel iPSC/860.

GENESIS: a hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations

PubMed Central

Jung, Jaewoon; Mori, Takaharu; Kobayashi, Chigusa; Matsunaga, Yasuhiro; Yoda, Takao; Feig, Michael; Sugita, Yuji

2015-01-01

GENESIS (Generalized-Ensemble Simulation System) is a new software package for molecular dynamics (MD) simulations of macromolecules. It has two MD simulators, called ATDYN and SPDYN. ATDYN is parallelized based on an atomic decomposition algorithm for the simulations of all-atom force-field models as well as coarse-grained Go-like models. SPDYN is highly parallelized based on a domain decomposition scheme, allowing large-scale MD simulations on supercomputers. Hybrid schemes combining OpenMP and MPI are used in both simulators to target modern multicore computer architectures. Key advantages of GENESIS are (1) the highly parallel performance of SPDYN for very large biological systems consisting of more than one million atoms and (2) the availability of various REMD algorithms (T-REMD, REUS, multi-dimensional REMD for both all-atom and Go-like models under the NVT, NPT, NPAT, and NPγT ensembles). The former is achieved by a combination of the midpoint cell method and the efficient three-dimensional Fast Fourier Transform algorithm, where the domain decomposition space is shared in real-space and reciprocal-space calculations. Other features in SPDYN, such as avoiding concurrent memory access, reducing communication times, and usage of parallel input/output files, also contribute to the performance. We show the REMD simulation results of a mixed (POPC/DMPC) lipid bilayer as a real application using GENESIS. GENESIS is released as free software under the GPLv2 licence and can be easily modified for the development of new algorithms and molecular models. WIREs Comput Mol Sci 2015, 5:310–323. doi: 10.1002/wcms.1220 PMID:26753008
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu

2012-10-01

We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decompositionmore » corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.« less
Airbreathing Propulsion System Analysis Using Multithreaded Parallel Processing

NASA Technical Reports Server (NTRS)

Schunk, Richard Gregory; Chung, T. J.; Rodriguez, Pete (Technical Monitor)

2000-01-01

In this paper, parallel processing is used to analyze the mixing, and combustion behavior of hypersonic flow. Preliminary work for a sonic transverse hydrogen jet injected from a slot into a Mach 4 airstream in a two-dimensional duct combustor has been completed [Moon and Chung, 1996]. Our aim is to extend this work to three-dimensional domain using multithreaded domain decomposition parallel processing based on the flowfield-dependent variation theory. Numerical simulations of chemically reacting flows are difficult because of the strong interactions between the turbulent hydrodynamic and chemical processes. The algorithm must provide an accurate representation of the flowfield, since unphysical flowfield calculations will lead to the faulty loss or creation of species mass fraction, or even premature ignition, which in turn alters the flowfield information. Another difficulty arises from the disparity in time scales between the flowfield and chemical reactions, which may require the use of finite rate chemistry. The situations are more complex when there is a disparity in length scales involved in turbulence. In order to cope with these complicated physical phenomena, it is our plan to utilize the flowfield-dependent variation theory mentioned above, facilitated by large eddy simulation. Undoubtedly, the proposed computation requires the most sophisticated computational strategies. The multithreaded domain decomposition parallel processing will be necessary in order to reduce both computational time and storage. Without special treatments involved in computer engineering, our attempt to analyze the airbreathing combustion appears to be difficult, if not impossible.
Parallel implementation of the particle simulation method with dynamic load balancing: Toward realistic geodynamical simulation

NASA Astrophysics Data System (ADS)

Furuichi, M.; Nishiura, D.

2015-12-01

Fully Lagrangian methods such as Smoothed Particle Hydrodynamics (SPH) and Discrete Element Method (DEM) have been widely used to solve the continuum and particles motions in the computational geodynamics field. These mesh-free methods are suitable for the problems with the complex geometry and boundary. In addition, their Lagrangian nature allows non-diffusive advection useful for tracking history dependent properties (e.g. rheology) of the material. These potential advantages over the mesh-based methods offer effective numerical applications to the geophysical flow and tectonic processes, which are for example, tsunami with free surface and floating body, magma intrusion with fracture of rock, and shear zone pattern generation of granular deformation. In order to investigate such geodynamical problems with the particle based methods, over millions to billion particles are required for the realistic simulation. Parallel computing is therefore important for handling such huge computational cost. An efficient parallel implementation of SPH and DEM methods is however known to be difficult especially for the distributed-memory architecture. Lagrangian methods inherently show workload imbalance problem for parallelization with the fixed domain in space, because particles move around and workloads change during the simulation. Therefore dynamic load balance is key technique to perform the large scale SPH and DEM simulation. In this work, we present the parallel implementation technique of SPH and DEM method utilizing dynamic load balancing algorithms toward the high resolution simulation over large domain using the massively parallel super computer system. Our method utilizes the imbalances of the executed time of each MPI process as the nonlinear term of parallel domain decomposition and minimizes them with the Newton like iteration method. In order to perform flexible domain decomposition in space, the slice-grid algorithm is used. Numerical tests show that our approach is suitable for solving the particles with different calculation costs (e.g. boundary particles) as well as the heterogeneous computer architecture. We analyze the parallel efficiency and scalability on the super computer systems (K-computer, Earth simulator 3, etc.).
Optimal mapping of irregular finite element domains to parallel processors

NASA Technical Reports Server (NTRS)

Flower, J.; Otto, S.; Salama, M.

1987-01-01

Mapping the solution domain of n-finite elements into N-subdomains that may be processed in parallel by N-processors is an optimal one if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
Parallel Computation of Flow in Heterogeneous Media Modelled by Mixed Finite Elements

NASA Astrophysics Data System (ADS)

Cliffe, K. A.; Graham, I. G.; Scheichl, R.; Stals, L.

2000-11-01

In this paper we describe a fast parallel method for solving highly ill-conditioned saddle-point systems arising from mixed finite element simulations of stochastic partial differential equations (PDEs) modelling flow in heterogeneous media. Each realisation of these stochastic PDEs requires the solution of the linear first-order velocity-pressure system comprising Darcy's law coupled with an incompressibility constraint. The chief difficulty is that the permeability may be highly variable, especially when the statistical model has a large variance and a small correlation length. For reasonable accuracy, the discretisation has to be extremely fine. We solve these problems by first reducing the saddle-point formulation to a symmetric positive definite (SPD) problem using a suitable basis for the space of divergence-free velocities. The reduced problem is solved using parallel conjugate gradients preconditioned with an algebraically determined additive Schwarz domain decomposition preconditioner. The result is a solver which exhibits a good degree of robustness with respect to the mesh size as well as to the variance and to physically relevant values of the correlation length of the underlying permeability field. Numerical experiments exhibit almost optimal levels of parallel efficiency. The domain decomposition solver (DOUG, http://www.maths.bath.ac.uk/~parsoft) used here not only is applicable to this problem but can be used to solve general unstructured finite element systems on a wide range of parallel architectures.
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier

1992-01-01

Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
Symposium on Parallel Computational Methods for Large-scale Structural Analysis and Design, 2nd, Norfolk, VA, US

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O. (Editor); Housner, Jerrold M. (Editor)

1993-01-01

Computing speed is leaping forward by several orders of magnitude each decade. Engineers and scientists gathered at a NASA Langley symposium to discuss these exciting trends as they apply to parallel computational methods for large-scale structural analysis and design. Among the topics discussed were: large-scale static analysis; dynamic, transient, and thermal analysis; domain decomposition (substructuring); and nonlinear and numerical methods.
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems

NASA Astrophysics Data System (ADS)

Ierotheou, C. S.; Forsey, C. R.; Leatham, M.

2000-04-01

The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations. Copyright
Efficient Delaunay Tessellation through K-D Tree Decomposition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morozov, Dmitriy; Peterka, Tom

Delaunay tessellations are fundamental data structures in computational geometry. They are important in data analysis, where they can represent the geometry of a point set or approximate its density. The algorithms for computing these tessellations at scale perform poorly when the input data is unbalanced. We investigate the use of k-d trees to evenly distribute points among processes and compare two strategies for picking split points between domain regions. Because resulting point distributions no longer satisfy the assumptions of existing parallel Delaunay algorithms, we develop a new parallel algorithm that adapts to its input and prove its correctness. We evaluatemore » the new algorithm using two late-stage cosmology datasets. The new running times are up to 50 times faster using k-d tree compared with regular grid decomposition. Moreover, in the unbalanced data sets, decomposing the domain into a k-d tree is up to five times faster than decomposing it into a regular grid.« less
Scalable Parallel Computation for Extended MHD Modeling of Fusion Plasmas

NASA Astrophysics Data System (ADS)

Glasser, Alan H.

2008-11-01

Parallel solution of a linear system is scalable if simultaneously doubling the number of dependent variables and the number of processors results in little or no increase in the computation time to solution. Two approaches have this property for parabolic systems: multigrid and domain decomposition. Since extended MHD is primarily a hyperbolic rather than a parabolic system, additional steps must be taken to parabolize the linear system to be solved by such a method. Such physics-based preconditioning (PBP) methods have been pioneered by Chac'on, using finite volumes for spatial discretization, multigrid for solution of the preconditioning equations, and matrix-free Newton-Krylov methods for the accurate solution of the full nonlinear preconditioned equations. The work described here is an extension of these methods using high-order spectral element methods and FETI-DP domain decomposition. Application of PBP to a flux-source representation of the physics equations is discussed. The resulting scalability will be demonstrated for simple wave and for ideal and Hall MHD waves.
Analysis of a parallelized nonlinear elliptic boundary value problem solver with application to reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.; Smooke, Mitchell D.

1987-01-01

A parallelized finite difference code based on the Newton method for systems of nonlinear elliptic boundary value problems in two dimensions is analyzed in terms of computational complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless parameters is derived for algorithms based on stripwise and boxwise decompositions of the domain and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the cost functions to the parameters is explored in regions of parameter space corresponding to model small-order systems with inexpensive function evaluations and also a coupled system of nineteen equations with very expensive function evaluations. The algorithm was implemented on the Intel Hypercube, and some experimental results for the model problems with stripwise decompositions are presented and compared with the theory. In the context of computational combustion problems, multiprocessors of either message-passing or shared-memory type may be employed with stripwise decompositions to realize speedup of O(n), where n is mesh resolution in one direction, for reasonable n.
Error analysis of multipoint flux domain decomposition methods for evolutionary diffusion problems

NASA Astrophysics Data System (ADS)

Arrarás, A.; Portero, L.; Yotov, I.

2014-01-01

We study space and time discretizations for mixed formulations of parabolic problems. The spatial approximation is based on the multipoint flux mixed finite element method, which reduces to an efficient cell-centered pressure system on general grids, including triangles, quadrilaterals, tetrahedra, and hexahedra. The time integration is performed by using a domain decomposition time-splitting technique combined with multiterm fractional step diagonally implicit Runge-Kutta methods. The resulting scheme is unconditionally stable and computationally efficient, as it reduces the global system to a collection of uncoupled subdomain problems that can be solved in parallel without the need for Schwarz-type iteration. Convergence analysis for both the semidiscrete and fully discrete schemes is presented.
Domain decomposition methods in computational fluid dynamics

NASA Technical Reports Server (NTRS)

Gropp, William D.; Keyes, David E.

1991-01-01

The divide-and-conquer paradigm of iterative domain decomposition, or substructuring, has become a practical tool in computational fluid dynamic applications because of its flexibility in accommodating adaptive refinement through locally uniform (or quasi-uniform) grids, its ability to exploit multiple discretizations of the operator equations, and the modular pathway it provides towards parallelism. These features are illustrated on the classic model problem of flow over a backstep using Newton's method as the nonlinear iteration. Multiple discretizations (second-order in the operator and first-order in the preconditioner) and locally uniform mesh refinement pay dividends separately, and they can be combined synergistically. Sample performance results are included from an Intel iPSC/860 hypercube implementation.
Full waveform time domain solutions for source and induced magnetotelluric and controlled-source electromagnetic fields using quasi-equivalent time domain decomposition and GPU parallelization

NASA Astrophysics Data System (ADS)

Imamura, N.; Schultz, A.

2015-12-01

Recently, a full waveform time domain solution has been developed for the magnetotelluric (MT) and controlled-source electromagnetic (CSEM) methods. The ultimate goal of this approach is to obtain a computationally tractable direct waveform joint inversion for source fields and earth conductivity structure in three and four dimensions. This is desirable on several grounds, including the improved spatial resolving power expected from use of a multitude of source illuminations of non-zero wavenumber, the ability to operate in areas of high levels of source signal spatial complexity and non-stationarity, etc. This goal would not be obtainable if one were to adopt the finite difference time-domain (FDTD) approach for the forward problem. This is particularly true for the case of MT surveys, since an enormous number of degrees of freedom are required to represent the observed MT waveforms across the large frequency bandwidth. It means that for FDTD simulation, the smallest time steps should be finer than that required to represent the highest frequency, while the number of time steps should also cover the lowest frequency. This leads to a linear system that is computationally burdensome to solve. We have implemented our code that addresses this situation through the use of a fictitious wave domain method and GPUs to speed up the computation time. We also substantially reduce the size of the linear systems by applying concepts from successive cascade decimation, through quasi-equivalent time domain decomposition. By combining these refinements, we have made good progress toward implementing the core of a full waveform joint source field/earth conductivity inverse modeling method. From results, we found the use of previous generation of CPU/GPU speeds computations by an order of magnitude over a parallel CPU only approach. In part, this arises from the use of the quasi-equivalent time domain decomposition, which shrinks the size of the linear system dramatically.
Porting LAMMPS to GPUs.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, William Michael; Plimpton, Steven James; Wang, Peng

2010-03-01

LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. LAMMPS has potentials for soft materials (biomolecules, polymers) and solid-state materials (metals, semiconductors) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial-decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.
Preconditioned implicit solvers for the Navier-Stokes equations on distributed-memory machines

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Liou, Meng-Sing; Dyson, Rodger W.

1994-01-01

The GMRES method is parallelized, and combined with local preconditioning to construct an implicit parallel solver to obtain steady-state solutions for the Navier-Stokes equations of fluid flow on distributed-memory machines. The new implicit parallel solver is designed to preserve the convergence rate of the equivalent 'serial' solver. A static domain-decomposition is used to partition the computational domain amongst the available processing nodes of the parallel machine. The SPMD (Single-Program Multiple-Data) programming model is combined with message-passing tools to develop the parallel code on a 32-node Intel Hypercube and a 512-node Intel Delta machine. The implicit parallel solver is validated for internal and external flow problems, and is found to compare identically with flow solutions obtained on a Cray Y-MP/8. A peak computational speed of 2300 MFlops/sec has been achieved on 512 nodes of the Intel Delta machine,k for a problem size of 1024 K equations (256 K grid points).
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
DVS-SOFTWARE: An Effective Tool for Applying Highly Parallelized Hardware To Computational Geophysics

NASA Astrophysics Data System (ADS)

Herrera, I.; Herrera, G. S.

2015-12-01

Most geophysical systems are macroscopic physical systems. The behavior prediction of such systems is carried out by means of computational models whose basic models are partial differential equations (PDEs) [1]. Due to the enormous size of the discretized version of such PDEs it is necessary to apply highly parallelized super-computers. For them, at present, the most efficient software is based on non-overlapping domain decomposition methods (DDM). However, a limiting feature of the present state-of-the-art techniques is due to the kind of discretizations used in them. Recently, I. Herrera and co-workers using 'non-overlapping discretizations' have produced the DVS-Software which overcomes this limitation [2]. The DVS-software can be applied to a great variety of geophysical problems and achieves very high parallel efficiencies (90%, or so [3]). It is therefore very suitable for effectively applying the most advanced parallel supercomputers available at present. In a parallel talk, in this AGU Fall Meeting, Graciela Herrera Z. will present how this software is being applied to advance MOD-FLOW. Key Words: Parallel Software for Geophysics, High Performance Computing, HPC, Parallel Computing, Domain Decomposition Methods (DDM)REFERENCES [1]. Herrera Ismael and George F. Pinder, Mathematical Modelling in Science and Engineering: An axiomatic approach", John Wiley, 243p., 2012. [2]. Herrera, I., de la Cruz L.M. and Rosas-Medina A. "Non Overlapping Discretization Methods for Partial, Differential Equations". NUMER METH PART D E, 30: 1427-1454, 2014, DOI 10.1002/num 21852. (Open source) [3]. Herrera, I., & Contreras Iván "An Innovative Tool for Effectively Applying Highly Parallelized Software To Problems of Elasticity". Geofísica Internacional, 2015 (In press)
Parallel Monte Carlo transport modeling in the context of a time-dependent, three-dimensional multi-physics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Procassini, R.J.

1997-12-31

The fine-scale, multi-space resolution that is envisioned for accurate simulations of complex weapons systems in three spatial dimensions implies flop-rate and memory-storage requirements that will only be obtained in the near future through the use of parallel computational techniques. Since the Monte Carlo transport models in these simulations usually stress both of these computational resources, they are prime candidates for parallelization. The MONACO Monte Carlo transport package, which is currently under development at LLNL, will utilize two types of parallelism within the context of a multi-physics design code: decomposition of the spatial domain across processors (spatial parallelism) and distribution ofmore » particles in a given spatial subdomain across additional processors (particle parallelism). This implementation of the package will utilize explicit data communication between domains (message passing). Such a parallel implementation of a Monte Carlo transport model will result in non-deterministic communication patterns. The communication of particles between subdomains during a Monte Carlo time step may require a significant level of effort to achieve a high parallel efficiency.« less

Parallel Adaptive Mesh Refinement Library

NASA Technical Reports Server (NTRS)

Mac-Neice, Peter; Olson, Kevin

2005-01-01

Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.
A communication library for the parallelization of air quality models on structured grids

NASA Astrophysics Data System (ADS)

Miehe, Philipp; Sandu, Adrian; Carmichael, Gregory R.; Tang, Youhua; Dăescu, Dacian

PAQMSG is an MPI-based, Fortran 90 communication library for the parallelization of air quality models (AQMs) on structured grids. It consists of distribution, gathering and repartitioning routines for different domain decompositions implementing a master-worker strategy. The library is architecture and application independent and includes optimization strategies for different architectures. This paper presents the library from a user perspective. Results are shown from the parallelization of STEM-III on Beowulf clusters. The PAQMSG library is available on the web. The communication routines are easy to use, and should allow for an immediate parallelization of existing AQMs. PAQMSG can also be used for constructing new models.
Parallel simulation of tsunami inundation on a large-scale supercomputer

NASA Astrophysics Data System (ADS)

Oishi, Y.; Imamura, F.; Sugawara, D.

2013-12-01

An accurate prediction of tsunami inundation is important for disaster mitigation purposes. One approach is to approximate the tsunami wave source through an instant inversion analysis using real-time observation data (e.g., Tsushima et al., 2009) and then use the resulting wave source data in an instant tsunami inundation simulation. However, a bottleneck of this approach is the large computational cost of the non-linear inundation simulation and the computational power of recent massively parallel supercomputers is helpful to enable faster than real-time execution of a tsunami inundation simulation. Parallel computers have become approximately 1000 times faster in 10 years (www.top500.org), and so it is expected that very fast parallel computers will be more and more prevalent in the near future. Therefore, it is important to investigate how to efficiently conduct a tsunami simulation on parallel computers. In this study, we are targeting very fast tsunami inundation simulations on the K computer, currently the fastest Japanese supercomputer, which has a theoretical peak performance of 11.2 PFLOPS. One computing node of the K computer consists of 1 CPU with 8 cores that share memory, and the nodes are connected through a high-performance torus-mesh network. The K computer is designed for distributed-memory parallel computation, so we have developed a parallel tsunami model. Our model is based on TUNAMI-N2 model of Tohoku University, which is based on a leap-frog finite difference method. A grid nesting scheme is employed to apply high-resolution grids only at the coastal regions. To balance the computation load of each CPU in the parallelization, CPUs are first allocated to each nested layer in proportion to the number of grid points of the nested layer. Using CPUs allocated to each layer, 1-D domain decomposition is performed on each layer. In the parallel computation, three types of communication are necessary: (1) communication to adjacent neighbours for the finite difference calculation, (2) communication between adjacent layers for the calculations to connect each layer, and (3) global communication to obtain the time step which satisfies the CFL condition in the whole domain. A preliminary test on the K computer showed the parallel efficiency on 1024 cores was 57% relative to 64 cores. We estimate that the parallel efficiency will be considerably improved by applying a 2-D domain decomposition instead of the present 1-D domain decomposition in future work. The present parallel tsunami model was applied to the 2011 Great Tohoku tsunami. The coarsest resolution layer covers a 758 km × 1155 km region with a 405 m grid spacing. A nesting of five layers was used with the resolution ratio of 1/3 between nested layers. The finest resolution region has 5 m resolution and covers most of the coastal region of Sendai city. To complete 2 hours of simulation time, the serial (non-parallel) computation took approximately 4 days on a workstation. To complete the same simulation on 1024 cores of the K computer, it took 45 minutes which is more than two times faster than real-time. This presentation discusses the updated parallel computational performance and the efficient use of the K computer when considering the characteristics of the tsunami inundation simulation model in relation to the characteristics and capabilities of the K computer.
Implementation and performance of FDPS: a framework for developing parallel particle simulation codes

NASA Astrophysics Data System (ADS)

Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro

2016-08-01

We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N2) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 107) to 300 ms (N = 109). These are currently limited by the time for the calculation of the domain decomposition and communication necessary for the interaction calculation. We discuss how we can overcome these bottlenecks.
Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++

NASA Technical Reports Server (NTRS)

Krishnan, Sanjeev; Bhandarkar, Milind; Kale, Laxmikant V.

1996-01-01

This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object-orientation provided by the Charm++ parallel object-oriented language, to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally we conclude with an evaluation of the methodology used.
Progress in the Simulation of Steady and Time-Dependent Flows with 3D Parallel Unstructured Cartesian Methods

NASA Technical Reports Server (NTRS)

Aftosmis, M. J.; Berger, M. J.; Murman, S. M.; Kwak, Dochan (Technical Monitor)

2002-01-01

The proposed paper will present recent extensions in the development of an efficient Euler solver for adaptively-refined Cartesian meshes with embedded boundaries. The paper will focus on extensions of the basic method to include solution adaptation, time-dependent flow simulation, and arbitrary rigid domain motion. The parallel multilevel method makes use of on-the-fly parallel domain decomposition to achieve extremely good scalability on large numbers of processors, and is coupled with an automatic coarse mesh generation algorithm for efficient processing by a multigrid smoother. Numerical results are presented demonstrating parallel speed-ups of up to 435 on 512 processors. Solution-based adaptation may be keyed off truncation error estimates using tau-extrapolation or a variety of feature detection based refinement parameters. The multigrid method is extended to for time-dependent flows through the use of a dual-time approach. The extension to rigid domain motion uses an Arbitrary Lagrangian-Eulerlarian (ALE) formulation, and results will be presented for a variety of two- and three-dimensional example problems with both simple and complex geometry.
Scalability and Portability of Two Parallel Implementations of ADI

NASA Technical Reports Server (NTRS)

Phung, Thanh; VanderWijngaart, Rob F.

1994-01-01

Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
Nuclide Depletion Capabilities in the Shift Monte Carlo Code

DOE PAGES

Davidson, Gregory G.; Pandya, Tara M.; Johnson, Seth R.; ...

2017-12-21

A new depletion capability has been developed in the Exnihilo radiation transport code suite. This capability enables massively parallel domain-decomposed coupling between the Shift continuous-energy Monte Carlo solver and the nuclide depletion solvers in ORIGEN to perform high-performance Monte Carlo depletion calculations. This paper describes this new depletion capability and discusses its various features, including a multi-level parallel decomposition, high-order transport-depletion coupling, and energy-integrated power renormalization. Several test problems are presented to validate the new capability against other Monte Carlo depletion codes, and the parallel performance of the new capability is analyzed.
A parallel algorithm for nonlinear convection-diffusion equations

NASA Technical Reports Server (NTRS)

Scroggs, Jeffrey S.

1990-01-01

A parallel algorithm for the efficient solution of nonlinear time-dependent convection-diffusion equations with small parameter on the diffusion term is presented. The method is based on a physically motivated domain decomposition that is dictated by singular perturbation analysis. The analysis is used to determine regions where certain reduced equations may be solved in place of the full equation. The method is suitable for the solution of problems arising in the simulation of fluid dynamics. Experimental results for a nonlinear equation in two-dimensions are presented.
High performance Python for direct numerical simulations of turbulent flows

NASA Astrophysics Data System (ADS)

Mortensen, Mikael; Langtangen, Hans Petter

2016-06-01

Direct Numerical Simulations (DNS) of the Navier Stokes equations is an invaluable research tool in fluid dynamics. Still, there are few publicly available research codes and, due to the heavy number crunching implied, available codes are usually written in low-level languages such as C/C++ or Fortran. In this paper we describe a pure scientific Python pseudo-spectral DNS code that nearly matches the performance of C++ for thousands of processors and billions of unknowns. We also describe a version optimized through Cython, that is found to match the speed of C++. The solvers are written from scratch in Python, both the mesh, the MPI domain decomposition, and the temporal integrators. The solvers have been verified and benchmarked on the Shaheen supercomputer at the KAUST supercomputing laboratory, and we are able to show very good scaling up to several thousand cores. A very important part of the implementation is the mesh decomposition (we implement both slab and pencil decompositions) and 3D parallel Fast Fourier Transforms (FFT). The mesh decomposition and FFT routines have been implemented in Python using serial FFT routines (either NumPy, pyFFTW or any other serial FFT module), NumPy array manipulations and with MPI communications handled by MPI for Python (mpi4py). We show how we are able to execute a 3D parallel FFT in Python for a slab mesh decomposition using 4 lines of compact Python code, for which the parallel performance on Shaheen is found to be slightly better than similar routines provided through the FFTW library. For a pencil mesh decomposition 7 lines of code is required to execute a transform.
Fast non-overlapping Schwarz domain decomposition methods for solving the neutron diffusion equation

NASA Astrophysics Data System (ADS)

Jamelot, Erell; Ciarlet, Patrick

2013-05-01

Studying numerically the steady state of a nuclear core reactor is expensive, in terms of memory storage and computational time. In order to address both requirements, one can use a domain decomposition method, implemented on a parallel computer. We present here such a method for the mixed neutron diffusion equations, discretized with Raviart-Thomas-Nédélec finite elements. This method is based on the Schwarz iterative algorithm with Robin interface conditions to handle communications. We analyse this method from the continuous point of view to the discrete point of view, and we give some numerical results in a realistic highly heterogeneous 3D configuration. Computations are carried out with the MINOS solver of the APOLLO3® neutronics code. APOLLO3 is a registered trademark in France.
A hybrid algorithm for parallel molecular dynamics simulations

NASA Astrophysics Data System (ADS)

Mangiardi, Chris M.; Meyer, R.

2017-10-01

This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Domain decomposition method for the Baltic Sea based on theory of adjoint equation and inverse problem.

NASA Astrophysics Data System (ADS)

Lezina, Natalya; Agoshkov, Valery

2017-04-01

Domain decomposition method (DDM) allows one to present a domain with complex geometry as a set of essentially simpler subdomains. This method is particularly applied for the hydrodynamics of oceans and seas. In each subdomain the system of thermo-hydrodynamic equations in the Boussinesq and hydrostatic approximations is solved. The problem of obtaining solution in the whole domain is that it is necessary to combine solutions in subdomains. For this purposes iterative algorithm is created and numerical experiments are conducted to investigate an effectiveness of developed algorithm using DDM. For symmetric operators in DDM, Poincare-Steklov's operators [1] are used, but for the problems of the hydrodynamics, it is not suitable. In this case for the problem, adjoint equation method [2] and inverse problem theory are used. In addition, it is possible to create algorithms for the parallel calculations using DDM on multiprocessor computer system. DDM for the model of the Baltic Sea dynamics is numerically studied. The results of numerical experiments using DDM are compared with the solution of the system of hydrodynamic equations in the whole domain. The work was supported by the Russian Science Foundation (project 14-11-00609, the formulation of the iterative process and numerical experiments). [1] V.I. Agoshkov, Domain Decompositions Methods in the Mathematical Physics Problem // Numerical processes and systems, No 8, Moscow, 1991 (in Russian). [2] V.I. Agoshkov, Optimal Control Approaches and Adjoint Equations in the Mathematical Physics Problem, Institute of Numerical Mathematics, RAS, Moscow, 2003 (in Russian).
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
Terascale Optimal PDE Simulations (TOPS) Center

DOE Office of Scientific and Technical Information (OSTI.GOV)

Professor Olof B. Widlund

2007-07-09

Our work has focused on the development and analysis of domain decomposition algorithms for a variety of problems arising in continuum mechanics modeling. In particular, we have extended and analyzed FETI-DP and BDDC algorithms; these iterative solvers were first introduced and studied by Charbel Farhat and his collaborators, see [11, 45, 12], and by Clark Dohrmann of SANDIA, Albuquerque, see [43, 2, 1], respectively. These two closely related families of methods are of particular interest since they are used more extensively than other iterative substructuring methods to solve very large and difficult problems. Thus, the FETI algorithms are part ofmore » the SALINAS system developed by the SANDIA National Laboratories for very large scale computations, and as already noted, BDDC was first developed by a SANDIA scientist, Dr. Clark Dohrmann. The FETI algorithms are also making inroads in commercial engineering software systems. We also note that the analysis of these algorithms poses very real mathematical challenges. The success in developing this theory has, in several instances, led to significant improvements in the performance of these algorithms. A very desirable feature of these iterative substructuring and other domain decomposition algorithms is that they respect the memory hierarchy of modern parallel and distributed computing systems, which is essential for approaching peak floating point performance. The development of improved methods, together with more powerful computer systems, is making it possible to carry out simulations in three dimensions, with quite high resolution, relatively easily. This work is supported by high quality software systems, such as Argonne's PETSc library, which facilitates code development as well as the access to a variety of parallel and distributed computer systems. The success in finding scalable and robust domain decomposition algorithms for very large number of processors and very large finite element problems is, e.g., illustrated in [24, 25, 26]. This work is based on [29, 31]. Our work over these five and half years has, in our opinion, helped advance the knowledge of domain decomposition methods significantly. We see these methods as providing valuable alternatives to other iterative methods, in particular, those based on multi-grid. In our opinion, our accomplishments also match the goals of the TOPS project quite closely.« less
Efficient relaxed-Jacobi smoothers for multigrid on parallel computers

NASA Astrophysics Data System (ADS)

Yang, Xiang; Mittal, Rajat

2017-03-01

In this Technical Note, we present a family of Jacobi-based multigrid smoothers suitable for the solution of discretized elliptic equations. These smoothers are based on the idea of scheduled-relaxation Jacobi proposed recently by Yang & Mittal (2014) [18] and employ two or three successive relaxed Jacobi iterations with relaxation factors derived so as to maximize the smoothing property of these iterations. The performance of these new smoothers measured in terms of convergence acceleration and computational workload, is assessed for multi-domain implementations typical of parallelized solvers, and compared to the lexicographic point Gauss-Seidel smoother. The tests include the geometric multigrid method on structured grids as well as the algebraic grid method on unstructured grids. The tests demonstrate that unlike Gauss-Seidel, the convergence of these Jacobi-based smoothers is unaffected by domain decomposition, and furthermore, they outperform the lexicographic Gauss-Seidel by factors that increase with domain partition count.
Regularization of nonlinear decomposition of spectral x-ray projection images.

PubMed

Ducros, Nicolas; Abascal, Juan Felipe Perez-Juste; Sixou, Bruno; Rit, Simon; Peyrin, Françoise

2017-09-01

Exploiting the x-ray measurements obtained in different energy bins, spectral computed tomography (CT) has the ability to recover the 3-D description of a patient in a material basis. This may be achieved solving two subproblems, namely the material decomposition and the tomographic reconstruction problems. In this work, we address the material decomposition of spectral x-ray projection images, which is a nonlinear ill-posed problem. Our main contribution is to introduce a material-dependent spatial regularization in the projection domain. The decomposition problem is solved iteratively using a Gauss-Newton algorithm that can benefit from fast linear solvers. A Matlab implementation is available online. The proposed regularized weighted least squares Gauss-Newton algorithm (RWLS-GN) is validated on numerical simulations of a thorax phantom made of up to five materials (soft tissue, bone, lung, adipose tissue, and gadolinium), which is scanned with a 120 kV source and imaged by a 4-bin photon counting detector. To evaluate the method performance of our algorithm, different scenarios are created by varying the number of incident photons, the concentration of the marker and the configuration of the phantom. The RWLS-GN method is compared to the reference maximum likelihood Nelder-Mead algorithm (ML-NM). The convergence of the proposed method and its dependence on the regularization parameter are also studied. We show that material decomposition is feasible with the proposed method and that it converges in few iterations. Material decomposition with ML-NM was very sensitive to noise, leading to decomposed images highly affected by noise, and artifacts even for the best case scenario. The proposed method was less sensitive to noise and improved contrast-to-noise ratio of the gadolinium image. Results were superior to those provided by ML-NM in terms of image quality and decomposition was 70 times faster. For the assessed experiments, material decomposition was possible with the proposed method when the number of incident photons was equal or larger than 10 5 and when the marker concentration was equal or larger than 0.03 g·cm -3 . The proposed method efficiently solves the nonlinear decomposition problem for spectral CT, which opens up new possibilities such as material-specific regularization in the projection domain and a parallelization framework, in which projections are solved in parallel. © 2017 American Association of Physicists in Medicine.
Domain decomposition algorithms and computation fluid dynamics

NASA Technical Reports Server (NTRS)

Chan, Tony F.

1988-01-01

In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.
Suppressing correlations in massively parallel simulations of lattice models

NASA Astrophysics Data System (ADS)

Kelling, Jeffrey; Ódor, Géza; Gemming, Sibylle

2017-11-01

For lattice Monte Carlo simulations parallelization is crucial to make studies of large systems and long simulation time feasible, while sequential simulations remain the gold-standard for correlation-free dynamics. Here, various domain decomposition schemes are compared, concluding with one which delivers virtually correlation-free simulations on GPUs. Extensive simulations of the octahedron model for 2 + 1 dimensional Kardar-Parisi-Zhang surface growth, which is very sensitive to correlation in the site-selection dynamics, were performed to show self-consistency of the parallel runs and agreement with the sequential algorithm. We present a GPU implementation providing a speedup of about 30 × over a parallel CPU implementation on a single socket and at least 180 × with respect to the sequential reference.
A spectral analysis of the domain decomposed Monte Carlo method for linear systems

DOE PAGES

Slattery, Stuart R.; Evans, Thomas M.; Wilson, Paul P. H.

2015-09-08

The domain decomposed behavior of the adjoint Neumann-Ulam Monte Carlo method for solving linear systems is analyzed using the spectral properties of the linear oper- ator. Relationships for the average length of the adjoint random walks, a measure of convergence speed and serial performance, are made with respect to the eigenvalues of the linear operator. In addition, relationships for the effective optical thickness of a domain in the decomposition are presented based on the spectral analysis and diffusion theory. Using the effective optical thickness, the Wigner rational approxi- mation and the mean chord approximation are applied to estimate the leakagemore » frac- tion of random walks from a domain in the decomposition as a measure of parallel performance and potential communication costs. The one-speed, two-dimensional neutron diffusion equation is used as a model problem in numerical experiments to test the models for symmetric operators with spectral qualities similar to light water reactor problems. We find, in general, the derived approximations show good agreement with random walk lengths and leakage fractions computed by the numerical experiments.« less

Evaluating the performance of the particle finite element method in parallel architectures

NASA Astrophysics Data System (ADS)

Gimenez, Juan M.; Nigro, Norberto M.; Idelsohn, Sergio R.

2014-05-01

This paper presents a high performance implementation for the particle-mesh based method called particle finite element method two (PFEM-2). It consists of a material derivative based formulation of the equations with a hybrid spatial discretization which uses an Eulerian mesh and Lagrangian particles. The main aim of PFEM-2 is to solve transport equations as fast as possible keeping some level of accuracy. The method was found to be competitive with classical Eulerian alternatives for these targets, even in their range of optimal application. To evaluate the goodness of the method with large simulations, it is imperative to use of parallel environments. Parallel strategies for Finite Element Method have been widely studied and many libraries can be used to solve Eulerian stages of PFEM-2. However, Lagrangian stages, such as streamline integration, must be developed considering the parallel strategy selected. The main drawback of PFEM-2 is the large amount of memory needed, which limits its application to large problems with only one computer. Therefore, a distributed-memory implementation is urgently needed. Unlike a shared-memory approach, using domain decomposition the memory is automatically isolated, thus avoiding race conditions; however new issues appear due to data distribution over the processes. Thus, a domain decomposition strategy for both particle and mesh is adopted, which minimizes the communication between processes. Finally, performance analysis running over multicore and multinode architectures are presented. The Courant-Friedrichs-Lewy number used influences the efficiency of the parallelization and, in some cases, a weighted partitioning can be used to improve the speed-up. However the total cputime for cases presented is lower than that obtained when using classical Eulerian strategies.
On some Aitken-like acceleration of the Schwarz method

NASA Astrophysics Data System (ADS)

Garbey, M.; Tromeur-Dervout, D.

2002-12-01

In this paper we present a family of domain decomposition based on Aitken-like acceleration of the Schwarz method seen as an iterative procedure with a linear rate of convergence. We first present the so-called Aitken-Schwarz procedure for linear differential operators. The solver can be a direct solver when applied to the Helmholtz problem with five-point finite difference scheme on regular grids. We then introduce the Steffensen-Schwarz variant which is an iterative domain decomposition solver that can be applied to linear and nonlinear problems. We show that these solvers have reasonable numerical efficiency compared to classical fast solvers for the Poisson problem or multigrids for more general linear and nonlinear elliptic problems. However, the salient feature of our method is that our algorithm has high tolerance to slow network in the context of distributed parallel computing and is attractive, generally speaking, to use with computer architecture for which performance is limited by the memory bandwidth rather than the flop performance of the CPU. This is nowadays the case for most parallel. computer using the RISC processor architecture. We will illustrate this highly desirable property of our algorithm with large-scale computing experiments.
A Domain-Decomposed Multilevel Method for Adaptively Refined Cartesian Grids with Embedded Boundaries

NASA Technical Reports Server (NTRS)

Aftosmis, M. J.; Berger, M. J.; Adomavicius, G.

2000-01-01

Preliminary verification and validation of an efficient Euler solver for adaptively refined Cartesian meshes with embedded boundaries is presented. The parallel, multilevel method makes use of a new on-the-fly parallel domain decomposition strategy based upon the use of space-filling curves, and automatically generates a sequence of coarse meshes for processing by the multigrid smoother. The coarse mesh generation algorithm produces grids which completely cover the computational domain at every level in the mesh hierarchy. A series of examples on realistically complex three-dimensional configurations demonstrate that this new coarsening algorithm reliably achieves mesh coarsening ratios in excess of 7 on adaptively refined meshes. Numerical investigations of the scheme's local truncation error demonstrate an achieved order of accuracy between 1.82 and 1.88. Convergence results for the multigrid scheme are presented for both subsonic and transonic test cases and demonstrate W-cycle multigrid convergence rates between 0.84 and 0.94. Preliminary parallel scalability tests on both simple wing and complex complete aircraft geometries shows a computational speedup of 52 on 64 processors using the run-time mesh partitioner.
Characterization of cancer and normal tissue fluorescence through wavelet transform and singular value decomposition

NASA Astrophysics Data System (ADS)

Gharekhan, Anita H.; Biswal, Nrusingh C.; Gupta, Sharad; Pradhan, Asima; Sureshkumar, M. B.; Panigrahi, Prasanta K.

2008-02-01

The statistical and characteristic features of the polarized fluorescence spectra from cancer, normal and benign human breast tissues are studied through wavelet transform and singular value decomposition. The discrete wavelets enabled one to isolate high and low frequency spectral fluctuations, which revealed substantial randomization in the cancerous tissues, not present in the normal cases. In particular, the fluctuations fitted well with a Gaussian distribution for the cancerous tissues in the perpendicular component. One finds non-Gaussian behavior for normal and benign tissues' spectral variations. The study of the difference of intensities in parallel and perpendicular channels, which is free from the diffusive component, revealed weak fluorescence activity in the 630nm domain, for the cancerous tissues. This may be ascribable to porphyrin emission. The role of both scatterers and fluorophores in the observed minor intensity peak for the cancer case is experimentally confirmed through tissue-phantom experiments. Continuous Morlet wavelet also highlighted this domain for the cancerous tissue fluorescence spectra. Correlation in the spectral fluctuation is further studied in different tissue types through singular value decomposition. Apart from identifying different domains of spectral activity for diseased and non-diseased tissues, we found random matrix support for the spectral fluctuations. The small eigenvalues of the perpendicular polarized fluorescence spectra of cancerous tissues fitted remarkably well with random matrix prediction for Gaussian random variables, confirming our observations about spectral fluctuations in the wavelet domain.
Repartitioning Strategies for Massively Parallel Simulation of Reacting Flow

NASA Astrophysics Data System (ADS)

Pisciuneri, Patrick; Zheng, Angen; Givi, Peyman; Labrinidis, Alexandros; Chrysanthis, Panos

2015-11-01

The majority of parallel CFD simulators partition the domain into equal regions and assign the calculations for a particular region to a unique processor. This type of domain decomposition is vital to the efficiency of the solver. However, as the simulation develops, the workload among the partitions often become uneven (e.g. by adaptive mesh refinement, or chemically reacting regions) and a new partition should be considered. The process of repartitioning adjusts the current partition to evenly distribute the load again. We compare two repartitioning tools: Zoltan, an architecture-agnostic graph repartitioner developed at the Sandia National Laboratories; and Paragon, an architecture-aware graph repartitioner developed at the University of Pittsburgh. The comparative assessment is conducted via simulation of the Taylor-Green vortex flow with chemical reaction.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE PAGES

Abraham, Mark James; Murtola, Teemu; Schulz, Roland; ...

2015-07-15

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights, through several new and enhanced parallelization algorithms. This work on every level; SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Abraham, Mark James; Murtola, Teemu; Schulz, Roland

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights, through several new and enhanced parallelization algorithms. This work on every level; SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. Finally, the latest best-in-class compressed trajectory storage format is supported.
Efficient implementation of a 3-dimensional ADI method on the iPSC/860

DOE Office of Scientific and Technical Information (OSTI.GOV)

Van der Wijngaart, R.F.

1993-12-31

A comparison is made between several domain decomposition strategies for the solution of three-dimensional partial differential equations on a MIMD distributed memory parallel computer. The grids used are structured, and the numerical algorithm is ADI. Important implementation issues regarding load balancing, storage requirements, network latency, and overlap of computations and communications are discussed. Results of the solution of the three-dimensional heat equation on the Intel iPSC/860 are presented for the three most viable methods. It is found that the Bruno-Cappello decomposition delivers optimal computational speed through an almost complete elimination of processor idle time, while providing good memory efficiency.
PARAMESH: A Parallel Adaptive Mesh Refinement Community Toolkit

NASA Technical Reports Server (NTRS)

MacNeice, Peter; Olson, Kevin M.; Mobarry, Clark; deFainchtein, Rosalinda; Packer, Charles

1999-01-01

In this paper, we describe a community toolkit which is designed to provide parallel support with adaptive mesh capability for a large and important class of computational models, those using structured, logically cartesian meshes. The package of Fortran 90 subroutines, called PARAMESH, is designed to provide an application developer with an easy route to extend an existing serial code which uses a logically cartesian structured mesh into a parallel code with adaptive mesh refinement. Alternatively, in its simplest use, and with minimal effort, it can operate as a domain decomposition tool for users who want to parallelize their serial codes, but who do not wish to use adaptivity. The package can provide them with an incremental evolutionary path for their code, converting it first to uniformly refined parallel code, and then later if they so desire, adding adaptivity.
By Hand or Not By-Hand: A Case Study of Alternative Approaches to Parallelize CFD Applications

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Bailey, David (Technical Monitor)

1997-01-01

While parallel processing promises to speed up applications by several orders of magnitude, the performance achieved still depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. The existence of the Gorden Bell Prize given out at Supercomputing every year suggests that while good performance can be attained for real applications on general purpose multiprocessors, the large investment in man-power and time still has to be repeated for each application-machine combination. As applications and machine architectures become more complex, the cost and time-delays for obtaining performance by hand will become prohibitive. Computer users today can turn to three possible avenues for help: parallel libraries, parallel languages and compilers, interactive parallelization tools. The success of these methodologies, in turn, depends on proper application of data dependency analysis, program structure recognition and transformation, performance prediction as well as exploitation of user supplied knowledge. NASA has been developing multidisciplinary applications on highly parallel architectures under the High Performance Computing and Communications Program. Over the past six years, the transition of underlying hardware and system software have forced the scientists to spend a large effort to migrate and recede their applications. Various attempts to exploit software tools to automate the parallelization process have not produced favorable results. In this paper, we report our most recent experience with CAPTOOL, a package developed at Greenwich University. We have chosen CAPTOOL for three reasons: 1. CAPTOOL accepts a FORTRAN 77 program as input. This suggests its potential applicability to a large collection of legacy codes currently in use. 2. CAPTOOL employs domain decomposition to obtain parallelism. Although the fact that not all kinds of parallelism are handled may seem unappealing, many NASA applications in computational aerosciences as well as earth and space sciences are amenable to domain decomposition. 3. CAPTOOL generates code for a large variety of environments employed across NASA centers: MPI/PVM on network of workstations to the IBS/SP2 and CRAY/T3D.
A resilient domain decomposition polynomial chaos solver for uncertain elliptic PDEs

NASA Astrophysics Data System (ADS)

Mycek, Paul; Contreras, Andres; Le Maître, Olivier; Sargsyan, Khachik; Rizzi, Francesco; Morris, Karla; Safta, Cosmin; Debusschere, Bert; Knio, Omar

2017-07-01

A resilient method is developed for the solution of uncertain elliptic PDEs on extreme scale platforms. The method is based on a hybrid domain decomposition, polynomial chaos (PC) framework that is designed to address soft faults. Specifically, parallel and independent solves of multiple deterministic local problems are used to define PC representations of local Dirichlet boundary-to-boundary maps that are used to reconstruct the global solution. A LAD-lasso type regression is developed for this purpose. The performance of the resulting algorithm is tested on an elliptic equation with an uncertain diffusivity field. Different test cases are considered in order to analyze the impacts of correlation structure of the uncertain diffusivity field, the stochastic resolution, as well as the probability of soft faults. In particular, the computations demonstrate that, provided sufficiently many samples are generated, the method effectively overcomes the occurrence of soft faults.
Domain decomposition preconditioners for the spectral collocation method

NASA Technical Reports Server (NTRS)

Quarteroni, Alfio; Sacchilandriani, Giovanni

1988-01-01

Several block iteration preconditioners are proposed and analyzed for the solution of elliptic problems by spectral collocation methods in a region partitioned into several rectangles. It is shown that convergence is achieved with a rate which does not depend on the polynomial degree of the spectral solution. The iterative methods here presented can be effectively implemented on multiprocessor systems due to their high degree of parallelism.
Multiscale Simulations of Magnetic Island Coalescence

NASA Technical Reports Server (NTRS)

Dorelli, John C.

2010-01-01

We describe a new interactive parallel Adaptive Mesh Refinement (AMR) framework written in the Python programming language. This new framework, PyAMR, hides the details of parallel AMR data structures and algorithms (e.g., domain decomposition, grid partition, and inter-process communication), allowing the user to focus on the development of algorithms for advancing the solution of a systems of partial differential equations on a single uniform mesh. We demonstrate the use of PyAMR by simulating the pairwise coalescence of magnetic islands using the resistive Hall MHD equations. Techniques for coupling different physics models on different levels of the AMR grid hierarchy are discussed.
Parallel DSMC Solution of Three-Dimensional Flow Over a Finite Flat Plate

NASA Technical Reports Server (NTRS)

Nance, Robert P.; Wilmoth, Richard G.; Moon, Bongki; Hassan, H. A.; Saltz, Joel

1994-01-01

This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.
PARAVT: Parallel Voronoi tessellation code

NASA Astrophysics Data System (ADS)

González, R. E.

2016-10-01

In this study, we present a new open source code for massive parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.
Reactor Dosimetry Applications Using RAPTOR-M3G:. a New Parallel 3-D Radiation Transport Code

NASA Astrophysics Data System (ADS)

Longoni, Gianluca; Anderson, Stanwood L.

2009-08-01

The numerical solution of the Linearized Boltzmann Equation (LBE) via the Discrete Ordinates method (SN) requires extensive computational resources for large 3-D neutron and gamma transport applications due to the concurrent discretization of the angular, spatial, and energy domains. This paper will discuss the development RAPTOR-M3G (RApid Parallel Transport Of Radiation - Multiple 3D Geometries), a new 3-D parallel radiation transport code, and its application to the calculation of ex-vessel neutron dosimetry responses in the cavity of a commercial 2-loop Pressurized Water Reactor (PWR). RAPTOR-M3G is based domain decomposition algorithms, where the spatial and angular domains are allocated and processed on multi-processor computer architectures. As compared to traditional single-processor applications, this approach reduces the computational load as well as the memory requirement per processor, yielding an efficient solution methodology for large 3-D problems. Measured neutron dosimetry responses in the reactor cavity air gap will be compared to the RAPTOR-M3G predictions. This paper is organized as follows: Section 1 discusses the RAPTOR-M3G methodology; Section 2 describes the 2-loop PWR model and the numerical results obtained. Section 3 addresses the parallel performance of the code, and Section 4 concludes this paper with final remarks and future work.
Time-domain seismic modeling in viscoelastic media for full waveform inversion on heterogeneous computing platforms with OpenCL

NASA Astrophysics Data System (ADS)

Fabien-Ouellet, Gabriel; Gloaguen, Erwan; Giroux, Bernard

2017-03-01

Full Waveform Inversion (FWI) aims at recovering the elastic parameters of the Earth by matching recordings of the ground motion with the direct solution of the wave equation. Modeling the wave propagation for realistic scenarios is computationally intensive, which limits the applicability of FWI. The current hardware evolution brings increasing parallel computing power that can speed up the computations in FWI. However, to take advantage of the diversity of parallel architectures presently available, new programming approaches are required. In this work, we explore the use of OpenCL to develop a portable code that can take advantage of the many parallel processor architectures now available. We present a program called SeisCL for 2D and 3D viscoelastic FWI in the time domain. The code computes the forward and adjoint wavefields using finite-difference and outputs the gradient of the misfit function given by the adjoint state method. To demonstrate the code portability on different architectures, the performance of SeisCL is tested on three different devices: Intel CPUs, NVidia GPUs and Intel Xeon PHI. Results show that the use of GPUs with OpenCL can speed up the computations by nearly two orders of magnitudes over a single threaded application on the CPU. Although OpenCL allows code portability, we show that some device-specific optimization is still required to get the best performance out of a specific architecture. Using OpenCL in conjunction with MPI allows the domain decomposition of large models on several devices located on different nodes of a cluster. For large enough models, the speedup of the domain decomposition varies quasi-linearly with the number of devices. Finally, we investigate two different approaches to compute the gradient by the adjoint state method and show the significant advantages of using OpenCL for FWI.
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.

PubMed

Hess, Berk; Kutzner, Carsten; van der Spoel, David; Lindahl, Erik

2008-03-01

Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of our molecular simulation toolkit GROMACS which now both achieves extremely high performance on single processors from algorithmic optimizations and hand-coded routines and simultaneously scales very well on parallel machines. The code encompasses a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithms enable extremely long simulations of large systems but also it provides that simulation performance on quite modest numbers of standard cluster nodes.
Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

NASA Astrophysics Data System (ADS)

Schultz, A.

2010-12-01

3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
High performance computation of radiative transfer equation using the finite element method

NASA Astrophysics Data System (ADS)

Badri, M. A.; Jolivet, P.; Rousseau, B.; Favennec, Y.

2018-05-01

This article deals with an efficient strategy for numerically simulating radiative transfer phenomena using distributed computing. The finite element method alongside the discrete ordinate method is used for spatio-angular discretization of the monochromatic steady-state radiative transfer equation in an anisotropically scattering media. Two very different methods of parallelization, angular and spatial decomposition methods, are presented. To do so, the finite element method is used in a vectorial way. A detailed comparison of scalability, performance, and efficiency on thousands of processors is established for two- and three-dimensional heterogeneous test cases. Timings show that both algorithms scale well when using proper preconditioners. It is also observed that our angular decomposition scheme outperforms our domain decomposition method. Overall, we perform numerical simulations at scales that were previously unattainable by standard radiative transfer equation solvers.

Parallel computing on Unix workstation arrays

NASA Astrophysics Data System (ADS)

Reale, F.; Bocchino, F.; Sciortino, S.

1994-12-01

We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by exucuting it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.
High-performance parallel analysis of coupled problems for aircraft propulsion

NASA Technical Reports Server (NTRS)

Felippa, C. A.; Farhat, C.; Lanteri, S.; Gumaste, U.; Ronaghi, M.

1994-01-01

Applications are described of high-performance parallel, computation for the analysis of complete jet engines, considering its multi-discipline coupled problem. The coupled problem involves interaction of structures with gas dynamics, heat conduction and heat transfer in aircraft engines. The methodology issues addressed include: consistent discrete formulation of coupled problems with emphasis on coupling phenomena; effect of partitioning strategies, augmentation and temporal solution procedures; sensitivity of response to problem parameters; and methods for interfacing multiscale discretizations in different single fields. The computer implementation issues addressed include: parallel treatment of coupled systems; domain decomposition and mesh partitioning strategies; data representation in object-oriented form and mapping to hardware driven representation, and tradeoff studies between partitioning schemes and fully coupled treatment.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

NASA Technical Reports Server (NTRS)

OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

1998-01-01

This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

NASA Astrophysics Data System (ADS)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Efficient parallelization of analytic bond-order potentials for large-scale atomistic simulations

NASA Astrophysics Data System (ADS)

Teijeiro, C.; Hammerschmidt, T.; Drautz, R.; Sutmann, G.

2016-07-01

Analytic bond-order potentials (BOPs) provide a way to compute atomistic properties with controllable accuracy. For large-scale computations of heterogeneous compounds at the atomistic level, both the computational efficiency and memory demand of BOP implementations have to be optimized. Since the evaluation of BOPs is a local operation within a finite environment, the parallelization concepts known from short-range interacting particle simulations can be applied to improve the performance of these simulations. In this work, several efficient parallelization methods for BOPs that use three-dimensional domain decomposition schemes are described. The schemes are implemented into the bond-order potential code BOPfox, and their performance is measured in a series of benchmarks. Systems of up to several millions of atoms are simulated on a high performance computing system, and parallel scaling is demonstrated for up to thousands of processors.
Application of Direct Parallel Methods to Reconstruction and Forecasting Problems

NASA Astrophysics Data System (ADS)

Song, Changgeun

Many important physical processes in nature are represented by partial differential equations. Numerical weather prediction in particular, requires vast computational resources. We investigate the significance of parallel processing technology to the real world problem of atmospheric prediction. In this paper we consider the classic problem of decomposing the observed wind field into the irrotational and nondivergent components. Recognizing the fact that on a limited domain this problem has a non-unique solution, Lynch (1989) described eight different ways to accomplish the decomposition. One set of elliptic equations is associated with the decomposition--this determines the initial nondivergent state for the forecast model. It is shown that the entire decomposition problem can be solved in a fraction of a second using multi-vector processor such as ALLIANT FX/8. Secondly, the barotropic model is used to track hurricanes. Also, one set of elliptic equations is solved to recover the streamfunction from the forecasted vorticity. A 72 h prediction of Elena is made while it is in the Gulf of Mexico. During this time the hurricane executes a dramatic re-curvature that is captured by the model. Furthermore, an improvement in the track prediction results when a simple assimilation strategy is used. This technique makes use of the wind fields in the 24 h period immediately preceding the initial time for the prediction. In this particular application, solutions to systems of elliptic equations are the center of the computational mechanics. We demonstrate that direct, parallel methods based on accelerated block cyclic reduction (BCR) significantly reduce the computational time required to solve the elliptic equations germane to the decomposition, the forecast and adjoint assimilation.
A Unified View of Global Instabilities of Compressible Flow Over Open Cavities

DTIC Science & Technology

2005-06-30

the early work of Rossiter [3], have treated the shear-layer emanating from the upstream comer of the cavity in isolation ( using parallel flow... using a domain-decomposition method. The code has optional equation sets to solve either (i) nonlinear Navier-Stokes, (ii) Navier-Stokes equations...early experments of Maull and East [15]. They used oil flow visualization of surface streamlines on the cavity bottom to show the existence, under certain
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation

NASA Astrophysics Data System (ADS)

Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.

2011-03-01

A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
Multitasking the three-dimensional transport code TORT on CRAY platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Azmy, Y.Y.; Barnett, D.A.; Burre, C.A.

1996-04-01

The multitasking options in the three-dimensional neutral particle transport code TORT originally implemented for Cray`s CTSS operating system are revived and extended to run on Cray Y/MP and C90 computers using the UNICOS operating system. These include two coarse-grained domain decompositions; across octants, and across directions within an octant, termed Octant Parallel (OP), and Direction Parallel (DP), respectively. Parallel performance of the DP is significantly enhanced by increasing the task grain size and reducing load imbalance via dynamic scheduling of the discrete angles among the participating tasks. Substantial Wall Clock speedup factors, approaching 4.5 using 8 tasks, have been measuredmore » in a time-sharing environment, and generally depend on the test problem specifications, number of tasks, and machine loading during execution.« less
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Taylor, Arthur C., III

1994-01-01

This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

NASA Technical Reports Server (NTRS)

Mavriplis, Dimitri J.

1998-01-01

A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Marine Controlled-Source Electromagnetic 2D Inversion for synthetic models.

NASA Astrophysics Data System (ADS)

Liu, Y.; Li, Y.

2016-12-01

We present a 2D inverse algorithm for frequency domain marine controlled-source electromagnetic (CSEM) data, which is based on the regularized Gauss-Newton approach. As a forward solver, our parallel adaptive finite element forward modeling program is employed. It is a self-adaptive, goal-oriented grid refinement algorithm in which a finite element analysis is performed on a sequence of refined meshes. The mesh refinement process is guided by a dual error estimate weighting to bias refinement towards elements that affect the solution at the EM receiver locations. With the use of the direct solver (MUMPS), we can effectively compute the electromagnetic fields for multi-sources and parametric sensitivities. We also implement the parallel data domain decomposition approach of Key and Ovall (2011), with the goal of being able to compute accurate responses in parallel for complicated models and a full suite of data parameters typical of offshore CSEM surveys. All minimizations are carried out by using the Gauss-Newton algorithm and model perturbations at each iteration step are obtained by using the Inexact Conjugate Gradient iteration method. Synthetic test inversions are presented.
Predicting Flows of Rarefied Gases

NASA Technical Reports Server (NTRS)

LeBeau, Gerald J.; Wilmoth, Richard G.

2005-01-01

DSMC Analysis Code (DAC) is a flexible, highly automated, easy-to-use computer program for predicting flows of rarefied gases -- especially flows of upper-atmospheric, propulsion, and vented gases impinging on spacecraft surfaces. DAC implements the direct simulation Monte Carlo (DSMC) method, which is widely recognized as standard for simulating flows at densities so low that the continuum-based equations of computational fluid dynamics are invalid. DAC enables users to model complex surface shapes and boundary conditions quickly and easily. The discretization of a flow field into computational grids is automated, thereby relieving the user of a traditionally time-consuming task while ensuring (1) appropriate refinement of grids throughout the computational domain, (2) determination of optimal settings for temporal discretization and other simulation parameters, and (3) satisfaction of the fundamental constraints of the method. In so doing, DAC ensures an accurate and efficient simulation. In addition, DAC can utilize parallel processing to reduce computation time. The domain decomposition needed for parallel processing is completely automated, and the software employs a dynamic load-balancing mechanism to ensure optimal parallel efficiency throughout the simulation.
a Non-Overlapping Discretization Method for Partial Differential Equations

NASA Astrophysics Data System (ADS)

Rosas-Medina, A.; Herrera, I.

2013-05-01

Mathematical models of many systems of interest, including very important continuous systems of Engineering and Science, lead to a great variety of partial differential equations whose solution methods are based on the computational processing of large-scale algebraic systems. Furthermore, the incredible expansion experienced by the existing computational hardware and software has made amenable to effective treatment problems of an ever increasing diversity and complexity, posed by engineering and scientific applications. The emergence of parallel computing prompted on the part of the computational-modeling community a continued and systematic effort with the purpose of harnessing it for the endeavor of solving boundary-value problems (BVPs) of partial differential equations. Very early after such an effort began, it was recognized that domain decomposition methods (DDM) were the most effective technique for applying parallel computing to the solution of partial differential equations, since such an approach drastically simplifies the coordination of the many processors that carry out the different tasks and also reduces very much the requirements of information-transmission between them. Ideally, DDMs intend producing algorithms that fulfill the DDM-paradigm; i.e., such that "the global solution is obtained by solving local problems defined separately in each subdomain of the coarse-mesh -or domain-decomposition-". Stated in a simplistic manner, the basic idea is that, when the DDM-paradigm is satisfied, full parallelization can be achieved by assigning each subdomain to a different processor. When intensive DDM research began much attention was given to overlapping DDMs, but soon after attention shifted to non-overlapping DDMs. This evolution seems natural when the DDM-paradigm is taken into account: it is easier to uncouple the local problems when the subdomains are separated. However, an important limitation of non-overlapping domain decompositions, as that concept is usually understood today, is that interface nodes are shared by two or more subdomains of the coarse-mesh and, therefore, even non-overlapping DDMs are actually overlapping when seen from the perspective of the nodes used in the discretization. In this talk we present and discuss a discretization method in which the nodes used are non-overlapping, in the sense that each one of them belongs to one and only one subdomain of the coarse-mesh.
mm_par2.0: An object-oriented molecular dynamics simulation program parallelized using a hierarchical scheme with MPI and OPENMP

NASA Astrophysics Data System (ADS)

Oh, Kwang Jin; Kang, Ji Hoon; Myung, Hun Joo

2012-02-01

We have revised a general purpose parallel molecular dynamics simulation program mm_par using the object-oriented programming. We parallelized the revised version using a hierarchical scheme in order to utilize more processors for a given system size. The benchmark result will be presented here. New version program summaryProgram title: mm_par2.0 Catalogue identifier: ADXP_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADXP_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 2 390 858 No. of bytes in distributed program, including test data, etc.: 25 068 310 Distribution format: tar.gz Programming language: C++ Computer: Any system operated by Linux or Unix Operating system: Linux Classification: 7.7 External routines: We provide wrappers for FFTW [1], Intel MKL library [2] FFT routine, and Numerical recipes [3] FFT, random number generator, and eigenvalue solver routines, SPRNG [4] random number generator, Mersenne Twister [5] random number generator, space filling curve routine. Catalogue identifier of previous version: ADXP_v1_0 Journal reference of previous version: Comput. Phys. Comm. 174 (2006) 560 Does the new version supersede the previous version?: Yes Nature of problem: Structural, thermodynamic, and dynamical properties of fluids and solids from microscopic scales to mesoscopic scales. Solution method: Molecular dynamics simulation in NVE, NVT, and NPT ensemble, Langevin dynamics simulation, dissipative particle dynamics simulation. Reasons for new version: First, object-oriented programming has been used, which is known to be open for extension and closed for modification. It is also known to be better for maintenance. Second, version 1.0 was based on atom decomposition and domain decomposition scheme [6] for parallelization. However, atom decomposition is not popular due to its poor scalability. On the other hand, domain decomposition scheme is better for scalability. It still has a limitation in utilizing a large number of cores on recent petascale computers due to the requirement that the domain size is larger than the potential cutoff distance. To go beyond such a limitation, a hierarchical parallelization scheme has been adopted in this new version and implemented using MPI [7] and OPENMP [8]. Summary of revisions: (1) Object-oriented programming has been used. (2) A hierarchical parallelization scheme has been adopted. (3) SPME routine has been fully parallelized with parallel 3D FFT using volumetric decomposition scheme [9]. K.J.O. thanks Mr. Seung Min Lee for useful discussion on programming and debugging. Running time: Running time depends on system size and methods used. For test system containing a protein (PDB id: 5DHFR) with CHARMM22 force field [10] and 7023 TIP3P [11] waters in simulation box having dimension 62.23 Å×62.23 Å×62.23 Å, the benchmark results are given in Fig. 1. Here the potential cutoff distance was set to 12 Å and the switching function was applied from 10 Å for the force calculation in real space. For the SPME [12] calculation, K, K, and K were set to 64 and the interpolation order was set to 4. To do the fast Fourier transform, we used Intel MKL library. All bonds including hydrogen atoms were constrained using SHAKE/RATTLE algorithms [13,14]. The code was compiled using Intel compiler version 11.1 and mvapich2 version 1.5. Fig. 2 shows performance gains from using CUDA-enabled version [15] of mm_par for 5DHFR simulation in water on Intel Core2Quad 2.83 GHz and GeForce GTX 580. Even though mm_par2.0 is not ported yet for GPU, its performance data would be useful to expect mm_par2.0 performance on GPU. Timing results for 1000 MD steps. 1, 2, 4, and 8 in the figure mean the number of OPENMP threads. Timing results for 1000 MD steps from double precision simulation on CPU, single precision simulation on GPU, and double precision simulation on GPU.
Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition

NASA Astrophysics Data System (ADS)

Magee, Daniel J.; Niemeyer, Kyle E.

2018-03-01

The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution difficult to achieve in reasonable amounts of time-even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2 - 9 × for a range of problem sizes, respectively, compared with simple GPU versions and 7 - 300 × compared with parallel CPU versions. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2 - 1.9 × worse than a standard implementation for all problem sizes.
Charon Toolkit for Parallel, Implicit Structured-Grid Computations: Functional Design

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.; Kutler, Paul (Technical Monitor)

1997-01-01

In a previous report the design concepts of Charon were presented. Charon is a toolkit that aids engineers in developing scientific programs for structured-grid applications to be run on MIMD parallel computers. It constitutes an augmentation of the general-purpose MPI-based message-passing layer, and provides the user with a hierarchy of tools for rapid prototyping and validation of parallel programs, and subsequent piecemeal performance tuning. Here we describe the implementation of the domain decomposition tools used for creating data distributions across sets of processors. We also present the hierarchy of parallelization tools that allows smooth translation of legacy code (or a serial design) into a parallel program. Along with the actual tool descriptions, we will present the considerations that led to the particular design choices. Many of these are motivated by the requirement that Charon must be useful within the traditional computational environments of Fortran 77 and C. Only the Fortran 77 syntax will be presented in this report.
Implementation and Characterization of Three-Dimensional Particle-in-Cell Codes on Multiple-Instruction-Multiple-Data Massively Parallel Supercomputers

NASA Technical Reports Server (NTRS)

Lyster, P. M.; Liewer, P. C.; Decyk, V. K.; Ferraro, R. D.

1995-01-01

A three-dimensional electrostatic particle-in-cell (PIC) plasma simulation code has been developed on coarse-grain distributed-memory massively parallel computers with message passing communications. Our implementation is the generalization to three-dimensions of the general concurrent particle-in-cell (GCPIC) algorithm. In the GCPIC algorithm, the particle computation is divided among the processors using a domain decomposition of the simulation domain. In a three-dimensional simulation, the domain can be partitioned into one-, two-, or three-dimensional subdomains ("slabs," "rods," or "cubes") and we investigate the efficiency of the parallel implementation of the push for all three choices. The present implementation runs on the Intel Touchstone Delta machine at Caltech; a multiple-instruction-multiple-data (MIMD) parallel computer with 512 nodes. We find that the parallel efficiency of the push is very high, with the ratio of communication to computation time in the range 0.3%-10.0%. The highest efficiency (> 99%) occurs for a large, scaled problem with 64(sup 3) particles per processing node (approximately 134 million particles of 512 nodes) which has a push time of about 250 ns per particle per time step. We have also developed expressions for the timing of the code which are a function of both code parameters (number of grid points, particles, etc.) and machine-dependent parameters (effective FLOP rate, and the effective interprocessor bandwidths for the communication of particles and grid points). These expressions can be used to estimate the performance of scaled problems--including those with inhomogeneous plasmas--to other parallel machines once the machine-dependent parameters are known.

Interface conditions for domain decomposition with radical grid refinement

NASA Technical Reports Server (NTRS)

Scroggs, Jeffrey S.

1991-01-01

Interface conditions for coupling the domains in a physically motivated domain decomposition method are discussed. The domain decomposition is based on an asymptotic-induced method for the numerical solution of hyperbolic conservation laws with small viscosity. The method consists of multiple stages. The first stage is to obtain a first approximation using a first-order method, such as the Godunov scheme. Subsequent stages of the method involve solving internal-layer problem via a domain decomposition. The method is derived and justified via singular perturbation techniques.
Parallel discontinuous Galerkin FEM for computing hyperbolic conservation law on unstructured grids

NASA Astrophysics Data System (ADS)

Ma, Xinrong; Duan, Zhijian

2018-04-01

High-order resolution Discontinuous Galerkin finite element methods (DGFEM) has been known as a good method for solving Euler equations and Navier-Stokes equations on unstructured grid, but it costs too much computational resources. An efficient parallel algorithm was presented for solving the compressible Euler equations. Moreover, the multigrid strategy based on three-stage three-order TVD Runge-Kutta scheme was used in order to improve the computational efficiency of DGFEM and accelerate the convergence of the solution of unsteady compressible Euler equations. In order to make each processor maintain load balancing, the domain decomposition method was employed. Numerical experiment performed for the inviscid transonic flow fluid problems around NACA0012 airfoil and M6 wing. The results indicated that our parallel algorithm can improve acceleration and efficiency significantly, which is suitable for calculating the complex flow fluid.
Data decomposition method for parallel polygon rasterization considering load balancing

NASA Astrophysics Data System (ADS)

Zhou, Chen; Chen, Zhenjie; Liu, Yongxue; Li, Feixue; Cheng, Liang; Zhu, A.-xing; Li, Manchun

2015-12-01

It is essential to adopt parallel computing technology to rapidly rasterize massive polygon data. In parallel rasterization, it is difficult to design an effective data decomposition method. Conventional methods ignore load balancing of polygon complexity in parallel rasterization and thus fail to achieve high parallel efficiency. In this paper, a novel data decomposition method based on polygon complexity (DMPC) is proposed. First, four factors that possibly affect the rasterization efficiency were investigated. Then, a metric represented by the boundary number and raster pixel number in the minimum bounding rectangle was developed to calculate the complexity of each polygon. Using this metric, polygons were rationally allocated according to the polygon complexity, and each process could achieve balanced loads of polygon complexity. To validate the efficiency of DMPC, it was used to parallelize different polygon rasterization algorithms and tested on different datasets. Experimental results showed that DMPC could effectively parallelize polygon rasterization algorithms. Furthermore, the implemented parallel algorithms with DMPC could achieve good speedup ratios of at least 15.69 and generally outperformed conventional decomposition methods in terms of parallel efficiency and load balancing. In addition, the results showed that DMPC exhibited consistently better performance for different spatial distributions of polygons.
High-Fidelity, Computational Modeling of Non-Equilibrium Discharges for Combustion Applications

DTIC Science & Technology

2013-10-01

gradient reconstruction)  4th order RK time integration  Domain decomposition parallel enabled Plasma chemistry mechanism 22  Methane-air... plasma chemistry mechanism  Species and pathways relevant to plasma time scale (~10’s ns)  26 Species : E, O, N2 , O2 , H , N2+ , O2+ , N4+ , O4...Photoionization (3-term Helmholtz equation model) 0.0067 0.0447 0.0346 0.1121 0.3059 0.5994 Plasma chemistry mechanism used in studies 81
Domain decomposition in time for PDE-constrained optimization

DOE PAGES

Barker, Andrew T.; Stoll, Martin

2015-08-28

Here, PDE-constrained optimization problems have a wide range of applications, but they lead to very large and ill-conditioned linear systems, especially if the problems are time dependent. In this paper we outline an approach for dealing with such problems by decomposing them in time and applying an additive Schwarz preconditioner in time, so that we can take advantage of parallel computers to deal with the very large linear systems. We then illustrate the performance of our method on a variety of problems.
Matching pursuit parallel decomposition of seismic data

NASA Astrophysics Data System (ADS)

Li, Chuanhui; Zhang, Fanchang

2017-07-01

In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
Extending substructure based iterative solvers to multiple load and repeated analyses

NASA Technical Reports Server (NTRS)

Farhat, Charbel

1993-01-01

Direct solvers currently dominate commercial finite element structural software, but do not scale well in the fine granularity regime targeted by emerging parallel processors. Substructure based iterative solvers--often called also domain decomposition algorithms--lend themselves better to parallel processing, but must overcome several obstacles before earning their place in general purpose structural analysis programs. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses and in implicit linear dynamics computations. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers in general are ill-suited for these problems because they often must restart from scratch for every different right hand side. In this paper, we present a methodology for extending the range of applications of domain decomposition methods to problems with multiple or repeated right hand sides. Basically, we formulate the overall problem as a series of minimization problems over K-orthogonal and supplementary subspaces, and tailor the preconditioned conjugate gradient algorithm to solve them efficiently. The resulting solution method is scalable, whereas direct factorization schemes and forward and backward substitution algorithms are not. We illustrate the proposed methodology with the solution of static and dynamic structural problems, and highlight its potential to outperform forward and backward substitutions on parallel computers. As an example, we show that for a linear structural dynamics problem with 11640 degrees of freedom, every time-step beyond time-step 15 is solved in a single iteration and consumes 1.0 second on a 32 processor iPSC-860 system; for the same problem and the same parallel processor, a pair of forward/backward substitutions at each step consumes 15.0 seconds.
Supercomputer simulations of structure formation in the Universe

NASA Astrophysics Data System (ADS)

Ishiyama, Tomoaki

2017-06-01

We describe the implementation and performance results of our massively parallel MPI†/OpenMP‡ hybrid TreePM code for large-scale cosmological N-body simulations. For domain decomposition, a recursive multi-section algorithm is used and the size of domains are automatically set so that the total calculation time is the same for all processes. We developed a highly-tuned gravity kernel for short-range forces, and a novel communication algorithm for long-range forces. For two trillion particles benchmark simulation, the average performance on the fullsystem of K computer (82,944 nodes, the total number of core is 663,552) is 5.8 Pflops, which corresponds to 55% of the peak speed.
Convolution of large 3D images on GPU and its decomposition

NASA Astrophysics Data System (ADS)

Karas, Pavel; Svoboda, David

2011-12-01

In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
Implementation of a Message Passing Interface into a Cloud-Resolving Model for Massively Parallel Computing

NASA Technical Reports Server (NTRS)

Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve

2004-01-01

The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for the MPP with MPI uses the concept of maintaining similar code structure between the whole domain as well as the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). Also, it provides for minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain could be sliced into one- or two-dimensional decomposition with a halo regime, which is overlaid on partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducible purposes, transposing data among tasks is required for spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions have speedups of about 99% up to 256 tasks but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computations than that of the compressible version; 3) equal or approximately-equal numbers of slices between the x- and y- directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
The Distributed Diagonal Force Decomposition Method for Parallelizing Molecular Dynamics Simulations

PubMed Central

Boršnik, Urban; Miller, Benjamin T.; Brooks, Bernard R.; Janežič, Dušanka

2011-01-01

Parallelization is an effective way to reduce the computational time needed for molecular dynamics simulations. We describe a new parallelization method, the distributed-diagonal force decomposition method, with which we extend and improve the existing force decomposition methods. Our new method requires less data communication during molecular dynamics simulations than replicated data and current force decomposition methods, increasing the parallel efficiency. It also dynamically load-balances the processors' computational load throughout the simulation. The method is readily implemented in existing molecular dynamics codes and it has been incorporated into the CHARMM program, allowing its immediate use in conjunction with the many molecular dynamics simulation techniques that are already present in the program. We also present the design of the Force Decomposition Machine, a cluster of personal computers and networks that is tailored to running molecular dynamics simulations using the distributed diagonal force decomposition method. The design is expandable and provides various degrees of fault resilience. This approach is easily adaptable to computers with Graphics Processing Units because it is independent of the processor type being used. PMID:21793007
Discussion summary: Fictitious domain methods

NASA Technical Reports Server (NTRS)

Glowinski, Rowland; Rodrigue, Garry

1991-01-01

Fictitious Domain methods are constructed in the following manner: Suppose a partial differential equation is to be solved on an open bounded set, Omega, in 2-D or 3-D. Let R be a rectangle domain containing the closure of Omega. The partial differential equation is first solved on R. Using the solution on R, the solution of the equation on Omega is then recovered by some procedure. The advantage of the fictitious domain method is that in many cases the solution of a partial differential equation on a rectangular region is easier to compute than on a nonrectangular region. Fictitious domain methods for solving elliptic PDEs on general regions are also very efficient when used on a parallel computer. The reason is that one can use the many domain decomposition methods that are available for solving the PDE on the fictitious rectangular region. The discussion on fictitious domain methods began with a talk by R. Glowinski in which he gave some examples of a variational approach to ficititious domain methods for solving the Helmholtz and Navier-Stokes equations.
A physics-motivated Centroidal Voronoi Particle domain decomposition method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fu, Lin, E-mail: lin.fu@tum.de; Hu, Xiangyu Y., E-mail: xiangyu.hu@tum.de; Adams, Nikolaus A., E-mail: nikolaus.adams@tum.de

2017-04-15

In this paper, we propose a novel domain decomposition method for large-scale simulations in continuum mechanics by merging the concepts of Centroidal Voronoi Tessellation (CVT) and Voronoi Particle dynamics (VP). The CVT is introduced to achieve a high-level compactness of the partitioning subdomains by the Lloyd algorithm which monotonically decreases the CVT energy. The number of computational elements between neighboring partitioning subdomains, which scales the communication effort for parallel simulations, is optimized implicitly as the generated partitioning subdomains are convex and simply connected with small aspect-ratios. Moreover, Voronoi Particle dynamics employing physical analogy with a tailored equation of state ismore » developed, which relaxes the particle system towards the target partition with good load balance. Since the equilibrium is computed by an iterative approach, the partitioning subdomains exhibit locality and the incremental property. Numerical experiments reveal that the proposed Centroidal Voronoi Particle (CVP) based algorithm produces high-quality partitioning with high efficiency, independently of computational-element types. Thus it can be used for a wide range of applications in computational science and engineering.« less
A physics-motivated Centroidal Voronoi Particle domain decomposition method

NASA Astrophysics Data System (ADS)

Fu, Lin; Hu, Xiangyu Y.; Adams, Nikolaus A.

2017-04-01

In this paper, we propose a novel domain decomposition method for large-scale simulations in continuum mechanics by merging the concepts of Centroidal Voronoi Tessellation (CVT) and Voronoi Particle dynamics (VP). The CVT is introduced to achieve a high-level compactness of the partitioning subdomains by the Lloyd algorithm which monotonically decreases the CVT energy. The number of computational elements between neighboring partitioning subdomains, which scales the communication effort for parallel simulations, is optimized implicitly as the generated partitioning subdomains are convex and simply connected with small aspect-ratios. Moreover, Voronoi Particle dynamics employing physical analogy with a tailored equation of state is developed, which relaxes the particle system towards the target partition with good load balance. Since the equilibrium is computed by an iterative approach, the partitioning subdomains exhibit locality and the incremental property. Numerical experiments reveal that the proposed Centroidal Voronoi Particle (CVP) based algorithm produces high-quality partitioning with high efficiency, independently of computational-element types. Thus it can be used for a wide range of applications in computational science and engineering.
Development of Fast Algorithms Using Recursion, Nesting and Iterations for Computational Electromagnetics

NASA Technical Reports Server (NTRS)

Chew, W. C.; Song, J. M.; Lu, C. C.; Weedon, W. H.

1995-01-01

In the first phase of our work, we have concentrated on laying the foundation to develop fast algorithms, including the use of recursive structure like the recursive aggregate interaction matrix algorithm (RAIMA), the nested equivalence principle algorithm (NEPAL), the ray-propagation fast multipole algorithm (RPFMA), and the multi-level fast multipole algorithm (MLFMA). We have also investigated the use of curvilinear patches to build a basic method of moments code where these acceleration techniques can be used later. In the second phase, which is mainly reported on here, we have concentrated on implementing three-dimensional NEPAL on a massively parallel machine, the Connection Machine CM-5, and have been able to obtain some 3D scattering results. In order to understand the parallelization of codes on the Connection Machine, we have also studied the parallelization of 3D finite-difference time-domain (FDTD) code with PML material absorbing boundary condition (ABC). We found that simple algorithms like the FDTD with material ABC can be parallelized very well allowing us to solve within a minute a problem of over a million nodes. In addition, we have studied the use of the fast multipole method and the ray-propagation fast multipole algorithm to expedite matrix-vector multiplication in a conjugate-gradient solution to integral equations of scattering. We find that these methods are faster than LU decomposition for one incident angle, but are slower than LU decomposition when many incident angles are needed as in the monostatic RCS calculations.
Multilevel decomposition of complete vehicle configuration in a parallel computing environment

NASA Technical Reports Server (NTRS)

Bhatt, Vinay; Ragsdell, K. M.

1989-01-01

This research summarizes various approaches to multilevel decomposition to solve large structural problems. A linear decomposition scheme based on the Sobieski algorithm is selected as a vehicle for automated synthesis of a complete vehicle configuration in a parallel processing environment. The research is in a developmental state. Preliminary numerical results are presented for several example problems.
A mixed parallel strategy for the solution of coupled multi-scale problems at finite strains

NASA Astrophysics Data System (ADS)

Lopes, I. A. Rodrigues; Pires, F. M. Andrade; Reis, F. J. P.

2018-02-01

A mixed parallel strategy for the solution of homogenization-based multi-scale constitutive problems undergoing finite strains is proposed. The approach aims to reduce the computational time and memory requirements of non-linear coupled simulations that use finite element discretization at both scales (FE^2). In the first level of the algorithm, a non-conforming domain decomposition technique, based on the FETI method combined with a mortar discretization at the interface of macroscopic subdomains, is employed. A master-slave scheme, which distributes tasks by macroscopic element and adopts dynamic scheduling, is then used for each macroscopic subdomain composing the second level of the algorithm. This strategy allows the parallelization of FE^2 simulations in computers with either shared memory or distributed memory architectures. The proposed strategy preserves the quadratic rates of asymptotic convergence that characterize the Newton-Raphson scheme. Several examples are presented to demonstrate the robustness and efficiency of the proposed parallel strategy.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Gamblin, T; de Supinski, B R; Schulz, M

Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally-ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Doing this instrumentation naively can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can induce severe behavioral perturbations. We present and evaluate a novel technique for scalable, low-error load balance measurement. This uses a parallel wavelet transformmore » and other parallel encoding methods. We show that our technique collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size and data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.« less
Iterative load-balancing method with multigrid level relaxation for particle simulation with short-range interactions

NASA Astrophysics Data System (ADS)

Furuichi, Mikito; Nishiura, Daisuke

2017-10-01

We developed dynamic load-balancing algorithms for Particle Simulation Methods (PSM) involving short-range interactions, such as Smoothed Particle Hydrodynamics (SPH), Moving Particle Semi-implicit method (MPS), and Discrete Element method (DEM). These are needed to handle billions of particles modeled in large distributed-memory computer systems. Our method utilizes flexible orthogonal domain decomposition, allowing the sub-domain boundaries in the column to be different for each row. The imbalances in the execution time between parallel logical processes are treated as a nonlinear residual. Load-balancing is achieved by minimizing the residual within the framework of an iterative nonlinear solver, combined with a multigrid technique in the local smoother. Our iterative method is suitable for adjusting the sub-domain frequently by monitoring the performance of each computational process because it is computationally cheaper in terms of communication and memory costs than non-iterative methods. Numerical tests demonstrated the ability of our approach to handle workload imbalances arising from a non-uniform particle distribution, differences in particle types, or heterogeneous computer architecture which was difficult with previously proposed methods. We analyzed the parallel efficiency and scalability of our method using Earth simulator and K-computer supercomputer systems.
Domain decomposition for aerodynamic and aeroacoustic analyses, and optimization

NASA Technical Reports Server (NTRS)

Baysal, Oktay

1995-01-01

The overarching theme was the domain decomposition, which intended to improve the numerical solution technique for the partial differential equations at hand; in the present study, those that governed either the fluid flow, or the aeroacoustic wave propagation, or the sensitivity analysis for a gradient-based optimization. The role of the domain decomposition extended beyond the original impetus of discretizing geometrical complex regions or writing modular software for distributed-hardware computers. It induced function-space decompositions and operator decompositions that offered the valuable property of near independence of operator evaluation tasks. The objectives have gravitated about the extensions and implementations of either the previously developed or concurrently being developed methodologies: (1) aerodynamic sensitivity analysis with domain decomposition (SADD); (2) computational aeroacoustics of cavities; and (3) dynamic, multibody computational fluid dynamics using unstructured meshes.

A Multi-Level Parallelization Concept for High-Fidelity Multi-Block Solvers

NASA Technical Reports Server (NTRS)

Hatay, Ferhat F.; Jespersen, Dennis C.; Guruswamy, Guru P.; Rizk, Yehia M.; Byun, Chansup; Gee, Ken; VanDalsem, William R. (Technical Monitor)

1997-01-01

The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools with the industrial design process benefits greatly from the robust implementations that are transportable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented into the widely-used NASA multi-block Computational Fluid Dynamics (CFD) packages implemented in ENSAERO and OVERFLOW. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity in data partitioning as well as data coalescing to obtain the desired load-balance characteristics on the available computer platforms. This multi-level parallelism implementation itself introduces no changes to the numerical results, hence the original fidelity of the packages are identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory accessing. By choosing an appropriate combination of the available partitioning and coalescing capabilities only during the execution stage, the PENS solver becomes adaptable to different computer architectures from shared-memory to distributed-memory platforms with varying degrees of parallelism. The PENS implementation on the IBM SP2 distributed memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains using up to 128 wide computational nodes. Multi-block CFD simulations of complete aircraft simulations achieve 75 percent perfect load-balanced executions using data coalescing and the two levels of parallelism. SGI PowerChallenge, SGI Origin 2000, and a cluster of workstations are the other platforms where the robustness of the implementation is tested. The performance behavior on the other computer platforms with a variety of realistic problems will be included as this on-going study progresses.
Integrated Network Decompositions and Dynamic Programming for Graph Optimization (INDDGO)

DOE Office of Scientific and Technical Information (OSTI.GOV)

The INDDGO software package offers a set of tools for finding exact solutions to graph optimization problems via tree decompositions and dynamic programming algorithms. Currently the framework offers serial and parallel (distributed memory) algorithms for finding tree decompositions and solving the maximum weighted independent set problem. The parallel dynamic programming algorithm is implemented on top of the MADNESS task-based runtime.
Parallel computation and the basis system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, G.R.

1993-05-01

A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communications costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis andmore » Parallel Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
RIACS

NASA Technical Reports Server (NTRS)

Oliger, Joseph

1997-01-01

Topics considered include: high-performance computing; cognitive and perceptual prostheses (computational aids designed to leverage human abilities); autonomous systems. Also included: development of a 3D unstructured grid code based on a finite volume formulation and applied to the Navier-stokes equations; Cartesian grid methods for complex geometry; multigrid methods for solving elliptic problems on unstructured grids; algebraic non-overlapping domain decomposition methods for compressible fluid flow problems on unstructured meshes; numerical methods for the compressible navier-stokes equations with application to aerodynamic flows; research in aerodynamic shape optimization; S-HARP: a parallel dynamic spectral partitioner; numerical schemes for the Hamilton-Jacobi and level set equations on triangulated domains; application of high-order shock capturing schemes to direct simulation of turbulence; multicast technology; network testbeds; supercomputer consolidation project.
Planning paths through a spatial hierarchy - Eliminating stair-stepping effects

NASA Technical Reports Server (NTRS)

Slack, Marc G.

1989-01-01

Stair-stepping effects are a result of the loss of spatial continuity resulting from the decomposition of space into a grid. This paper presents a path planning algorithm which eliminates stair-stepping effects induced by the grid-based spatial representation. The algorithm exploits a hierarchical spatial model to efficiently plan paths for a mobile robot operating in dynamic domains. The spatial model and path planning algorithm map to a parallel machine, allowing the system to operate incrementally, thereby accounting for unexpected events in the operating space.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Jalas, S.; Dornmair, I.; Lehe, R.

Particle in Cell (PIC) simulations are a widely used tool for the investigation of both laser- and beam-driven plasma acceleration. It is a known issue that the beam quality can be artificially degraded by numerical Cherenkov radiation (NCR) resulting primarily from an incorrectly modeled dispersion relation. Pseudo-spectral solvers featuring infinite order stencils can strongly reduce NCR - or even suppress it - and are therefore well suited to correctly model the beam properties. For efficient parallelization of the PIC algorithm, however, localized solvers are inevitable. Arbitrary order pseudo-spectral methods provide this needed locality. Yet, these methods can again be pronemore » to NCR. Here in this paper, we show that acceptably low solver orders are sufficient to correctly model the physics of interest, while allowing for parallel computation by domain decomposition.« less
Sublattice parallel replica dynamics.

PubMed

Martínez, Enrique; Uberuaga, Blas P; Voter, Arthur F

2014-06-01

Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998)] by combining it with the synchronous sublattice approach of Shim and Amar [ and , Phys. Rev. B 71, 125432 (2005)], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.
A compositional reservoir simulator on distributed memory parallel computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rame, M.; Delshad, M.

1995-12-31

This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. Amore » portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.« less
Parallelisation study of a three-dimensional environmental flow model

NASA Astrophysics Data System (ADS)

O'Donncha, Fearghal; Ragnoli, Emanuele; Suits, Frank

2014-03-01

There are many simulation codes in the geosciences that are serial and cannot take advantage of the parallel computational resources commonly available today. One model important for our work in coastal ocean current modelling is EFDC, a Fortran 77 code configured for optimal deployment on vector computers. In order to take advantage of our cache-based, blade computing system we restructured EFDC from serial to parallel, thereby allowing us to run existing models more quickly, and to simulate larger and more detailed models that were previously impractical. Since the source code for EFDC is extensive and involves detailed computation, it is important to do such a port in a manner that limits changes to the files, while achieving the desired speedup. We describe a parallelisation strategy involving surgical changes to the source files to minimise error-prone alteration of the underlying computations, while allowing load-balanced domain decomposition for efficient execution on a commodity cluster. The use of conjugate gradient posed particular challenges due to implicit non-local communication posing a hindrance to standard domain partitioning schemes; a number of techniques are discussed to address this in a feasible, computationally efficient manner. The parallel implementation demonstrates good scalability in combination with a novel domain partitioning scheme that specifically handles mixed water/land regions commonly found in coastal simulations. The approach presented here represents a practical methodology to rejuvenate legacy code on a commodity blade cluster with reasonable effort; our solution has direct application to other similar codes in the geosciences.
Parallel SOR methods with a parabolic-diffusion acceleration technique for solving an unstructured-grid Poisson equation on 3D arbitrary geometries

NASA Astrophysics Data System (ADS)

Zapata, M. A. Uh; Van Bang, D. Pham; Nguyen, K. D.

2016-05-01

This paper presents a parallel algorithm for the finite-volume discretisation of the Poisson equation on three-dimensional arbitrary geometries. The proposed method is formulated by using a 2D horizontal block domain decomposition and interprocessor data communication techniques with message passing interface. The horizontal unstructured-grid cells are reordered according to the neighbouring relations and decomposed into blocks using a load-balanced distribution to give all processors an equal amount of elements. In this algorithm, two parallel successive over-relaxation methods are presented: a multi-colour ordering technique for unstructured grids based on distributed memory and a block method using reordering index following similar ideas of the partitioning for structured grids. In all cases, the parallel algorithms are implemented with a combination of an acceleration iterative solver. This solver is based on a parabolic-diffusion equation introduced to obtain faster solutions of the linear systems arising from the discretisation. Numerical results are given to evaluate the performances of the methods showing speedups better than linear.
Parallel computation and the Basis system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, G.R.

1992-12-16

A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to-use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communication costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis and Parallelmore » Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less
Performance of a parallel thermal-hydraulics code TEMPEST

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fann, G.I.; Trent, D.S.

The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scalingmore » performance of 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) are presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on an 128 processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.« less
Rank-based decompositions of morphological templates.

PubMed

Sussner, P; Ritter, G X

2000-01-01

Methods for matrix decomposition have found numerous applications in image processing, in particular for the problem of template decomposition. Since existing matrix decomposition techniques are mainly concerned with the linear domain, we consider it timely to investigate matrix decomposition techniques in the nonlinear domain with applications in image processing. The mathematical basis for these investigations is the new theory of rank within minimax algebra. Thus far, only minimax decompositions of rank 1 and rank 2 matrices into outer product expansions are known to the image processing community. We derive a heuristic algorithm for the decomposition of matrices having arbitrary rank.
Optimization by nonhierarchical asynchronous decomposition

NASA Technical Reports Server (NTRS)

Shankar, Jayashree; Ribbens, Calvin J.; Haftka, Raphael T.; Watson, Layne T.

1992-01-01

Large scale optimization problems are tractable only if they are somehow decomposed. Hierarchical decompositions are inappropriate for some types of problems and do not parallelize well. Sobieszczanski-Sobieski has proposed a nonhierarchical decomposition strategy for nonlinear constrained optimization that is naturally parallel. Despite some successes on engineering problems, the algorithm as originally proposed fails on simple two dimensional quadratic programs. The algorithm is carefully analyzed for quadratic programs, and a number of modifications are suggested to improve its robustness.
A Screen Space GPGPU Surface LIC Algorithm for Distributed Memory Data Parallel Sort Last Rendering Infrastructures

NASA Astrophysics Data System (ADS)

Loring, B.; Karimabadi, H.; Rortershteyn, V.

2015-10-01

The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.
Accelerating the discovery of space-time patterns of infectious diseases using parallel computing.

PubMed

Hohl, Alexander; Delmelle, Eric; Tang, Wenwu; Casas, Irene

2016-11-01

Infectious diseases have complex transmission cycles, and effective public health responses require the ability to monitor outbreaks in a timely manner. Space-time statistics facilitate the discovery of disease dynamics including rate of spread and seasonal cyclic patterns, but are computationally demanding, especially for datasets of increasing size, diversity and availability. High-performance computing reduces the effort required to identify these patterns, however heterogeneity in the data must be accounted for. We develop an adaptive space-time domain decomposition approach for parallel computation of the space-time kernel density. We apply our methodology to individual reported dengue cases from 2010 to 2011 in the city of Cali, Colombia. The parallel implementation reaches significant speedup compared to sequential counterparts. Density values are visualized in an interactive 3D environment, which facilitates the identification and communication of uneven space-time distribution of disease events. Our framework has the potential to enhance the timely monitoring of infectious diseases. Copyright © 2016 Elsevier Ltd. All rights reserved.
A Screen Space GPGPU Surface LIC Algorithm for Distributed Memory Data Parallel Sort Last Rendering Infrastructures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim

2014-07-01

The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not.more » We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.« less
Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation

NASA Astrophysics Data System (ADS)

Wu, Baodong; Li, Shigang; Zhang, Yunquan; Nie, Ningming

2017-02-01

The parallel Kinetic Monte Carlo (KMC) algorithm based on domain decomposition has been widely used in large-scale physical simulations. However, the communication overhead of the parallel KMC algorithm is critical, and severely degrades the overall performance and scalability. In this paper, we present a hybrid optimization strategy to reduce the communication overhead for the parallel KMC simulations. We first propose a communication aggregation algorithm to reduce the total number of messages and eliminate the communication redundancy. Then, we utilize the shared memory to reduce the memory copy overhead of the intra-node communication. Finally, we optimize the communication scheduling using the neighborhood collective operations. We demonstrate the scalability and high performance of our hybrid optimization strategy by both theoretical and experimental analysis. Results show that the optimized KMC algorithm exhibits better performance and scalability than the well-known open-source library-SPPARKS. On 32-node Xeon E5-2680 cluster (total 640 cores), the optimized algorithm reduces the communication time by 24.8% compared with SPPARKS.
THC-MP: High performance numerical simulation of reactive transport and multiphase flow in porous media

NASA Astrophysics Data System (ADS)

Wei, Xiaohui; Li, Weishan; Tian, Hailong; Li, Hongliang; Xu, Haixiao; Xu, Tianfu

2015-07-01

The numerical simulation of multiphase flow and reactive transport in the porous media on complex subsurface problem is a computationally intensive application. To meet the increasingly computational requirements, this paper presents a parallel computing method and architecture. Derived from TOUGHREACT that is a well-established code for simulating subsurface multi-phase flow and reactive transport problems, we developed a high performance computing THC-MP based on massive parallel computer, which extends greatly on the computational capability for the original code. The domain decomposition method was applied to the coupled numerical computing procedure in the THC-MP. We designed the distributed data structure, implemented the data initialization and exchange between the computing nodes and the core solving module using the hybrid parallel iterative and direct solver. Numerical accuracy of the THC-MP was verified through a CO2 injection-induced reactive transport problem by comparing the results obtained from the parallel computing and sequential computing (original code). Execution efficiency and code scalability were examined through field scale carbon sequestration applications on the multicore cluster. The results demonstrate successfully the enhanced performance using the THC-MP on parallel computing facilities.
Accurate modeling of plasma acceleration with arbitrary order pseudo-spectral particle-in-cell methods

DOE PAGES

Jalas, S.; Dornmair, I.; Lehe, R.; ...

2017-03-20

Particle in Cell (PIC) simulations are a widely used tool for the investigation of both laser- and beam-driven plasma acceleration. It is a known issue that the beam quality can be artificially degraded by numerical Cherenkov radiation (NCR) resulting primarily from an incorrectly modeled dispersion relation. Pseudo-spectral solvers featuring infinite order stencils can strongly reduce NCR - or even suppress it - and are therefore well suited to correctly model the beam properties. For efficient parallelization of the PIC algorithm, however, localized solvers are inevitable. Arbitrary order pseudo-spectral methods provide this needed locality. Yet, these methods can again be pronemore » to NCR. Here in this paper, we show that acceptably low solver orders are sufficient to correctly model the physics of interest, while allowing for parallel computation by domain decomposition.« less

A Parallel Cartesian Approach for External Aerodynamics of Vehicles with Complex Geometry

NASA Technical Reports Server (NTRS)

Aftosmis, M. J.; Berger, M. J.; Adomavicius, G.

2001-01-01

This workshop paper presents the current status in the development of a new approach for the solution of the Euler equations on Cartesian meshes with embedded boundaries in three dimensions on distributed and shared memory architectures. The approach uses adaptively refined Cartesian hexahedra to fill the computational domain. Where these cells intersect the geometry, they are cut by the boundary into arbitrarily shaped polyhedra which receive special treatment by the solver. The presentation documents a newly developed multilevel upwind solver based on a flexible domain-decomposition strategy. One novel aspect of the work is its use of space-filling curves (SFC) for memory efficient on-the-fly parallelization, dynamic re-partitioning and automatic coarse mesh generation. Within each subdomain the approach employs a variety reordering techniques so that relevant data are on the same page in memory permitting high-performance on cache-based processors. Details of the on-the-fly SFC based partitioning are presented as are construction rules for the automatic coarse mesh generation. After describing the approach, the paper uses model problems and 3- D configurations to both verify and validate the solver. The model problems demonstrate that second-order accuracy is maintained despite the presence of the irregular cut-cells in the mesh. In addition, it examines both parallel efficiency and convergence behavior. These investigations demonstrate a parallel speed-up in excess of 28 on 32 processors of an SGI Origin 2000 system and confirm that mesh partitioning has no effect on convergence behavior.
A New Domain Decomposition Approach for the Gust Response Problem

NASA Technical Reports Server (NTRS)

Scott, James R.; Atassi, Hafiz M.; Susan-Resiga, Romeo F.

2002-01-01

A domain decomposition method is developed for solving the aerodynamic/aeroacoustic problem of an airfoil in a vortical gust. The computational domain is divided into inner and outer regions wherein the governing equations are cast in different forms suitable for accurate computations in each region. Boundary conditions which ensure continuity of pressure and velocity are imposed along the interface separating the two regions. A numerical study is presented for reduced frequencies ranging from 0.1 to 3.0. It is seen that the domain decomposition approach in providing robust and grid independent solutions.
Optimal domain decomposition strategies

NASA Technical Reports Server (NTRS)

Yoon, Yonghyun; Soni, Bharat K.

1995-01-01

The primary interest of the authors is in the area of grid generation, in particular, optimal domain decomposition about realistic configurations. A grid generation procedure with optimal blocking strategies has been developed to generate multi-block grids for a circular-to-rectangular transition duct. The focus of this study is the domain decomposition which optimizes solution algorithm/block compatibility based on geometrical complexities as well as the physical characteristics of flow field. The progress realized in this study is summarized in this paper.
Using SpF to Achieve Petascale for Legacy Pseudospectral Applications

NASA Technical Reports Server (NTRS)

Clune, Thomas L.; Jiang, Weiyuan

2014-01-01

Pseudospectral (PS) methods possess a number of characteristics (e.g., efficiency, accuracy, natural boundary conditions) that are extremely desirable for dynamo models. Unfortunately, dynamo models based upon PS methods face a number of daunting challenges, which include exposing additional parallelism, leveraging hardware accelerators, exploiting hybrid parallelism, and improving the scalability of global memory transposes. Although these issues are a concern for most models, solutions for PS methods tend to require far more pervasive changes to underlying data and control structures. Further, improvements in performance in one model are difficult to transfer to other models, resulting in significant duplication of effort across the research community. We have developed an extensible software framework for pseudospectral methods called SpF that is intended to enable extreme scalability and optimal performance. Highlevel abstractions provided by SpF unburden applications of the responsibility of managing domain decomposition and load balance while reducing the changes in code required to adapt to new computing architectures. The key design concept in SpF is that each phase of the numerical calculation is partitioned into disjoint numerical kernels that can be performed entirely inprocessor. The granularity of domain decomposition provided by SpF is only constrained by the datalocality requirements of these kernels. SpF builds on top of optimized vendor libraries for common numerical operations such as transforms, matrix solvers, etc., but can also be configured to use open source alternatives for portability. SpF includes several alternative schemes for global data redistribution and is expected to serve as an ideal testbed for further research into optimal approaches for different network architectures. In this presentation, we will describe our experience in porting legacy pseudospectral models, MoSST and DYNAMO, to use SpF as well as present preliminary performance results provided by the improved scalability.
An analysis of scatter decomposition

NASA Technical Reports Server (NTRS)

Nicol, David M.; Saltz, Joel H.

1990-01-01

A formal analysis of a powerful mapping technique known as scatter decomposition is presented. Scatter decomposition divides an irregular computational domain into a large number of equal sized pieces, and distributes them modularly among processors. A probabilistic model of workload in one dimension is used to formally explain why, and when scatter decomposition works. The first result is that if correlation in workload is a convex function of distance, then scattering a more finely decomposed domain yields a lower average processor workload variance. The second result shows that if the workload process is stationary Gaussian and the correlation function decreases linearly in distance until becoming zero and then remains zero, scattering a more finely decomposed domain yields a lower expected maximum processor workload. Finally it is shown that if the correlation function decreases linearly across the entire domain, then among all mappings that assign an equal number of domain pieces to each processor, scatter decomposition minimizes the average processor workload variance. The dependence of these results on the assumption of decreasing correlation is illustrated with situations where a coarser granularity actually achieves better load balance.
Parallel 3D Multi-Stage Simulation of a Turbofan Engine

NASA Technical Reports Server (NTRS)

Turner, Mark G.; Topp, David A.

1998-01-01

A 3D multistage simulation of each component of a modern GE Turbofan engine has been made. An axisymmetric view of this engine is presented in the document. This includes a fan, booster rig, high pressure compressor rig, high pressure turbine rig and a low pressure turbine rig. In the near future, all components will be run in a single calculation for a solution of 49 blade rows. The simulation exploits the use of parallel computations by using two levels of parallelism. Each blade row is run in parallel and each blade row grid is decomposed into several domains and run in parallel. 20 processors are used for the 4 blade row analysis. The average passage approach developed by John Adamczyk at NASA Lewis Research Center has been further developed and parallelized. This is APNASA Version A. It is a Navier-Stokes solver using a 4-stage explicit Runge-Kutta time marching scheme with variable time steps and residual smoothing for convergence acceleration. It has an implicit K-E turbulence model which uses an ADI solver to factor the matrix. Between 50 and 100 explicit time steps are solved before a blade row body force is calculated and exchanged with the other blade rows. This outer iteration has been coined a "flip." Efforts have been made to make the solver linearly scaleable with the number of blade rows. Enough flips are run (between 50 and 200) so the solution in the entire machine is not changing. The K-E equations are generally solved every other explicit time step. One of the key requirements in the development of the parallel code was to make the parallel solution exactly (bit for bit) match the serial solution. This has helped isolate many small parallel bugs and guarantee the parallelization was done correctly. The domain decomposition is done only in the axial direction since the number of points axially is much larger than the other two directions. This code uses MPI for message passing. The parallel speed up of the solver portion (no 1/0 or body force calculation) for a grid which has 227 points axially.
Multidisciplinary Optimization Methods for Aircraft Preliminary Design

NASA Technical Reports Server (NTRS)

Kroo, Ilan; Altus, Steve; Braun, Robert; Gage, Peter; Sobieski, Ian

1994-01-01

This paper describes a research program aimed at improved methods for multidisciplinary design and optimization of large-scale aeronautical systems. The research involves new approaches to system decomposition, interdisciplinary communication, and methods of exploiting coarse-grained parallelism for analysis and optimization. A new architecture, that involves a tight coupling between optimization and analysis, is intended to improve efficiency while simplifying the structure of multidisciplinary, computation-intensive design problems involving many analysis disciplines and perhaps hundreds of design variables. Work in two areas is described here: system decomposition using compatibility constraints to simplify the analysis structure and take advantage of coarse-grained parallelism; and collaborative optimization, a decomposition of the optimization process to permit parallel design and to simplify interdisciplinary communication requirements.
Morphological Simulation of Phase Separation Coupled Oscillation Shear and Varying Temperature Fields

NASA Astrophysics Data System (ADS)

Wang, Heping; Li, Xiaoguang; Lin, Kejun; Geng, Xingguo

2018-05-01

This paper explores the effect of the shear frequency and Prandtl number ( Pr) on the procedure and pattern formation of phase separation in symmetric and asymmetric systems. For the symmetric system, the periodic shear significantly prolongs the spinodal decomposition stage and enlarges the separated domain in domain growth stage. By adjusting the Pr and shear frequency, the number and orientation of separated steady layer structures can be controlled during domain stretch stage. The numerical results indicate that the increase in Pr and decrease in the shear frequency can significantly increase in the layer number of the lamellar structure, which relates to the decrease in domain size. Furthermore, the lamellar orientation parallel to the shear direction is altered into that perpendicular to the shear direction by further increasing the shear frequency, and also similar results for larger systems. For asymmetric system, the quantitative analysis shows that the decrease in the shear frequency enlarges the size of separated minority phases. These numerical results provide guidance for setting the optimum condition for the phase separation under periodic shear and slow cooling.
OSIRIS - an object-oriented parallel 3D PIC code for modeling laser and particle beam-plasma interaction

NASA Astrophysics Data System (ADS)

Hemker, Roy

1999-11-01

The advances in computational speed make it now possible to do full 3D PIC simulations of laser plasma and beam plasma interactions, but at the same time the increased complexity of these problems makes it necessary to apply modern approaches like object oriented programming to the development of simulation codes. We report here on our progress in developing an object oriented parallel 3D PIC code using Fortran 90. In its current state the code contains algorithms for 1D, 2D, and 3D simulations in cartesian coordinates and for 2D cylindrically-symmetric geometry. For all of these algorithms the code allows for a moving simulation window and arbitrary domain decomposition for any number of dimensions. Recent 3D simulation results on the propagation of intense laser and electron beams through plasmas will be presented.
High-performance parallel analysis of coupled problems for aircraft propulsion

NASA Technical Reports Server (NTRS)

Felippa, C. A.; Farhat, C.; Lanteri, S.; Maman, N.; Piperno, S.; Gumaste, U.

1994-01-01

This research program deals with the application of high-performance computing methods for the analysis of complete jet engines. We have entitled this program by applying the two dimensional parallel aeroelastic codes to the interior gas flow problem of a bypass jet engine. The fluid mesh generation, domain decomposition, and solution capabilities were successfully tested. We then focused attention on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion that results from these structural displacements. This is treated by a new arbitrary Lagrangian-Eulerian (ALE) technique that models the fluid mesh motion as that of a fictitious mass-spring network. New partitioned analysis procedures to treat this coupled three-component problem are developed. These procedures involved delayed corrections and subcycling. Preliminary results on the stability, accuracy, and MPP computational efficiency are reported.
Sub-domain decomposition methods and computational controls for multibody dynamical systems. [of spacecraft structures

NASA Technical Reports Server (NTRS)

Menon, R. G.; Kurdila, A. J.

1992-01-01

This paper presents a concurrent methodology to simulate the dynamics of flexible multibody systems with a large number of degrees of freedom. A general class of open-loop structures is treated and a redundant coordinate formulation is adopted. A range space method is used in which the constraint forces are calculated using a preconditioned conjugate gradient method. By using a preconditioner motivated by the regular ordering of the directed graph of the structures, it is shown that the method is order N in the total number of coordinates of the system. The overall formulation has the advantage that it permits fine parallelization and does not rely on system topology to induce concurrency. It can be efficiently implemented on the present generation of parallel computers with a large number of processors. Validation of the method is presented via numerical simulations of space structures incorporating large number of flexible degrees of freedom.
Simulation of dilute polymeric fluids in a three-dimensional contraction using a multiscale FENE model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Griebel, M., E-mail: griebel@ins.uni-bonn.de, E-mail: ruettgers@ins.uni-bonn.de; Rüttgers, A., E-mail: griebel@ins.uni-bonn.de, E-mail: ruettgers@ins.uni-bonn.de

The multiscale FENE model is applied to a 3D square-square contraction flow problem. For this purpose, the stochastic Brownian configuration field method (BCF) has been coupled with our fully parallelized three-dimensional Navier-Stokes solver NaSt3DGPF. The robustness of the BCF method enables the numerical simulation of high Deborah number flows for which most macroscopic methods suffer from stability issues. The results of our simulations are compared with that of experimental measurements from literature and show a very good agreement. In particular, flow phenomena such as a strong vortex enhancement, streamline divergence and a flow inversion for highly elastic flows are reproduced.more » Due to their computational complexity, our simulations require massively parallel computations. Using a domain decomposition approach with MPI, the implementation achieves excellent scale-up results for up to 128 processors.« less
A Spectral Element Ocean Model on the Cray T3D: the interannual variability of the Mediterranean Sea general circulation

NASA Astrophysics Data System (ADS)

Molcard, A. J.; Pinardi, N.; Ansaloni, R.

A new numerical model, SEOM (Spectral Element Ocean Model, (Iskandarani et al, 1994)), has been implemented in the Mediterranean Sea. Spectral element methods combine the geometric flexibility of finite element techniques with the rapid convergence rate of spectral schemes. The current version solves the shallow water equations with a fifth (or sixth) order accuracy spectral scheme and about 50.000 nodes. The domain decomposition philosophy makes it possible to exploit the power of parallel machines. The original MIMD master/slave version of SEOM, written in F90 and PVM, has been ported to the Cray T3D. When critical for performance, Cray specific high-performance one-sided communication routines (SHMEM) have been adopted to fully exploit the Cray T3D interprocessor network. Tests performed with highly unstructured and irregular grid, on up to 128 processors, show an almost linear scalability even with unoptimized domain decomposition techniques. Results from various case studies on the Mediterranean Sea are shown, involving realistic coastline geometry, and monthly mean 1000mb winds from the ECMWF's atmospheric model operational analysis from the period January 1987 to December 1994. The simulation results show that variability in the wind forcing considerably affect the circulation dynamics of the Mediterranean Sea.
Computation of an Underexpanded 3-D Rectangular Jet by the CE/SE Method

NASA Technical Reports Server (NTRS)

Loh, Ching Y.; Himansu, Ananda; Wang, Xiao Y.; Jorgenson, Philip C. E.

2000-01-01

Recently, an unstructured three-dimensional space-time conservation element and solution element (CE/SE) Euler solver was developed. Now it is also developed for parallel computation using METIS for domain decomposition and MPI (message passing interface). The method is employed here to numerically study the near-field of a typical 3-D rectangular under-expanded jet. For the computed case-a jet with Mach number Mj = 1.6. with a very modest grid of 1.7 million tetrahedrons, the flow features such as the shock-cell structures and the axis switching, are in good qualitative agreement with experimental results.
Survey of the status of finite element methods for partial differential equations

NASA Technical Reports Server (NTRS)

Temam, Roger

1986-01-01

The finite element methods (FEM) have proved to be a powerful technique for the solution of boundary value problems associated with partial differential equations of either elliptic, parabolic, or hyperbolic type. They also have a good potential for utilization on parallel computers particularly in relation to the concept of domain decomposition. This report is intended as an introduction to the FEM for the nonspecialist. It contains a survey which is totally nonexhaustive, and it also contains as an illustration, a report on some new results concerning two specific applications, namely a free boundary fluid-structure interaction problem and the Euler equations for inviscid flows.
PFLOTRAN: Reactive Flow & Transport Code for Use on Laptops to Leadership-Class Supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hammond, Glenn E.; Lichtner, Peter C.; Lu, Chuan

PFLOTRAN, a next-generation reactive flow and transport code for modeling subsurface processes, has been designed from the ground up to run efficiently on machines ranging from leadership-class supercomputers to laptops. Based on an object-oriented design, the code is easily extensible to incorporate additional processes. It can interface seamlessly with Fortran 9X, C and C++ codes. Domain decomposition parallelism is employed, with the PETSc parallel framework used to manage parallel solvers, data structures and communication. Features of the code include a modular input file, implementation of high-performance I/O using parallel HDF5, ability to perform multiple realization simulations with multiple processors permore » realization in a seamless manner, and multiple modes for multiphase flow and multicomponent geochemical transport. Chemical reactions currently implemented in the code include homogeneous aqueous complexing reactions and heterogeneous mineral precipitation/dissolution, ion exchange, surface complexation and a multirate kinetic sorption model. PFLOTRAN has demonstrated petascale performance using 2{sup 17} processor cores with over 2 billion degrees of freedom. Accomplishments achieved to date include applications to the Hanford 300 Area and modeling CO{sub 2} sequestration in deep geologic formations.« less
Acceleration of aircraft-level Traffic Flow Management

NASA Astrophysics Data System (ADS)

Rios, Joseph Lucio

This dissertation describes novel approaches to solving large-scale, high fidelity, aircraft-level Traffic Flow Management scheduling problems. Depending on the methods employed, solving these problems to optimality can take longer than the length of the planning horizon in question. Research in this domain typically focuses on the quality of the modeling used to describe the problem and the benefits achieved from the optimized solution, often treating computational aspects as secondary or tertiary. The work presented here takes the complementary view and considers the computational aspect as the primary concern. To this end, a previously published model for solving this Traffic Flow Management scheduling problem is used as starting point for this study. The model proposed by Bertsimas and Stock-Patterson is a binary integer program taking into account all major resource capacities and the trajectories of each flight to decide which flights should be held in which resource for what amount of time in order to satisfy all capacity requirements. For large instances, the solve time using state-of-the-art solvers is prohibitive for use within a potential decision support tool. With this dissertation, however, it will be shown that solving can be achieved in reasonable time for instances of real-world size. Five other techniques developed and tested for this dissertation will be described in detail. These are heuristic methods that provide good results. Performance is measured in terms of runtime and "optimality gap." We then describe the most successful method presented in this dissertation: Dantzig-Wolfe Decomposition. Results indicate that a parallel implementation of Dantzig-Wolfe Decomposition optimally solves the original problem in much reduced time and with better integrality and smaller optimality gap than any of the heuristic methods or state-of-the-art, commercial solvers. The solution quality improves in every measureable way as the number of subproblems solved in parallel increases. A maximal decomposition provides the best results of any method tested. The convergence qualities of Dantzig-Wolfe Decomposition have been criticized in the past, so we examine what makes the Bertsimas-Stock Patterson model so amenable to use of this method. These mathematical qualities of the model are generalized to provide guidance on other problems that may benefit from massively parallel Dantzig-Wolfe Decomposition. This result, together with the development of the software, and the experimental results indicating the feasibility of real-time, nationwide Traffic Flow Management scheduling represent the major contributions of this dissertation.
Quasi-disjoint pentadiagonal matrix systems for the parallelization of compact finite-difference schemes and filters

NASA Astrophysics Data System (ADS)

Kim, Jae Wook

2013-05-01

This paper proposes a novel systematic approach for the parallelization of pentadiagonal compact finite-difference schemes and filters based on domain decomposition. The proposed approach allows a pentadiagonal banded matrix system to be split into quasi-disjoint subsystems by using a linear-algebraic transformation technique. As a result the inversion of pentadiagonal matrices can be implemented within each subdomain in an independent manner subject to a conventional halo-exchange process. The proposed matrix transformation leads to new subdomain boundary (SB) compact schemes and filters that require three halo terms to exchange with neighboring subdomains. The internode communication overhead in the present approach is equivalent to that of standard explicit schemes and filters based on seven-point discretization stencils. The new SB compact schemes and filters demand additional arithmetic operations compared to the original serial ones. However, it is shown that the additional cost becomes sufficiently low by choosing optimal sizes of their discretization stencils. Compared to earlier published results, the proposed SB compact schemes and filters successfully reduce parallelization artifacts arising from subdomain boundaries to a level sufficiently negligible for sophisticated aeroacoustic simulations without degrading parallel efficiency. The overall performance and parallel efficiency of the proposed approach are demonstrated by stringent benchmark tests.
Cloud parallel processing of tandem mass spectrometry based proteomics data.

PubMed

Mohammed, Yassene; Mostovenko, Ekaterina; Henneman, Alex A; Marissen, Rob J; Deelder, André M; Palmblad, Magnus

2012-10-05

Data analysis in mass spectrometry based proteomics struggles to keep pace with the advances in instrumentation and the increasing rate of data acquisition. Analyzing this data involves multiple steps requiring diverse software, using different algorithms and data formats. Speed and performance of the mass spectral search engines are continuously improving, although not necessarily as needed to face the challenges of acquired big data. Improving and parallelizing the search algorithms is one possibility; data decomposition presents another, simpler strategy for introducing parallelism. We describe a general method for parallelizing identification of tandem mass spectra using data decomposition that keeps the search engine intact and wraps the parallelization around it. We introduce two algorithms for decomposing mzXML files and recomposing resulting pepXML files. This makes the approach applicable to different search engines, including those relying on sequence databases and those searching spectral libraries. We use cloud computing to deliver the computational power and scientific workflow engines to interface and automate the different processing steps. We show how to leverage these technologies to achieve faster data analysis in proteomics and present three scientific workflows for parallel database as well as spectral library search using our data decomposition programs, X!Tandem and SpectraST.
Three-dimensional Finite Element Formulation and Scalable Domain Decomposition for High Fidelity Rotor Dynamic Analysis

NASA Technical Reports Server (NTRS)

Datta, Anubhav; Johnson, Wayne R.

2009-01-01

This paper has two objectives. The first objective is to formulate a 3-dimensional Finite Element Model for the dynamic analysis of helicopter rotor blades. The second objective is to implement and analyze a dual-primal iterative substructuring based Krylov solver, that is parallel and scalable, for the solution of the 3-D FEM analysis. The numerical and parallel scalability of the solver is studied using two prototype problems - one for ideal hover (symmetric) and one for a transient forward flight (non-symmetric) - both carried out on up to 48 processors. In both hover and forward flight conditions, a perfect linear speed-up is observed, for a given problem size, up to the point of substructure optimality. Substructure optimality and the linear parallel speed-up range are both shown to depend on the problem size as well as on the selection of the coarse problem. With a larger problem size, linear speed-up is restored up to the new substructure optimality. The solver also scales with problem size - even though this conclusion is premature given the small prototype grids considered in this study.

On the Development of an Efficient Parallel Hybrid Solver with Application to Acoustically Treated Aero-Engine Nacelles

NASA Technical Reports Server (NTRS)

Watson, Willie R.; Nark, Douglas M.; Nguyen, Duc T.; Tungkahotara, Siroj

2006-01-01

A finite element solution to the convected Helmholtz equation in a nonuniform flow is used to model the noise field within 3-D acoustically treated aero-engine nacelles. Options to select linear or cubic Hermite polynomial basis functions and isoparametric elements are included. However, the key feature of the method is a domain decomposition procedure that is based upon the inter-mixing of an iterative and a direct solve strategy for solving the discrete finite element equations. This procedure is optimized to take full advantage of sparsity and exploit the increased memory and parallel processing capability of modern computer architectures. Example computations are presented for the Langley Flow Impedance Test facility and a rectangular mapping of a full scale, generic aero-engine nacelle. The accuracy and parallel performance of this new solver are tested on both model problems using a supercomputer that contains hundreds of central processing units. Results show that the method gives extremely accurate attenuation predictions, achieves super-linear speedup over hundreds of CPUs, and solves upward of 25 million complex equations in a quarter of an hour.
Massively Parallel Dantzig-Wolfe Decomposition Applied to Traffic Flow Scheduling

NASA Technical Reports Server (NTRS)

Rios, Joseph Lucio; Ross, Kevin

2009-01-01

Optimal scheduling of air traffic over the entire National Airspace System is a computationally difficult task. To speed computation, Dantzig-Wolfe decomposition is applied to a known linear integer programming approach for assigning delays to flights. The optimization model is proven to have the block-angular structure necessary for Dantzig-Wolfe decomposition. The subproblems for this decomposition are solved in parallel via independent computation threads. Experimental evidence suggests that as the number of subproblems/threads increases (and their respective sizes decrease), the solution quality, convergence, and runtime improve. A demonstration of this is provided by using one flight per subproblem, which is the finest possible decomposition. This results in thousands of subproblems and associated computation threads. This massively parallel approach is compared to one with few threads and to standard (non-decomposed) approaches in terms of solution quality and runtime. Since this method generally provides a non-integral (relaxed) solution to the original optimization problem, two heuristics are developed to generate an integral solution. Dantzig-Wolfe followed by these heuristics can provide a near-optimal (sometimes optimal) solution to the original problem hundreds of times faster than standard (non-decomposed) approaches. In addition, when massive decomposition is employed, the solution is shown to be more likely integral, which obviates the need for an integerization step. These results indicate that nationwide, real-time, high fidelity, optimal traffic flow scheduling is achievable for (at least) 3 hour planning horizons.
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

NASA Technical Reports Server (NTRS)

Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

1996-01-01

This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.
Domain decomposition for a mixed finite element method in three dimensions

USGS Publications Warehouse

Cai, Z.; Parashkevov, R.R.; Russell, T.F.; Wilson, J.D.; Ye, X.

2003-01-01

We consider the solution of the discrete linear system resulting from a mixed finite element discretization applied to a second-order elliptic boundary value problem in three dimensions. Based on a decomposition of the velocity space, these equations can be reduced to a discrete elliptic problem by eliminating the pressure through the use of substructures of the domain. The practicality of the reduction relies on a local basis, presented here, for the divergence-free subspace of the velocity space. We consider additive and multiplicative domain decomposition methods for solving the reduced elliptic problem, and their uniform convergence is established.
Real-world hydrologic assessment of a fully-distributed hydrological model in a parallel computing environment

NASA Astrophysics Data System (ADS)

Vivoni, Enrique R.; Mascaro, Giuseppe; Mniszewski, Susan; Fasel, Patricia; Springer, Everett P.; Ivanov, Valeriy Y.; Bras, Rafael L.

2011-10-01

SummaryA major challenge in the use of fully-distributed hydrologic models has been the lack of computational capabilities for high-resolution, long-term simulations in large river basins. In this study, we present the parallel model implementation and real-world hydrologic assessment of the Triangulated Irregular Network (TIN)-based Real-time Integrated Basin Simulator (tRIBS). Our parallelization approach is based on the decomposition of a complex watershed using the channel network as a directed graph. The resulting sub-basin partitioning divides effort among processors and handles hydrologic exchanges across boundaries. Through numerical experiments in a set of nested basins, we quantify parallel performance relative to serial runs for a range of processors, simulation complexities and lengths, and sub-basin partitioning methods, while accounting for inter-run variability on a parallel computing system. In contrast to serial simulations, the parallel model speed-up depends on the variability of hydrologic processes. Load balancing significantly improves parallel speed-up with proportionally faster runs as simulation complexity (domain resolution and channel network extent) increases. The best strategy for large river basins is to combine a balanced partitioning with an extended channel network, with potential savings through a lower TIN resolution. Based on these advances, a wider range of applications for fully-distributed hydrologic models are now possible. This is illustrated through a set of ensemble forecasts that account for precipitation uncertainty derived from a statistical downscaling model.
Improving NASA's Multiscale Modeling Framework for Tropical Cyclone Climate Study

NASA Technical Reports Server (NTRS)

Shen, Bo-Wen; Nelson, Bron; Cheung, Samson; Tao, Wei-Kuo

2013-01-01

One of the current challenges in tropical cyclone (TC) research is how to improve our understanding of TC interannual variability and the impact of climate change on TCs. Recent advances in global modeling, visualization, and supercomputing technologies at NASA show potential for such studies. In this article, the authors discuss recent scalability improvement to the multiscale modeling framework (MMF) that makes it feasible to perform long-term TC-resolving simulations. The MMF consists of the finite-volume general circulation model (fvGCM), supplemented by a copy of the Goddard cumulus ensemble model (GCE) at each of the fvGCM grid points, giving 13,104 GCE copies. The original fvGCM implementation has a 1D data decomposition; the revised MMF implementation retains the 1D decomposition for most of the code, but uses a 2D decomposition for the massive copies of GCEs. Because the vast majority of computation time in the MMF is spent computing the GCEs, this approach can achieve excellent speedup without incurring the cost of modifying the entire code. Intelligent process mapping allows differing numbers of processes to be assigned to each domain for load balancing. The revised parallel implementation shows highly promising scalability, obtaining a nearly 80-fold speedup by increasing the number of cores from 30 to 3,335.
International Symposium on Numerical Methods in Engineering, 5th, Ecole Polytechnique Federale de Lausanne, Switzerland, Sept. 11-15, 1989, Proceedings. Volumes 1 & 2

NASA Astrophysics Data System (ADS)

Gruber, Ralph; Periaux, Jaques; Shaw, Richard Paul

Recent advances in computational mechanics are discussed in reviews and reports. Topics addressed include spectral superpositions on finite elements for shear banding problems, strain-based finite plasticity, numerical simulation of hypersonic viscous continuum flow, constitutive laws in solid mechanics, dynamics problems, fracture mechanics and damage tolerance, composite plates and shells, contact and friction, metal forming and solidification, coupling problems, and adaptive FEMs. Consideration is given to chemical flows, convection problems, free boundaries and artificial boundary conditions, domain-decomposition and multigrid methods, combustion and thermal analysis, wave propagation, mixed and hybrid FEMs, integral-equation methods, optimization, software engineering, and vector and parallel computing.
On the use of flux limiters in the discrete ordinates method for 3D radiation calculations in absorbing and scattering media

NASA Astrophysics Data System (ADS)

Godoy, William F.; DesJardin, Paul E.

2010-05-01

The application of flux limiters to the discrete ordinates method (DOM), SN, for radiative transfer calculations is discussed and analyzed for 3D enclosures for cases in which the intensities are strongly coupled to each other such as: radiative equilibrium and scattering media. A Newton-Krylov iterative method (GMRES) solves the final systems of linear equations along with a domain decomposition strategy for parallel computation using message passing libraries in a distributed memory system. Ray effects due to angular discretization and errors due to domain decomposition are minimized until small variations are introduced by these effects in order to focus on the influence of flux limiters on errors due to spatial discretization, known as numerical diffusion, smearing or false scattering. Results are presented for the DOM-integrated quantities such as heat flux, irradiation and emission. A variety of flux limiters are compared to "exact" solutions available in the literature, such as the integral solution of the RTE for pure absorbing-emitting media and isotropic scattering cases and a Monte Carlo solution for a forward scattering case. Additionally, a non-homogeneous 3D enclosure is included to extend the use of flux limiters to more practical cases. The overall balance of convergence, accuracy, speed and stability using flux limiters is shown to be superior compared to step schemes for any test case.
A taxonomy and comparison of parallel block multi-level preconditioners for the incompressible Navier-Stokes equations.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Elman, Howard; Shuttleworth, Robert R.

2007-04-01

In recent years, considerable effort has been placed on developing efficient and robust solution algorithms for the incompressible Navier-Stokes equations based on preconditioned Krylov methods. These include physics-based methods, such as SIMPLE, and purely algebraic preconditioners based on the approximation of the Schur complement. All these techniques can be represented as approximate block factorization (ABF) type preconditioners. The goal is to decompose the application of the preconditioner into simplified sub-systems in which scalable multi-level type solvers can be applied. In this paper we develop a taxonomy of these ideas based on an adaptation of a generalized approximate factorization of themore » Navier-Stokes system first presented in [25]. This taxonomy illuminates the similarities and differences among these preconditioners and the central role played by efficient approximation of certain Schur complement operators. We then present a parallel computational study that examines the performance of these methods and compares them to an additive Schwarz domain decomposition (DD) algorithm. Results are presented for two and three-dimensional steady state problems for enclosed domains and inflow/outflow systems on both structured and unstructured meshes. The numerical experiments are performed using MPSalsa, a stabilized finite element code.« less
Aztec user`s guide. Version 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

1995-10-01

Aztec is an iterative library that greatly simplifies the parallelization process when solving the linear systems of equations Ax = b where A is a user supplied n x n sparse matrix, b is a user supplied vector of length n and x is a vector of length n to be computed. Aztec is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems which require an efficiently utilized parallel processing system. A collection of data transformation tools are provided that allow for easy creation of distributed sparsemore » unstructured matrices for parallel solution. Once the distributed matrix is created, computation can be performed on any of the parallel machines running Aztec: nCUBE 2, IBM SP2 and Intel Paragon, MPI platforms as well as standard serial and vector platforms. Aztec includes a number of Krylov iterative methods such as conjugate gradient (CG), generalized minimum residual (GMRES) and stabilized biconjugate gradient (BICGSTAB) to solve systems of equations. These Krylov methods are used in conjunction with various preconditioners such as polynomial or domain decomposition methods using LU or incomplete LU factorizations within subdomains. Although the matrix A can be general, the package has been designed for matrices arising from the approximation of partial differential equations (PDEs). In particular, the Aztec package is oriented toward systems arising from PDE applications.« less
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response

NASA Technical Reports Server (NTRS)

Abdi, Frank

1996-01-01

A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
Multiphase three-dimensional direct numerical simulation of a rotating impeller with code Blue

NASA Astrophysics Data System (ADS)

Kahouadji, Lyes; Shin, Seungwon; Chergui, Jalel; Juric, Damir; Craster, Richard V.; Matar, Omar K.

2017-11-01

The flow driven by a rotating impeller inside an open fixed cylindrical cavity is simulated using code Blue, a solver for massively-parallel simulations of fully three-dimensional multiphase flows. The impeller is composed of four blades at a 45° inclination all attached to a central hub and tube stem. In Blue, solid forms are constructed through the definition of immersed objects via a distance function that accounts for the object's interaction with the flow for both single and two-phase flows. We use a moving frame technique for imposing translation and/or rotation. The variation of the Reynolds number, the clearance, and the tank aspect ratio are considered, and we highlight the importance of the confinement ratio (blade radius versus the tank radius) in the mixing process. Blue uses a domain decomposition strategy for parallelization with MPI. The fluid interface solver is based on a parallel implementation of a hybrid front-tracking/level-set method designed complex interfacial topological changes. Parallel GMRES and multigrid iterative solvers are applied to the linear systems arising from the implicit solution for the fluid velocities and pressure in the presence of strong density and viscosity discontinuities across fluid phases. EPSRC, UK, MEMPHIS program Grant (EP/K003976/1), RAEng Research Chair (OKM).
The Forest Method as a New Parallel Tree Method with the Sectional Voronoi Tessellation

NASA Astrophysics Data System (ADS)

Yahagi, Hideki; Mori, Masao; Yoshii, Yuzuru

1999-09-01

We have developed a new parallel tree method which will be called the forest method hereafter. This new method uses the sectional Voronoi tessellation (SVT) for the domain decomposition. The SVT decomposes a whole space into polyhedra and allows their flat borders to move by assigning different weights. The forest method determines these weights based on the load balancing among processors by means of the overload diffusion (OLD). Moreover, since all the borders are flat, before receiving the data from other processors, each processor can collect enough data to calculate the gravity force with precision. Both the SVT and the OLD are coded in a highly vectorizable manner to accommodate on vector parallel processors. The parallel code based on the forest method with the Message Passing Interface is run on various platforms so that a wide portability is guaranteed. Extensive calculations with 15 processors of Fujitsu VPP300/16R indicate that the code can calculate the gravity force exerted on 105 particles in each second for some ideal dark halo. This code is found to enable an N-body simulation with 107 or more particles for a wide dynamic range and is therefore a very powerful tool for the study of galaxy formation and large-scale structure in the universe.
Real-time trajectory optimization on parallel processors

NASA Technical Reports Server (NTRS)

Psiaki, Mark L.

1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
How to use MPI communication in highly parallel climate simulations more easily and more efficiently.

NASA Astrophysics Data System (ADS)

Behrens, Jörg; Hanke, Moritz; Jahns, Thomas

2014-05-01

In this talk we present a way to facilitate efficient use of MPI communication for developers of climate models. Exploitation of the performance potential of today's highly parallel supercomputers with real world simulations is a complex task. This is partly caused by the low level nature of the MPI communication library which is the dominant communication tool at least for inter-node communication. In order to manage the complexity of the task, climate simulations with non-trivial communication patterns often use an internal abstraction layer above MPI without exploiting the benefits of communication aggregation or MPI-datatypes. The solution for the complexity and performance problem we propose is the communication library YAXT. This library is built on top of MPI and takes high level descriptions of arbitrary domain decompositions and automatically derives an efficient collective data exchange. Several exchanges can be aggregated in order to reduce latency costs. Examples are given which demonstrate the simplicity and the performance gains for selected climate applications.
PFLOTRAN User Manual: A Massively Parallel Reactive Flow and Transport Model for Describing Surface and Subsurface Processes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lichtner, Peter C.; Hammond, Glenn E.; Lu, Chuan

PFLOTRAN solves a system of generally nonlinear partial differential equations describing multi-phase, multicomponent and multiscale reactive flow and transport in porous materials. The code is designed to run on massively parallel computing architectures as well as workstations and laptops (e.g. Hammond et al., 2011). Parallelization is achieved through domain decomposition using the PETSc (Portable Extensible Toolkit for Scientific Computation) libraries for the parallelization framework (Balay et al., 1997). PFLOTRAN has been developed from the ground up for parallel scalability and has been run on up to 218 processor cores with problem sizes up to 2 billion degrees of freedom. Writtenmore » in object oriented Fortran 90, the code requires the latest compilers compatible with Fortran 2003. At the time of this writing this requires gcc 4.7.x, Intel 12.1.x and PGC compilers. As a requirement of running problems with a large number of degrees of freedom, PFLOTRAN allows reading input data that is too large to fit into memory allotted to a single processor core. The current limitation to the problem size PFLOTRAN can handle is the limitation of the HDF5 file format used for parallel IO to 32 bit integers. Noting that 2 32 = 4; 294; 967; 296, this gives an estimate of the maximum problem size that can be currently run with PFLOTRAN. Hopefully this limitation will be remedied in the near future.« less
A multilevel preconditioner for domain decomposition boundary systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bramble, J.H.; Pasciak, J.E.; Xu, Jinchao.

1991-12-11

In this note, we consider multilevel preconditioning of the reduced boundary systems which arise in non-overlapping domain decomposition methods. It will be shown that the resulting preconditioned systems have condition numbers which be bounded in the case of multilevel spaces on the whole domain and grow at most proportional to the number of levels in the case of multilevel boundary spaces without multilevel extensions into the interior.
FoSSI: the family of simplified solver interfaces for the rapid development of parallel numerical atmosphere and ocean models

NASA Astrophysics Data System (ADS)

Frickenhaus, Stephan; Hiller, Wolfgang; Best, Meike

The portable software FoSSI is introduced that—in combination with additional free solver software packages—allows for an efficient and scalable parallel solution of large sparse linear equations systems arising in finite element model codes. FoSSI is intended to support rapid model code development, completely hiding the complexity of the underlying solver packages. In particular, the model developer need not be an expert in parallelization and is yet free to switch between different solver packages by simple modifications of the interface call. FoSSI offers an efficient and easy, yet flexible interface to several parallel solvers, most of them available on the web, such as PETSC, AZTEC, MUMPS, PILUT and HYPRE. FoSSI makes use of the concept of handles for vectors, matrices, preconditioners and solvers, that is frequently used in solver libraries. Hence, FoSSI allows for a flexible treatment of several linear equations systems and associated preconditioners at the same time, even in parallel on separate MPI-communicators. The second special feature in FoSSI is the task specifier, being a combination of keywords, each configuring a certain phase in the solver setup. This enables the user to control a solver over one unique subroutine. Furthermore, FoSSI has rather similar features for all solvers, making a fast solver intercomparison or exchange an easy task. FoSSI is a community software, proven in an adaptive 2D-atmosphere model and a 3D-primitive equation ocean model, both formulated in finite elements. The present paper discusses perspectives of an OpenMP-implementation of parallel iterative solvers based on domain decomposition methods. This approach to OpenMP solvers is rather attractive, as the code for domain-local operations of factorization, preconditioning and matrix-vector product can be readily taken from a sequential implementation that is also suitable to be used in an MPI-variant. Code development in this direction is in an advanced state under the name ScOPES: the Scalable Open Parallel sparse linear Equations Solver.
Project APhiD: A Lorenz-gauged A-Φ decomposition for parallelized computation of ultra-broadband electromagnetic induction in a fully heterogeneous Earth

NASA Astrophysics Data System (ADS)

Weiss, Chester J.

2013-08-01

An essential element for computational hypothesis testing, data inversion and experiment design for electromagnetic geophysics is a robust forward solver, capable of easily and quickly evaluating the electromagnetic response of arbitrary geologic structure. The usefulness of such a solver hinges on the balance among competing desires like ease of use, speed of forward calculation, scalability to large problems or compute clusters, parsimonious use of memory access, accuracy and by necessity, the ability to faithfully accommodate a broad range of geologic scenarios over extremes in length scale and frequency content. This is indeed a tall order. The present study addresses recent progress toward the development of a forward solver with these properties. Based on the Lorenz-gauged Helmholtz decomposition, a new finite volume solution over Cartesian model domains endowed with complex-valued electrical properties is shown to be stable over the frequency range 10-2-1010 Hz and range 10-3-105 m in length scale. Benchmark examples are drawn from magnetotellurics, exploration geophysics, geotechnical mapping and laboratory-scale analysis, showing excellent agreement with reference analytic solutions. Computational efficiency is achieved through use of a matrix-free implementation of the quasi-minimum-residual (QMR) iterative solver, which eliminates explicit storage of finite volume matrix elements in favor of "on the fly" computation as needed by the iterative Krylov sequence. Further efficiency is achieved through sparse coupling matrices between the vector and scalar potentials whose non-zero elements arise only in those parts of the model domain where the conductivity gradient is non-zero. Multi-thread parallelization in the QMR solver through OpenMP pragmas is used to reduce the computational cost of its most expensive step: the single matrix-vector product at each iteration. High-level MPI communicators farm independent processes to available compute nodes for simultaneous computation of multi-frequency or multi-transmitter responses.
SpF: Enabling Petascale Performance for Pseudospectral Dynamo Models

NASA Astrophysics Data System (ADS)

Jiang, W.; Clune, T.; Vriesema, J.; Gutmann, G.

2013-12-01

Pseudospectral (PS) methods possess a number of characteristics (e.g., efficiency, accuracy, natural boundary conditions) that are extremely desirable for dynamo models. Unfortunately, dynamo models based upon PS methods face a number of daunting challenges, which include exposing additional parallelism, leveraging hardware accelerators, exploiting hybrid parallelism, and improving the scalability of global memory transposes. Although these issues are a concern for most models, solutions for PS methods tend to require far more pervasive changes to underlying data and control structures. Further, improvements in performance in one model are difficult to transfer to other models, resulting in significant duplication of effort across the research community. We have developed an extensible software framework for pseudospectral methods called SpF that is intended to enable extreme scalability and optimal performance. High-level abstractions provided by SpF unburden applications of the responsibility of managing domain decomposition and load balance while reducing the changes in code required to adapt to new computing architectures. The key design concept in SpF is that each phase of the numerical calculation is partitioned into disjoint numerical 'kernels' that can be performed entirely in-processor. The granularity of domain-decomposition provided by SpF is only constrained by the data-locality requirements of these kernels. SpF builds on top of optimized vendor libraries for common numerical operations such as transforms, matrix solvers, etc., but can also be configured to use open source alternatives for portability. SpF includes several alternative schemes for global data redistribution and is expected to serve as an ideal testbed for further research into optimal approaches for different network architectures. In this presentation, we will describe the basic architecture of SpF as well as preliminary performance data and experience with adapting legacy dynamo codes. We will conclude with a discussion of planned extensions to SpF that will provide pseudospectral applications with additional flexibility with regard to time integration, linear solvers, and discretization in the radial direction.

Balancing Conflicting Requirements for Grid and Particle Decomposition in Continuum-Lagrangian Solvers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sitaraman, Hariswaran; Grout, Ray

2015-10-30

The load balancing strategies for hybrid solvers that involve grid based partial differential equation solution coupled with particle tracking are presented in this paper. A typical Message Passing Interface (MPI) based parallelization of grid based solves are done using a spatial domain decomposition while particle tracking is primarily done using either of the two techniques. One of the techniques is to distribute the particles to MPI ranks to whose grid they belong to while the other is to share the particles equally among all ranks, irrespective of their spatial location. The former technique provides spatial locality for field interpolation butmore » cannot assure load balance in terms of number of particles, which is achieved by the latter. The two techniques are compared for a case of particle tracking in a homogeneous isotropic turbulence box as well as a turbulent jet case. We performed a strong scaling study for more than 32,000 cores, which results in particle densities representative of anticipated exascale machines. The use of alternative implementations of MPI collectives and efficient load equalization strategies are studied to reduce data communication overheads.« less
Parallelization of a Fully-Distributed Hydrologic Model using Sub-basin Partitioning

NASA Astrophysics Data System (ADS)

Vivoni, E. R.; Mniszewski, S.; Fasel, P.; Springer, E.; Ivanov, V. Y.; Bras, R. L.

2005-12-01

A primary obstacle towards advances in watershed simulations has been the limited computational capacity available to most models. The growing trend of model complexity, data availability and physical representation has not been matched by adequate developments in computational efficiency. This situation has created a serious bottleneck which limits existing distributed hydrologic models to small domains and short simulations. In this study, we present novel developments in the parallelization of a fully-distributed hydrologic model. Our work is based on the TIN-based Real-time Integrated Basin Simulator (tRIBS), which provides continuous hydrologic simulation using a multiple resolution representation of complex terrain based on a triangulated irregular network (TIN). While the use of TINs reduces computational demand, the sequential version of the model is currently limited over large basins (>10,000 km2) and long simulation periods (>1 year). To address this, a parallel MPI-based version of the tRIBS model has been implemented and tested using high performance computing resources at Los Alamos National Laboratory. Our approach utilizes domain decomposition based on sub-basin partitioning of the watershed. A stream reach graph based on the channel network structure is used to guide the sub-basin partitioning. Individual sub-basins or sub-graphs of sub-basins are assigned to separate processors to carry out internal hydrologic computations (e.g. rainfall-runoff transformation). Routed streamflow from each sub-basin forms the major hydrologic data exchange along the stream reach graph. Individual sub-basins also share subsurface hydrologic fluxes across adjacent boundaries. We demonstrate how the sub-basin partitioning provides computational feasibility and efficiency for a set of test watersheds in northeastern Oklahoma. We compare the performance of the sequential and parallelized versions to highlight the efficiency gained as the number of processors increases. We also discuss how the coupled use of TINs and parallel processing can lead to feasible long-term simulations in regional watersheds while preserving basin properties at high-resolution.
Analyzing Tropical Waves Using the Parallel Ensemble Empirical Model Decomposition Method: Preliminary Results from Hurricane Sandy

NASA Technical Reports Server (NTRS)

Shen, Bo-Wen; Cheung, Samson; Li, Jui-Lin F.; Wu, Yu-ling

2013-01-01

In this study, we discuss the performance of the parallel ensemble empirical mode decomposition (EMD) in the analysis of tropical waves that are associated with tropical cyclone (TC) formation. To efficiently analyze high-resolution, global, multiple-dimensional data sets, we first implement multilevel parallelism into the ensemble EMD (EEMD) and obtain a parallel speedup of 720 using 200 eight-core processors. We then apply the parallel EEMD (PEEMD) to extract the intrinsic mode functions (IMFs) from preselected data sets that represent (1) idealized tropical waves and (2) large-scale environmental flows associated with Hurricane Sandy (2012). Results indicate that the PEEMD is efficient and effective in revealing the major wave characteristics of the data, such as wavelengths and periods, by sifting out the dominant (wave) components. This approach has a potential for hurricane climate study by examining the statistical relationship between tropical waves and TC formation.
Implementation of the force decomposition machine for molecular dynamics simulations.

PubMed

Borštnik, Urban; Miller, Benjamin T; Brooks, Bernard R; Janežič, Dušanka

2012-09-01

We present the design and implementation of the force decomposition machine (FDM), a cluster of personal computers (PCs) that is tailored to running molecular dynamics (MD) simulations using the distributed diagonal force decomposition (DDFD) parallelization method. The cluster interconnect architecture is optimized for the communication pattern of the DDFD method. Our implementation of the FDM relies on standard commodity components even for networking. Although the cluster is meant for DDFD MD simulations, it remains general enough for other parallel computations. An analysis of several MD simulation runs on both the FDM and a standard PC cluster demonstrates that the FDM's interconnect architecture provides a greater performance compared to a more general cluster interconnect. Copyright © 2012 Elsevier Inc. All rights reserved.
Statistical CT noise reduction with multiscale decomposition and penalized weighted least squares in the projection domain

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tang Shaojie; Tang Xiangyang; School of Automation, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi 710121

2012-09-15

Purposes: The suppression of noise in x-ray computed tomography (CT) imaging is of clinical relevance for diagnostic image quality and the potential for radiation dose saving. Toward this purpose, statistical noise reduction methods in either the image or projection domain have been proposed, which employ a multiscale decomposition to enhance the performance of noise suppression while maintaining image sharpness. Recognizing the advantages of noise suppression in the projection domain, the authors propose a projection domain multiscale penalized weighted least squares (PWLS) method, in which the angular sampling rate is explicitly taken into consideration to account for the possible variation ofmore » interview sampling rate in advanced clinical or preclinical applications. Methods: The projection domain multiscale PWLS method is derived by converting an isotropic diffusion partial differential equation in the image domain into the projection domain, wherein a multiscale decomposition is carried out. With adoption of the Markov random field or soft thresholding objective function, the projection domain multiscale PWLS method deals with noise at each scale. To compensate for the degradation in image sharpness caused by the projection domain multiscale PWLS method, an edge enhancement is carried out following the noise reduction. The performance of the proposed method is experimentally evaluated and verified using the projection data simulated by computer and acquired by a CT scanner. Results: The preliminary results show that the proposed projection domain multiscale PWLS method outperforms the projection domain single-scale PWLS method and the image domain multiscale anisotropic diffusion method in noise reduction. In addition, the proposed method can preserve image sharpness very well while the occurrence of 'salt-and-pepper' noise and mosaic artifacts can be avoided. Conclusions: Since the interview sampling rate is taken into account in the projection domain multiscale decomposition, the proposed method is anticipated to be useful in advanced clinical and preclinical applications where the interview sampling rate varies.« less
The domain interface method: a general-purpose non-intrusive technique for non-conforming domain decomposition problems

NASA Astrophysics Data System (ADS)

Cafiero, M.; Lloberas-Valls, O.; Cante, J.; Oliver, J.

2016-04-01

A domain decomposition technique is proposed which is capable of properly connecting arbitrary non-conforming interfaces. The strategy essentially consists in considering a fictitious zero-width interface between the non-matching meshes which is discretized using a Delaunay triangulation. Continuity is satisfied across domains through normal and tangential stresses provided by the discretized interface and inserted in the formulation in the form of Lagrange multipliers. The final structure of the global system of equations resembles the dual assembly of substructures where the Lagrange multipliers are employed to nullify the gap between domains. A new approach to handle floating subdomains is outlined which can be implemented without significantly altering the structure of standard industrial finite element codes. The effectiveness of the developed algorithm is demonstrated through a patch test example and a number of tests that highlight the accuracy of the methodology and independence of the results with respect to the framework parameters. Considering its high degree of flexibility and non-intrusive character, the proposed domain decomposition framework is regarded as an attractive alternative to other established techniques such as the mortar approach.
Fast decomposition of two ultrasound longitudinal waves in cancellous bone using a phase rotation parameter for bone quality assessment: Simulation study.

PubMed

Taki, Hirofumi; Nagatani, Yoshiki; Matsukawa, Mami; Kanai, Hiroshi; Izumi, Shin-Ichi

2017-10-01

Ultrasound signals that pass through cancellous bone may be considered to consist of two longitudinal waves, which are called fast and slow waves. Accurate decomposition of these fast and slow waves is considered to be highly beneficial in determination of the characteristics of cancellous bone. In the present study, a fast decomposition method using a wave transfer function with a phase rotation parameter was applied to received signals that have passed through bovine bone specimens with various bone volume to total volume (BV/TV) ratios in a simulation study, where the elastic finite-difference time-domain method is used and the ultrasound wave propagated parallel to the bone axes. The proposed method succeeded to decompose both fast and slow waves accurately; the normalized residual intensity was less than -19.5 dB when the specimen thickness ranged from 4 to 7 mm and the BV/TV value ranged from 0.144 to 0.226. There was a strong relationship between the phase rotation value and the BV/TV value. The ratio of the peak envelope amplitude of the decomposed fast wave to that of the slow wave increased monotonically with increasing BV/TV ratio, indicating the high performance of the proposed method in estimation of the BV/TV value in cancellous bone.
Parallel community climate model: Description and user`s guide

DOE Office of Scientific and Technical Information (OSTI.GOV)

Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain intomore » geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.« less
Image characterization by fractal descriptors in variational mode decomposition domain: Application to brain magnetic resonance

NASA Astrophysics Data System (ADS)

Lahmiri, Salim

2016-08-01

The main purpose of this work is to explore the usefulness of fractal descriptors estimated in multi-resolution domains to characterize biomedical digital image texture. In this regard, three multi-resolution techniques are considered: the well-known discrete wavelet transform (DWT) and the empirical mode decomposition (EMD), and; the newly introduced; variational mode decomposition mode (VMD). The original image is decomposed by the DWT, EMD, and VMD into different scales. Then, Fourier spectrum based fractal descriptors is estimated at specific scales and directions to characterize the image. The support vector machine (SVM) was used to perform supervised classification. The empirical study was applied to the problem of distinguishing between normal and abnormal brain magnetic resonance images (MRI) affected with Alzheimer disease (AD). Our results demonstrate that fractal descriptors estimated in VMD domain outperform those estimated in DWT and EMD domains; and also those directly estimated from the original image.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
Implementation of ADI: Schemes on MIMD parallel computers

NASA Technical Reports Server (NTRS)

Vanderwijngaart, Rob F.

1993-01-01

In order to simulate the effects of the impingement of hot exhaust jets of High Performance Aircraft on landing surfaces a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPU's) by having many off-the-shelf CPU's work independently on their own data, and exchange information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI-algorithm for the three dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPU's in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
Multi-Zone Liquid Thrust Chamber Performance Code with Domain Decomposition for Parallel Processing

NASA Technical Reports Server (NTRS)

Navaz, Homayun K.

2002-01-01

Computational Fluid Dynamics (CFD) has considerably evolved in the last decade. There are many computer programs that can perform computations on viscous internal or external flows with chemical reactions. CFD has become a commonly used tool in the design and analysis of gas turbines, ramjet combustors, turbo-machinery, inlet ducts, rocket engines, jet interaction, missile, and ramjet nozzles. One of the problems of interest to NASA has always been the performance prediction for rocket and air-breathing engines. Due to the complexity of flow in these engines it is necessary to resolve the flowfield into a fine mesh to capture quantities like turbulence and heat transfer. However, calculation on a high-resolution grid is associated with a prohibitively increasing computational time that can downgrade the value of the CFD for practical engineering calculations. The Liquid Thrust Chamber Performance (LTCP) code was developed for NASA/MSFC (Marshall Space Flight Center) to perform liquid rocket engine performance calculations. This code is a 2D/axisymmetric full Navier-Stokes (NS) solver with fully coupled finite rate chemistry and Eulerian treatment of liquid fuel and/or oxidizer droplets. One of the advantages of this code has been the resemblance of its input file to the JANNAF (Joint Army Navy NASA Air Force Interagency Propulsion Committee) standard TDK code, and its automatic grid generation for JANNAF defined combustion chamber wall geometry. These options minimize the learning effort for TDK users, and make the code a good candidate for performing engineering calculations. Although the LTCP code was developed for liquid rocket engines, it is a general-purpose code and has been used for solving many engineering problems. However, the single zone formulation of the LTCP has limited the code to be applicable to problems with complex geometry. Furthermore, the computational time becomes prohibitively large for high-resolution problems with chemistry, two-equation turbulence model, and two-phase flow. To overcome these limitations, the LTCP code is rewritten to include the multi-zone capability with domain decomposition that makes it suitable for parallel processing, i.e., enabling the code to run every zone or sub-domain on a separate processor. This can reduce the run time by a factor of 6 to 8, depending on the problem.
Dynamic Load Balancing Based on Constrained K-D Tree Decomposition for Parallel Particle Tracing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Jiang; Guo, Hanqi; Yuan, Xiaoru

Particle tracing is a fundamental technique in flow field data visualization. In this work, we present a novel dynamic load balancing method for parallel particle tracing. Specifically, we employ a constrained k-d tree decomposition approach to dynamically redistribute tasks among processes. Each process is initially assigned a regularly partitioned block along with duplicated ghost layer under the memory limit. During particle tracing, the k-d tree decomposition is dynamically performed by constraining the cutting planes in the overlap range of duplicated data. This ensures that each process is reassigned particles as even as possible, and on the other hand the newmore » assigned particles for a process always locate in its block. Result shows good load balance and high efficiency of our method.« less
Parallel pivoting combined with parallel reduction

NASA Technical Reports Server (NTRS)

Alaghband, Gita

1987-01-01

Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.
3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

1997-12-31

The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle.more » The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.« less
Impact of joint statistical dual-energy CT reconstruction of proton stopping power images: Comparison to image- and sinogram-domain material decomposition approaches.

PubMed

Zhang, Shuangyue; Han, Dong; Politte, David G; Williamson, Jeffrey F; O'Sullivan, Joseph A

2018-05-01

The purpose of this study was to assess the performance of a novel dual-energy CT (DECT) approach for proton stopping power ratio (SPR) mapping that integrates image reconstruction and material characterization using a joint statistical image reconstruction (JSIR) method based on a linear basis vector model (BVM). A systematic comparison between the JSIR-BVM method and previously described DECT image- and sinogram-domain decomposition approaches is also carried out on synthetic data. The JSIR-BVM method was implemented to estimate the electron densities and mean excitation energies (I-values) required by the Bethe equation for SPR mapping. In addition, image- and sinogram-domain DECT methods based on three available SPR models including BVM were implemented for comparison. The intrinsic SPR modeling accuracy of the three models was first validated. Synthetic DECT transmission sinograms of two 330 mm diameter phantoms each containing 17 soft and bony tissues (for a total of 34) of known composition were then generated with spectra of 90 and 140 kVp. The estimation accuracy of the reconstructed SPR images were evaluated for the seven investigated methods. The impact of phantom size and insert location on SPR estimation accuracy was also investigated. All three selected DECT-SPR models predict the SPR of all tissue types with less than 0.2% RMS errors under idealized conditions with no reconstruction uncertainties. When applied to synthetic sinograms, the JSIR-BVM method achieves the best performance with mean and RMS-average errors of less than 0.05% and 0.3%, respectively, for all noise levels, while the image- and sinogram-domain decomposition methods show increasing mean and RMS-average errors with increasing noise level. The JSIR-BVM method also reduces statistical SPR variation by sixfold compared to other methods. A 25% phantom diameter change causes up to 4% SPR differences for the image-domain decomposition approach, while the JSIR-BVM method and sinogram-domain decomposition methods are insensitive to size change. Among all the investigated methods, the JSIR-BVM method achieves the best performance for SPR estimation in our simulation phantom study. This novel method is robust with respect to sinogram noise and residual beam-hardening effects, yielding SPR estimation errors comparable to intrinsic BVM modeling error. In contrast, the achievable SPR estimation accuracy of the image- and sinogram-domain decomposition methods is dominated by the CT image intensity uncertainties introduced by the reconstruction and decomposition processes. © 2018 American Association of Physicists in Medicine.
About decomposition approach for solving the classification problem

NASA Astrophysics Data System (ADS)

Andrianova, A. A.

2016-11-01

This article describes the features of the application of an algorithm with using of decomposition methods for solving the binary classification problem of constructing a linear classifier based on Support Vector Machine method. Application of decomposition reduces the volume of calculations, in particular, due to the emerging possibilities to build parallel versions of the algorithm, which is a very important advantage for the solution of problems with big data. The analysis of the results of computational experiments conducted using the decomposition approach. The experiment use known data set for binary classification problem.
Micro-Macro Simulation of Viscoelastic Fluids in Three Dimensions

NASA Astrophysics Data System (ADS)

Rüttgers, Alexander; Griebel, Michael

2012-11-01

The development of the chemical industry resulted in various complex fluids that cannot be correctly described by classical fluid mechanics. For instance, this includes paint, engine oils with polymeric additives and toothpaste. We currently perform multiscale viscoelastic flow simulations for which we have coupled our three-dimensional Navier-Stokes solver NaSt3dGPF with the stochastic Brownian configuration field method on the micro-scale. In this method, we represent a viscoelastic fluid as a dumbbell system immersed in a three-dimensional Newtonian liquid which leads to a six-dimensional problem in space. The approach requires large computational resources and therefore depends on an efficient parallelisation strategy. Our flow solver is parallelised with a domain decomposition approach using MPI. It shows excellent scale-up results for up to 128 processors. In this talk, we present simulation results for viscoelastic fluids in square-square contractions due to their relevance for many engineering applications such as extrusion. Another aspect of the talk is the parallel implementation in NaSt3dGPF and the parallel scale-up and speed-up behaviour.
Scalable Nonlinear Solvers for Fully Implicit Coupled Nuclear Fuel Modeling. Final Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cai, Xiao-Chuan; Keyes, David; Yang, Chao

2014-09-29

The focus of the project is on the development and customization of some highly scalable domain decomposition based preconditioning techniques for the numerical solution of nonlinear, coupled systems of partial differential equations (PDEs) arising from nuclear fuel simulations. These high-order PDEs represent multiple interacting physical fields (for example, heat conduction, oxygen transport, solid deformation), each is modeled by a certain type of Cahn-Hilliard and/or Allen-Cahn equations. Most existing approaches involve a careful splitting of the fields and the use of field-by-field iterations to obtain a solution of the coupled problem. Such approaches have many advantages such as ease of implementationmore » since only single field solvers are needed, but also exhibit disadvantages. For example, certain nonlinear interactions between the fields may not be fully captured, and for unsteady problems, stable time integration schemes are difficult to design. In addition, when implemented on large scale parallel computers, the sequential nature of the field-by-field iterations substantially reduces the parallel efficiency. To overcome the disadvantages, fully coupled approaches have been investigated in order to obtain full physics simulations.« less
Parallel computing in enterprise modeling.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.

2008-08-01

This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priorimore » ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.« less

A Comparison of PETSC Library and HPF Implementations of an Archetypal PDE Computation

NASA Technical Reports Server (NTRS)

Hayder, M. Ehtesham; Keyes, David E.; Mehrotra, Piyush

1997-01-01

Two paradigms for distributed-memory parallel computation that free the application programmer from the details of message passing are compared for an archetypal structured scientific computation a nonlinear, structured-grid partial differential equation boundary value problem using the same algorithm on the same hardware. Both paradigms, parallel libraries represented by Argonne's PETSC, and parallel languages represented by the Portland Group's HPF, are found to be easy to use for this problem class, and both are reasonably effective in exploiting concurrency after a short learning curve. The level of involvement required by the application programmer under either paradigm includes specification of the data partitioning (corresponding to a geometrically simple decomposition of the domain of the PDE). Programming in SPAM style for the PETSC library requires writing the routines that discretize the PDE and its Jacobian, managing subdomain-to-processor mappings (affine global- to-local index mappings), and interfacing to library solver routines. Programming for HPF requires a complete sequential implementation of the same algorithm, introducing concurrency through subdomain blocking (an effort similar to the index mapping), and modest experimentation with rewriting loops to elucidate to the compiler the latent concurrency. Correctness and scalability are cross-validated on up to 32 nodes of an IBM SP2.
A Numerical Study of Scalable Cardiac Electro-Mechanical Solvers on HPC Architectures

PubMed Central

Colli Franzone, Piero; Pavarino, Luca F.; Scacchi, Simone

2018-01-01

We introduce and study some scalable domain decomposition preconditioners for cardiac electro-mechanical 3D simulations on parallel HPC (High Performance Computing) architectures. The electro-mechanical model of the cardiac tissue is composed of four coupled sub-models: (1) the static finite elasticity equations for the transversely isotropic deformation of the cardiac tissue; (2) the active tension model describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; (3) the anisotropic Bidomain model describing the evolution of the intra- and extra-cellular potentials in the deforming cardiac tissue; and (4) the ionic membrane model describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels. This strongly coupled electro-mechanical model is discretized in time with a splitting semi-implicit technique and in space with isoparametric finite elements. The resulting scalable parallel solver is based on Multilevel Additive Schwarz preconditioners for the solution of the Bidomain system and on BDDC preconditioned Newton-Krylov solvers for the non-linear finite elasticity system. The results of several 3D parallel simulations show the scalability of both linear and non-linear solvers and their application to the study of both physiological excitation-contraction cardiac dynamics and re-entrant waves in the presence of different mechano-electrical feedbacks. PMID:29674971
A new impedance accounting for short- and long-range effects in mixed substructured formulations of nonlinear problems

NASA Astrophysics Data System (ADS)

Negrello, Camille; Gosselet, Pierre; Rey, Christian

2018-05-01

An efficient method for solving large nonlinear problems combines Newton solvers and Domain Decomposition Methods (DDM). In the DDM framework, the boundary conditions can be chosen to be primal, dual or mixed. The mixed approach presents the advantage to be eligible for the research of an optimal interface parameter (often called impedance) which can increase the convergence rate. The optimal value for this parameter is often too expensive to be computed exactly in practice: an approximate version has to be sought for, along with a compromise between efficiency and computational cost. In the context of parallel algorithms for solving nonlinear structural mechanical problems, we propose a new heuristic for the impedance which combines short and long range effects at a low computational cost.
Finite element analysis in fluids; Proceedings of the Seventh International Conference on Finite Element Methods in Flow Problems, University of Alabama, Huntsville, Apr. 3-7, 1989

NASA Technical Reports Server (NTRS)

Chung, T. J. (Editor); Karr, Gerald R. (Editor)

1989-01-01

Recent advances in computational fluid dynamics are examined in reviews and reports, with an emphasis on finite-element methods. Sections are devoted to adaptive meshes, atmospheric dynamics, combustion, compressible flows, control-volume finite elements, crystal growth, domain decomposition, EM-field problems, FDM/FEM, and fluid-structure interactions. Consideration is given to free-boundary problems with heat transfer, free surface flow, geophysical flow problems, heat and mass transfer, high-speed flow, incompressible flow, inverse design methods, MHD problems, the mathematics of finite elements, and mesh generation. Also discussed are mixed finite elements, multigrid methods, non-Newtonian fluids, numerical dissipation, parallel vector processing, reservoir simulation, seepage, shallow-water problems, spectral methods, supercomputer architectures, three-dimensional problems, and turbulent flows.
Performance Improvements of the CYCOFOS Flow Model

NASA Astrophysics Data System (ADS)

Radhakrishnan, Hari; Moulitsas, Irene; Syrakos, Alexandros; Zodiatis, George; Nikolaides, Andreas; Hayes, Daniel; Georgiou, Georgios C.

2013-04-01

The CYCOFOS-Cyprus Coastal Ocean Forecasting and Observing System has been operational since early 2002, providing daily sea current, temperature, salinity and sea level forecasting data for the next 4 and 10 days to end-users in the Levantine Basin, necessary for operational application in marine safety, particularly concerning oil spills and floating objects predictions. CYCOFOS flow model, similar to most of the coastal and sub-regional operational hydrodynamic forecasting systems of the MONGOOS-Mediterranean Oceanographic Network for Global Ocean Observing System is based on the POM-Princeton Ocean Model. CYCOFOS is nested with the MyOcean Mediterranean regional forecasting data and with SKIRON and ECMWF for surface forcing. The increasing demand for higher and higher resolution data to meet coastal and offshore downstream applications motivated the parallelization of the CYCOFOS POM model. This development was carried out in the frame of the IPcycofos project, funded by the Cyprus Research Promotion Foundation. The parallel processing provides a viable solution to satisfy these demands without sacrificing accuracy or omitting any physical phenomena. Prior to IPcycofos project, there are been several attempts to parallelise the POM, as for example the MP-POM. The existing parallel code models rely on the use of specific outdated hardware architectures and associated software. The objective of the IPcycofos project is to produce an operational parallel version of the CYCOFOS POM code that can replicate the results of the serial version of the POM code used in CYCOFOS. The parallelization of the CYCOFOS POM model use Message Passing Interface-MPI, implemented on commodity computing clusters running open source software and not depending on any specialized vendor hardware. The parallel CYCOFOS POM code constructed in a modular fashion, allowing a fast re-locatable downscaled implementation. The MPI takes advantage of the Cartesian nature of the POM mesh, and use the built-in functionality of MPI routines to split the mesh, using a weighting scheme, along longitude and latitude among the processors. Each server processor work on the model based on domain decomposition techniques. The new parallel CYCOFOS POM code has been benchmarked against the serial POM version of CYCOFOS for speed, accuracy, and resolution and the results are more than satisfactory. With a higher resolution CYCOFOS Levantine model domain the forecasts need much less time than the serial CYCOFOS POM coarser version, both with identical accuracy.
Comparative study of the dynamics of lipid membrane phase decomposition in experiment and simulation.

PubMed

Burger, Stefan; Fraunholz, Thomas; Leirer, Christian; Hoppe, Ronald H W; Wixforth, Achim; Peter, Malte A; Franke, Thomas

2013-06-25

Phase decomposition in lipid membranes has been the subject of numerous investigations by both experiment and theoretical simulation, yet quantitative comparisons of the simulated data to the experimental results are rare. In this work, we present a novel way of comparing the temporal development of liquid-ordered domains obtained from numerically solving the Cahn-Hilliard equation and by inducing a phase transition in giant unilamellar vesicles (GUVs). Quantitative comparison is done by calculating the structure factor of the domain pattern. It turns out that the decomposition takes place in three distinct regimes in both experiment and simulation. These regimes are characterized by different rates of growth of the mean domain diameter, and there is quantitative agreement between experiment and simulation as to the duration of each regime and the absolute rate of growth in each regime.
Decomposition of Proteins into Dynamic Units from Atomic Cross-Correlation Functions.

PubMed

Calligari, Paolo; Gerolin, Marco; Abergel, Daniel; Polimeno, Antonino

2017-01-10

In this article, we present a clustering method of atoms in proteins based on the analysis of the correlation times of interatomic distance correlation functions computed from MD simulations. The goal is to provide a coarse-grained description of the protein in terms of fewer elements that can be treated as dynamically independent subunits. Importantly, this domain decomposition method does not take into account structural properties of the protein. Instead, the clustering of protein residues in terms of networks of dynamically correlated domains is defined on the basis of the effective correlation times of the pair distance correlation functions. For these properties, our method stands as a complementary analysis to the customary protein decomposition in terms of quasi-rigid, structure-based domains. Results obtained for a prototypal protein structure illustrate the approach proposed.
Effect of aging temperature on phase decomposition and mechanical properties in cast duplex stainless steels

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mburu, Sarah; Kolli, R. Prakash; Perea, Daniel E.

The microstructure and mechanical properties in unaged and thermally aged (at 280 °C, 320 °C, 360 °C, and 400 °C to 4300 h) CF–3 and CF–8 cast duplex stainless steels (CDSS) are investigated. The unaged CF–8 steel has Cr-rich M 23C 6 carbides located at the δ–ferrite/γ–austenite heterophase interfaces that were not observed in the CF–3 steel and this corresponds to a difference in mechanical properties. Both unaged steels exhibit incipient spinodal decomposition into Fe-rich α–domains and Cr-rich α’–domains. During aging, spinodal decomposition progresses and the mean wavelength (MW) and mean amplitude (MA) of the compositional fluctuations increase as amore » function of aging temperature. Additionally, G–phase precipitates form between the spinodal decomposition domains in CF–3 at 360 °C and 400 °C and in CF–8 at 400 °C. Finally, the microstructural evolution is correlated to changes in mechanical properties.« less
Effect of aging temperature on phase decomposition and mechanical properties in cast duplex stainless steels

DOE PAGES

Mburu, Sarah; Kolli, R. Prakash; Perea, Daniel E.; ...

2017-03-06

The microstructure and mechanical properties in unaged and thermally aged (at 280 °C, 320 °C, 360 °C, and 400 °C to 4300 h) CF–3 and CF–8 cast duplex stainless steels (CDSS) are investigated. The unaged CF–8 steel has Cr-rich M 23C 6 carbides located at the δ–ferrite/γ–austenite heterophase interfaces that were not observed in the CF–3 steel and this corresponds to a difference in mechanical properties. Both unaged steels exhibit incipient spinodal decomposition into Fe-rich α–domains and Cr-rich α’–domains. During aging, spinodal decomposition progresses and the mean wavelength (MW) and mean amplitude (MA) of the compositional fluctuations increase as amore » function of aging temperature. Additionally, G–phase precipitates form between the spinodal decomposition domains in CF–3 at 360 °C and 400 °C and in CF–8 at 400 °C. Finally, the microstructural evolution is correlated to changes in mechanical properties.« less
Effect of aging temperature on phase decomposition and mechanical properties in cast duplex stainless steels

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mburu, Sarah; Kolli, R. Prakash; Perea, Daniel E.

The microstructure and mechanical properties in unaged and thermally aged (at 280 oC, 320 oC, 360 oC, and 400 oC to 4300 h) CF–3 and CF–8 cast duplex stainless steels (CDSS) are investigated. The unaged CF–8 steel has Cr-rich M23C6 carbides located at the δ–ferrite/γ– austenite heterophase interfaces that were not observed in the CF–3 steel and this corresponds to a difference in mechanical properties. Both unaged steels exhibit incipient spinodal decomposition into Fe-rich α–domains and Cr-rich α’–domains. During aging, spinodal decomposition progresses and the mean wavelength (MW) and mean amplitude (MA) of the compositional fluctuations increase as a functionmore » of aging temperature. Additionally, G–phase precipitates form between the spinodal decomposition domains in CF–3 at 360 oC and 400 oC and in CF–8 at 400 oC. The microstructural evolution is correlated to changes in mechanical properties.« less
Rapid Transient Pressure Field Computations in the Nearfield of Circular Transducers using Frequency Domain Time-Space Decomposition

PubMed Central

Alles, E. J.; Zhu, Y.; van Dongen, K. W. A.; McGough, R. J.

2013-01-01

The fast nearfield method, when combined with time-space decomposition, is a rapid and accurate approach for calculating transient nearfield pressures generated by ultrasound transducers. However, the standard time-space decomposition approach is only applicable to certain analytical representations of the temporal transducer surface velocity that, when applied to the fast nearfield method, are expressed as a finite sum of products of separate temporal and spatial terms. To extend time-space decomposition such that accelerated transient field simulations are enabled in the nearfield for an arbitrary transducer surface velocity, a new transient simulation method, frequency domain time-space decomposition (FDTSD), is derived. With this method, the temporal transducer surface velocity is transformed into the frequency domain, and then each complex-valued term is processed separately. Further improvements are achieved by spectral clipping, which reduces the number of terms and the computation time. Trade-offs between speed and accuracy are established for FDTSD calculations, and pressure fields obtained with the FDTSD method for a circular transducer are compared to those obtained with Field II and the impulse response method. The FDTSD approach, when combined with the fast nearfield method and spectral clipping, consistently achieves smaller errors in less time and requires less memory than Field II or the impulse response method. PMID:23160476
Tensorial extensions of independent component analysis for multisubject FMRI analysis.

PubMed

Beckmann, C F; Smith, S M

2005-03-01

We discuss model-free analysis of multisubject or multisession FMRI data by extending the single-session probabilistic independent component analysis model (PICA; Beckmann and Smith, 2004. IEEE Trans. on Medical Imaging, 23 (2) 137-152) to higher dimensions. This results in a three-way decomposition that represents the different signals and artefacts present in the data in terms of their temporal, spatial, and subject-dependent variations. The technique is derived from and compared with parallel factor analysis (PARAFAC; Harshman and Lundy, 1984. In Research methods for multimode data analysis, chapter 5, pages 122-215. Praeger, New York). Using simulated data as well as data from multisession and multisubject FMRI studies we demonstrate that the tensor PICA approach is able to efficiently and accurately extract signals of interest in the spatial, temporal, and subject/session domain. The final decompositions improve upon PARAFAC results in terms of greater accuracy, reduced interference between the different estimated sources (reduced cross-talk), robustness (against deviations of the data from modeling assumptions and against overfitting), and computational speed. On real FMRI 'activation' data, the tensor PICA approach is able to extract plausible activation maps, time courses, and session/subject modes as well as provide a rich description of additional processes of interest such as image artefacts or secondary activation patterns. The resulting data decomposition gives simple and useful representations of multisubject/multisession FMRI data that can aid the interpretation and optimization of group FMRI studies beyond what can be achieved using model-based analysis techniques.
PARALLEL HOP: A SCALABLE HALO FINDER FOR MASSIVE COSMOLOGICAL DATA SETS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Skory, Stephen; Turk, Matthew J.; Norman, Michael L.

2010-11-15

Modern N-body cosmological simulations contain billions (10{sup 9}) of dark matter particles. These simulations require hundreds to thousands of gigabytes of memory and employ hundreds to tens of thousands of processing cores on many compute nodes. In order to study the distribution of dark matter in a cosmological simulation, the dark matter halos must be identified using a halo finder, which establishes the halo membership of every particle in the simulation. The resources required for halo finding are similar to the requirements for the simulation itself. In particular, simulations have become too extensive to use commonly employed halo finders, suchmore » that the computational requirements to identify halos must now be spread across multiple nodes and cores. Here, we present a scalable-parallel halo finding method called Parallel HOP for large-scale cosmological simulation data. Based on the halo finder HOP, it utilizes message passing interface and domain decomposition to distribute the halo finding workload across multiple compute nodes, enabling analysis of much larger data sets than is possible with the strictly serial or previous parallel implementations of HOP. We provide a reference implementation of this method as a part of the toolkit {sup yt}, an analysis toolkit for adaptive mesh refinement data that include complementary analysis modules. Additionally, we discuss a suite of benchmarks that demonstrate that this method scales well up to several hundred tasks and data sets in excess of 2000{sup 3} particles. The Parallel HOP method and our implementation can be readily applied to any kind of N-body simulation data and is therefore widely applicable.« less
A Simple Application of Compressed Sensing to Further Accelerate Partially Parallel Imaging

PubMed Central

Miao, Jun; Guo, Weihong; Narayan, Sreenath; Wilson, David L.

2012-01-01

Compressed Sensing (CS) and partially parallel imaging (PPI) enable fast MR imaging by reducing the amount of k-space data required for reconstruction. Past attempts to combine these two have been limited by the incoherent sampling requirement of CS, since PPI routines typically sample on a regular (coherent) grid. Here, we developed a new method, “CS+GRAPPA,” to overcome this limitation. We decomposed sets of equidistant samples into multiple random subsets. Then, we reconstructed each subset using CS, and averaging the results to get a final CS k-space reconstruction. We used both a standard CS, and an edge and joint-sparsity guided CS reconstruction. We tested these intermediate results on both synthetic and real MR phantom data, and performed a human observer experiment to determine the effectiveness of decomposition, and to optimize the number of subsets. We then used these CS reconstructions to calibrate the GRAPPA complex coil weights. In vivo parallel MR brain and heart data sets were used. An objective image quality evaluation metric, Case-PDM, was used to quantify image quality. Coherent aliasing and noise artifacts were significantly reduced using two decompositions. More decompositions further reduced coherent aliasing and noise artifacts but introduced blurring. However, the blurring was effectively minimized using our new edge and joint-sparsity guided CS using two decompositions. Numerical results on parallel data demonstrated that the combined method greatly improved image quality as compared to standard GRAPPA, on average halving Case-PDM scores across a range of sampling rates. The proposed technique allowed the same Case-PDM scores as standard GRAPPA, using about half the number of samples. We conclude that the new method augments GRAPPA by combining it with CS, allowing CS to work even when the k-space sampling pattern is equidistant. PMID:22902065
An operational modal analysis method in frequency and spatial domain

NASA Astrophysics Data System (ADS)

Wang, Tong; Zhang, Lingmi; Tamura, Yukio

2005-12-01

A frequency and spatial domain decomposition method (FSDD) for operational modal analysis (OMA) is presented in this paper, which is an extension of the complex mode indicator function (CMIF) method for experimental modal analysis (EMA). The theoretical background of the FSDD method is clarified. Singular value decomposition is adopted to separate the signal space from the noise space. Finally, an enhanced power spectrum density (PSD) is proposed to obtain more accurate modal parameters by curve fitting in the frequency domain. Moreover, a simulation case and an application case are used to validate this method.
An Efficient Multiscale Finite-Element Method for Frequency-Domain Seismic Wave Propagation

DOE PAGES

Gao, Kai; Fu, Shubin; Chung, Eric T.

2018-02-13

The frequency-domain seismic-wave equation, that is, the Helmholtz equation, has many important applications in seismological studies, yet is very challenging to solve, particularly for large geological models. Iterative solvers, domain decomposition, or parallel strategies can partially alleviate the computational burden, but these approaches may still encounter nontrivial difficulties in complex geological models where a sufficiently fine mesh is required to represent the fine-scale heterogeneities. We develop a novel numerical method to solve the frequency-domain acoustic wave equation on the basis of the multiscale finite-element theory. We discretize a heterogeneous model with a coarse mesh and employ carefully constructed high-order multiscalemore » basis functions to form the basis space for the coarse mesh. Solved from medium- and frequency-dependent local problems, these multiscale basis functions can effectively capture themedium’s fine-scale heterogeneity and the source’s frequency information, leading to a discrete system matrix with a much smaller dimension compared with those from conventional methods.We then obtain an accurate solution to the acoustic Helmholtz equation by solving only a small linear system instead of a large linear system constructed on the fine mesh in conventional methods.We verify our new method using several models of complicated heterogeneities, and the results show that our new multiscale method can solve the Helmholtz equation in complex models with high accuracy and extremely low computational costs.« less
An Efficient Multiscale Finite-Element Method for Frequency-Domain Seismic Wave Propagation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gao, Kai; Fu, Shubin; Chung, Eric T.

The frequency-domain seismic-wave equation, that is, the Helmholtz equation, has many important applications in seismological studies, yet is very challenging to solve, particularly for large geological models. Iterative solvers, domain decomposition, or parallel strategies can partially alleviate the computational burden, but these approaches may still encounter nontrivial difficulties in complex geological models where a sufficiently fine mesh is required to represent the fine-scale heterogeneities. We develop a novel numerical method to solve the frequency-domain acoustic wave equation on the basis of the multiscale finite-element theory. We discretize a heterogeneous model with a coarse mesh and employ carefully constructed high-order multiscalemore » basis functions to form the basis space for the coarse mesh. Solved from medium- and frequency-dependent local problems, these multiscale basis functions can effectively capture themedium’s fine-scale heterogeneity and the source’s frequency information, leading to a discrete system matrix with a much smaller dimension compared with those from conventional methods.We then obtain an accurate solution to the acoustic Helmholtz equation by solving only a small linear system instead of a large linear system constructed on the fine mesh in conventional methods.We verify our new method using several models of complicated heterogeneities, and the results show that our new multiscale method can solve the Helmholtz equation in complex models with high accuracy and extremely low computational costs.« less
A Systolic Architecture for Singular Value Decomposition,

DTIC Science & Technology

1983-01-01

Presented at the 1 st International Colloquium on Vector and Parallel Computing in Scientific Applications, Paris, March 191J Contract N00014-82-K.0703...Gene Golub. Private comunication . given inputs x and n 2 , compute 2 2 2 2 /6/ G. H. Golub and F. T. Luk : "Singular Value I + X1 Decomposition
Parallel processing methods for space based power systems

NASA Technical Reports Server (NTRS)

Berry, F. C.

1993-01-01

This report presents a method for doing load-flow analysis of a power system by using a decomposition approach. The power system for the Space Shuttle is used as a basis to build a model for the load-flow analysis. To test the decomposition method for doing load-flow analysis, simulations were performed on power systems of 16, 25, 34, 43, 52, 61, 70, and 79 nodes. Each of the power systems was divided into subsystems and simulated under steady-state conditions. The results from these tests have been found to be as accurate as tests performed using a standard serial simulator. The division of the power systems into different subsystems was done by assigning a processor to each area. There were 13 transputers available, therefore, up to 13 different subsystems could be simulated at the same time. This report has preliminary results for a load-flow analysis using a decomposition principal. The report shows that the decomposition algorithm for load-flow analysis is well suited for parallel processing and provides increases in the speed of execution.
Parallel Tensor Compression for Large-Scale Scientific Data.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kolda, Tamara G.; Ballard, Grey; Austin, Woody Nathan

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data. By viewing the data as a dense five way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 10000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed memorymore » parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.« less

Use of the Morlet mother wavelet in the frequency-scale domain decomposition technique for the modal identification of ambient vibration responses

NASA Astrophysics Data System (ADS)

Le, Thien-Phu

2017-10-01

The frequency-scale domain decomposition technique has recently been proposed for operational modal analysis. The technique is based on the Cauchy mother wavelet. In this paper, the approach is extended to the Morlet mother wavelet, which is very popular in signal processing due to its superior time-frequency localization. Based on the regressive form and an appropriate norm of the Morlet mother wavelet, the continuous wavelet transform of the power spectral density of ambient responses enables modes in the frequency-scale domain to be highlighted. Analytical developments first demonstrate the link between modal parameters and the local maxima of the continuous wavelet transform modulus. The link formula is then used as the foundation of the proposed modal identification method. Its practical procedure, combined with the singular value decomposition algorithm, is presented step by step. The proposition is finally verified using numerical examples and a laboratory test.
Development and parallelization of a direct numerical simulation to study the formation and transport of nanoparticle clusters in a viscous fluid

NASA Astrophysics Data System (ADS)

Sloan, Gregory James

The direct numerical simulation (DNS) offers the most accurate approach to modeling the behavior of a physical system, but carries an enormous computation cost. There exists a need for an accurate DNS to model the coupled solid-fluid system seen in targeted drug delivery (TDD), nanofluid thermal energy storage (TES), as well as other fields where experiments are necessary, but experiment design may be costly. A parallel DNS can greatly reduce the large computation times required, while providing the same results and functionality of the serial counterpart. A D2Q9 lattice Boltzmann method approach was implemented to solve the fluid phase. The use of domain decomposition with message passing interface (MPI) parallelism resulted in an algorithm that exhibits super-linear scaling in testing, which may be attributed to the caching effect. Decreased performance on a per-node basis for a fixed number of processes confirms this observation. A multiscale approach was implemented to model the behavior of nanoparticles submerged in a viscous fluid, and used to examine the mechanisms that promote or inhibit clustering. Parallelization of this model using a masterworker algorithm with MPI gives less-than-linear speedup for a fixed number of particles and varying number of processes. This is due to the inherent inefficiency of the master-worker approach. Lastly, these separate simulations are combined, and two-way coupling is implemented between the solid and fluid.
High-performance parallel analysis of coupled problems for aircraft propulsion

NASA Technical Reports Server (NTRS)

Felippa, C. A.; Farhat, C.; Chen, P.-S.; Gumaste, U.; Leoinne, M.; Stern, P.

1995-01-01

This research program deals with the application of high-performance computing methods to the numerical simulation of complete jet engines. The program was initiated in 1993 by applying two-dimensional parallel aeroelastic codes to the interior gas flow problem of a by-pass jet engine. The fluid mesh generation, domain decomposition and solution capabilities were successfully tested. Attention was then focused on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion driven by these structural displacements. The latter is treated by an ALE technique that models the fluid mesh motion as that of a fictitious mechanical network laid along the edges of near-field fluid elements. New partitioned analysis procedures to treat this coupled 3-component problem were developed in 1994. These procedures involved delayed corrections and subcycling, and have been successfully tested on several massively parallel computers. For the global steady-state axisymmetric analysis of a complete engine we have decided to use the NASA-sponsored ENG10 program, which uses a regular FV-multiblock-grid discretization in conjunction with circumferential averaging to include effects of blade forces, loss, combustor heat addition, blockage, bleeds and convective mixing. A load-balancing preprocessor for parallel versions of ENG10 has been developed. It is planned to use the steady-state global solution provided by ENG10 as input to a localized three-dimensional FSI analysis for engine regions where aeroelastic effects may be important.
Mechanical and Assembly Units of Viral Capsids Identified via Quasi-Rigid Domain Decomposition

PubMed Central

Polles, Guido; Indelicato, Giuliana; Potestio, Raffaello; Cermelli, Paolo; Twarock, Reidun; Micheletti, Cristian

2013-01-01

Key steps in a viral life-cycle, such as self-assembly of a protective protein container or in some cases also subsequent maturation events, are governed by the interplay of physico-chemical mechanisms involving various spatial and temporal scales. These salient aspects of a viral life cycle are hence well described and rationalised from a mesoscopic perspective. Accordingly, various experimental and computational efforts have been directed towards identifying the fundamental building blocks that are instrumental for the mechanical response, or constitute the assembly units, of a few specific viral shells. Motivated by these earlier studies we introduce and apply a general and efficient computational scheme for identifying the stable domains of a given viral capsid. The method is based on elastic network models and quasi-rigid domain decomposition. It is first applied to a heterogeneous set of well-characterized viruses (CCMV, MS2, STNV, STMV) for which the known mechanical or assembly domains are correctly identified. The validated method is next applied to other viral particles such as L-A, Pariacoto and polyoma viruses, whose fundamental functional domains are still unknown or debated and for which we formulate verifiable predictions. The numerical code implementing the domain decomposition strategy is made freely available. PMID:24244139
The Multiscale Robin Coupled Method for flows in porous media

NASA Astrophysics Data System (ADS)

Guiraldello, Rafael T.; Ausas, Roberto F.; Sousa, Fabricio S.; Pereira, Felipe; Buscaglia, Gustavo C.

2018-02-01

A multiscale mixed method aiming at the accurate approximation of velocity and pressure fields in heterogeneous porous media is proposed. The procedure is based on a new domain decomposition method in which the local problems are subject to Robin boundary conditions. The domain decomposition procedure is defined in terms of two independent spaces on the skeleton of the decomposition, corresponding to interface pressures and fluxes, that can be chosen with great flexibility to accommodate local features of the underlying permeability fields. The well-posedness of the new domain decomposition procedure is established and its connection with the method of Douglas et al. (1993) [12], is identified, also allowing us to reinterpret the known procedure as an optimized Schwarz (or Two-Lagrange-Multiplier) method. The multiscale property of the new domain decomposition method is indicated, and its relation with the Multiscale Mortar Mixed Finite Element Method (MMMFEM) and the Multiscale Hybrid-Mixed (MHM) Finite Element Method is discussed. Numerical simulations are presented aiming at illustrating several features of the new method. Initially we illustrate the possibility of switching from MMMFEM to MHM by suitably varying the Robin condition parameter in the new multiscale method. Then we turn our attention to realistic flows in high-contrast, channelized porous formations. We show that for a range of values of the Robin condition parameter our method provides better approximations for pressure and velocity than those computed with either the MMMFEM and the MHM. This is an indication that our method has the potential to produce more accurate velocity fields in the presence of rough, realistic permeability fields of petroleum reservoirs.
Domain decomposition and matching for time-domain analysis of motions of ships advancing in head sea

NASA Astrophysics Data System (ADS)

Tang, Kai; Zhu, Ren-chuan; Miao, Guo-ping; Fan, Ju

2014-08-01

A domain decomposition and matching method in the time-domain is outlined for simulating the motions of ships advancing in waves. The flow field is decomposed into inner and outer domains by an imaginary control surface, and the Rankine source method is applied to the inner domain while the transient Green function method is used in the outer domain. Two initial boundary value problems are matched on the control surface. The corresponding numerical codes are developed, and the added masses, wave exciting forces and ship motions advancing in head sea for Series 60 ship and S175 containership, are presented and verified. A good agreement has been obtained when the numerical results are compared with the experimental data and other references. It shows that the present method is more efficient because of the panel discretization only in the inner domain during the numerical calculation, and good numerical stability is proved to avoid divergence problem regarding ships with flare.
Robust parallel iterative solvers for linear and least-squares problems, Final Technical Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saad, Yousef

2014-01-16

The primary goal of this project is to study and develop robust iterative methods for solving linear systems of equations and least squares systems. The focus of the Minnesota team is on algorithms development, robustness issues, and on tests and validation of the methods on realistic problems. 1. The project begun with an investigation on how to practically update a preconditioner obtained from an ILU-type factorization, when the coefficient matrix changes. 2. We investigated strategies to improve robustness in parallel preconditioners in a specific case of a PDE with discontinuous coefficients. 3. We explored ways to adapt standard preconditioners formore » solving linear systems arising from the Helmholtz equation. These are often difficult linear systems to solve by iterative methods. 4. We have also worked on purely theoretical issues related to the analysis of Krylov subspace methods for linear systems. 5. We developed an effective strategy for performing ILU factorizations for the case when the matrix is highly indefinite. The strategy uses shifting in some optimal way. The method was extended to the solution of Helmholtz equations by using complex shifts, yielding very good results in many cases. 6. We addressed the difficult problem of preconditioning sparse systems of equations on GPUs. 7. A by-product of the above work is a software package consisting of an iterative solver library for GPUs based on CUDA. This was made publicly available. It was the first such library that offers complete iterative solvers for GPUs. 8. We considered another form of ILU which blends coarsening techniques from Multigrid with algebraic multilevel methods. 9. We have released a new version on our parallel solver - called pARMS [new version is version 3]. As part of this we have tested the code in complex settings - including the solution of Maxwell and Helmholtz equations and for a problem of crystal growth.10. As an application of polynomial preconditioning we considered the problem of evaluating f(A)v which arises in statistical sampling. 11. As an application to the methods we developed, we tackled the problem of computing the diagonal of the inverse of a matrix. This arises in statistical applications as well as in many applications in physics. We explored probing methods as well as domain-decomposition type methods. 12. A collaboration with researchers from Toulouse, France, considered the important problem of computing the Schur complement in a domain-decomposition approach. 13. We explored new ways of preconditioning linear systems, based on low-rank approximations.« less
Domain Decomposition with Local Mesh Refinement.

DTIC Science & Technology

1989-08-01

smoothi coefficients, or non-smooth solui ioni,. We eiriplov fromn 1 to 1024 tiles on problems containing irp to 161K (degrees of freedom. Though io... methodology survives such compromises and is even sequentially advantageous in many problems. The domain decomposition algorithms we employ (sertiun 3...iog( I + !J2 it - g i Ol Qunit squiare 1 he (,mai oive i> Hie outward normal. lfie sevoh iih exam pie, from [1. 27] has a smoothi solution, but rapidlY
Multi-Objectivising Combinatorial Optimisation Problems by Means of Elementary Landscape Decompositions.

PubMed

Ceberio, Josu; Calvo, Borja; Mendiburu, Alexander; Lozano, Jose A

2018-02-15

In the last decade, many works in combinatorial optimisation have shown that, due to the advances in multi-objective optimisation, the algorithms from this field could be used for solving single-objective problems as well. In this sense, a number of papers have proposed multi-objectivising single-objective problems in order to use multi-objective algorithms in their optimisation. In this article, we follow up this idea by presenting a methodology for multi-objectivising combinatorial optimisation problems based on elementary landscape decompositions of their objective function. Under this framework, each of the elementary landscapes obtained from the decomposition is considered as an independent objective function to optimise. In order to illustrate this general methodology, we consider four problems from different domains: the quadratic assignment problem and the linear ordering problem (permutation domain), the 0-1 unconstrained quadratic optimisation problem (binary domain), and the frequency assignment problem (integer domain). We implemented two widely known multi-objective algorithms, NSGA-II and SPEA2, and compared their performance with that of a single-objective GA. The experiments conducted on a large benchmark of instances of the four problems show that the multi-objective algorithms clearly outperform the single-objective approaches. Furthermore, a discussion on the results suggests that the multi-objective space generated by this decomposition enhances the exploration ability, thus permitting NSGA-II and SPEA2 to obtain better results in the majority of the tested instances.
Research in computer science

NASA Technical Reports Server (NTRS)

Ortega, J. M.

1986-01-01

Various graduate research activities in the field of computer science are reported. Among the topics discussed are: (1) failure probabilities in multi-version software; (2) Gaussian Elimination on parallel computers; (3) three dimensional Poisson solvers on parallel/vector computers; (4) automated task decomposition for multiple robot arms; (5) multi-color incomplete cholesky conjugate gradient methods on the Cyber 205; and (6) parallel implementation of iterative methods for solving linear equations.
Wiimote Experiments: 3-D Inclined Plane Problem for Reinforcing the Vector Concept

ERIC Educational Resources Information Center

Kawam, Alae; Kouh, Minjoon

2011-01-01

In an introductory physics course where students first learn about vectors, they oftentimes struggle with the concept of vector addition and decomposition. For example, the classic physics problem involving a mass on an inclined plane requires the decomposition of the force of gravity into two directions that are parallel and perpendicular to the…
A Cyber-ITS Framework for Massive Traffic Data Analysis Using Cyber Infrastructure

PubMed Central

Fontaine, Michael D.

2013-01-01

Traffic data is commonly collected from widely deployed sensors in urban areas. This brings up a new research topic, data-driven intelligent transportation systems (ITSs), which means to integrate heterogeneous traffic data from different kinds of sensors and apply it for ITS applications. This research, taking into consideration the significant increase in the amount of traffic data and the complexity of data analysis, focuses mainly on the challenge of solving data-intensive and computation-intensive problems. As a solution to the problems, this paper proposes a Cyber-ITS framework to perform data analysis on Cyber Infrastructure (CI), by nature parallel-computing hardware and software systems, in the context of ITS. The techniques of the framework include data representation, domain decomposition, resource allocation, and parallel processing. All these techniques are based on data-driven and application-oriented models and are organized as a component-and-workflow-based model in order to achieve technical interoperability and data reusability. A case study of the Cyber-ITS framework is presented later based on a traffic state estimation application that uses the fusion of massive Sydney Coordinated Adaptive Traffic System (SCATS) data and GPS data. The results prove that the Cyber-ITS-based implementation can achieve a high accuracy rate of traffic state estimation and provide a significant computational speedup for the data fusion by parallel computing. PMID:23766690
A Cyber-ITS framework for massive traffic data analysis using cyber infrastructure.

PubMed

Xia, Yingjie; Hu, Jia; Fontaine, Michael D

2013-01-01

Traffic data is commonly collected from widely deployed sensors in urban areas. This brings up a new research topic, data-driven intelligent transportation systems (ITSs), which means to integrate heterogeneous traffic data from different kinds of sensors and apply it for ITS applications. This research, taking into consideration the significant increase in the amount of traffic data and the complexity of data analysis, focuses mainly on the challenge of solving data-intensive and computation-intensive problems. As a solution to the problems, this paper proposes a Cyber-ITS framework to perform data analysis on Cyber Infrastructure (CI), by nature parallel-computing hardware and software systems, in the context of ITS. The techniques of the framework include data representation, domain decomposition, resource allocation, and parallel processing. All these techniques are based on data-driven and application-oriented models and are organized as a component-and-workflow-based model in order to achieve technical interoperability and data reusability. A case study of the Cyber-ITS framework is presented later based on a traffic state estimation application that uses the fusion of massive Sydney Coordinated Adaptive Traffic System (SCATS) data and GPS data. The results prove that the Cyber-ITS-based implementation can achieve a high accuracy rate of traffic state estimation and provide a significant computational speedup for the data fusion by parallel computing.
Computation of forces arising from the polarizable continuum model within the domain-decomposition paradigm

NASA Astrophysics Data System (ADS)

Gatto, Paolo; Lipparini, Filippo; Stamm, Benjamin

2017-12-01

The domain-decomposition (dd) paradigm, originally introduced for the conductor-like screening model, has been recently extended to the dielectric Polarizable Continuum Model (PCM), resulting in the ddPCM method. We present here a complete derivation of the analytical derivatives of the ddPCM energy with respect to the positions of the solute's atoms and discuss their efficient implementation. As it is the case for the energy, we observe a quadratic scaling, which is discussed and demonstrated with numerical tests.
A study of the parallel algorithm for large-scale DC simulation of nonlinear systems

NASA Astrophysics Data System (ADS)

Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel

Newton-Raphson DC analysis of large-scale nonlinear circuits may be an extremely time consuming process even if sparse matrix techniques and bypassing of nonlinear models calculation are used. A slight decrease in the time required for this task may be enabled on multi-core, multithread computers if the calculation of the mathematical models for the nonlinear elements as well as the stamp management of the sparse matrix entries are managed through concurrent processes. This numerical complexity can be further reduced via the circuit decomposition and parallel solution of blocks taking as a departure point the BBD matrix structure. This block-parallel approach may give a considerable profit though it is strongly dependent on the system topology and, of course, on the processor type. This contribution presents the easy-parallelizable decomposition-based algorithm for DC simulation and provides a detailed study of its effectiveness.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Vincenti, H.; Vay, J. -L.

Due to discretization effects and truncation to finite domains, many electromagnetic simulations present non-physical modifications of Maxwell's equations in space that may generate spurious signals affecting the overall accuracy of the result. Such modifications for instance occur when Perfectly Matched Layers (PMLs) are used at simulation domain boundaries to simulate open media. Another example is the use of arbitrary order Maxwell solver with domain decomposition technique that may under some condition involve stencil truncations at subdomain boundaries, resulting in small spurious errors that do eventually build up. In each case, a careful evaluation of the characteristics and magnitude of themore » errors resulting from these approximations, and their impact at any frequency and angle, requires detailed analytical and numerical studies. To this end, we present a general analytical approach that enables the evaluation of numerical discretization errors of fully three-dimensional arbitrary order finite-difference Maxwell solver, with arbitrary modification of the local stencil in the simulation domain. The analytical model is validated against simulations of domain decomposition technique and PMLs, when these are used with very high-order Maxwell solver, as well as in the infinite order limit of pseudo-spectral solvers. Results confirm that the new analytical approach enables exact predictions in each case. It also confirms that the domain decomposition technique can be used with very high-order Maxwell solver and a reasonably low number of guard cells with negligible effects on the whole accuracy of the simulation.« less
High-Performance Parallel Analysis of Coupled Problems for Aircraft Propulsion

NASA Technical Reports Server (NTRS)

Felippa, C. A.; Farhat, C.; Park, K. C.; Gumaste, U.; Chen, P.-S.; Lesoinne, M.; Stern, P.

1996-01-01

This research program dealt with the application of high-performance computing methods to the numerical simulation of complete jet engines. The program was initiated in January 1993 by applying two-dimensional parallel aeroelastic codes to the interior gas flow problem of a bypass jet engine. The fluid mesh generation, domain decomposition and solution capabilities were successfully tested. Attention was then focused on methodology for the partitioned analysis of the interaction of the gas flow with a flexible structure and with the fluid mesh motion driven by these structural displacements. The latter is treated by a ALE technique that models the fluid mesh motion as that of a fictitious mechanical network laid along the edges of near-field fluid elements. New partitioned analysis procedures to treat this coupled three-component problem were developed during 1994 and 1995. These procedures involved delayed corrections and subcycling, and have been successfully tested on several massively parallel computers, including the iPSC-860, Paragon XP/S and the IBM SP2. For the global steady-state axisymmetric analysis of a complete engine we have decided to use the NASA-sponsored ENG10 program, which uses a regular FV-multiblock-grid discretization in conjunction with circumferential averaging to include effects of blade forces, loss, combustor heat addition, blockage, bleeds and convective mixing. A load-balancing preprocessor tor parallel versions of ENG10 was developed. During 1995 and 1996 we developed the capability tor the first full 3D aeroelastic simulation of a multirow engine stage. This capability was tested on the IBM SP2 parallel supercomputer at NASA Ames. Benchmark results were presented at the 1196 Computational Aeroscience meeting.
Algorithms and Application of Sparse Matrix Assembly and Equation Solvers for Aeroacoustics

NASA Technical Reports Server (NTRS)

Watson, W. R.; Nguyen, D. T.; Reddy, C. J.; Vatsa, V. N.; Tang, W. H.

2001-01-01

An algorithm for symmetric sparse equation solutions on an unstructured grid is described. Efficient, sequential sparse algorithms for degree-of-freedom reordering, supernodes, symbolic/numerical factorization, and forward backward solution phases are reviewed. Three sparse algorithms for the generation and assembly of symmetric systems of matrix equations are presented. The accuracy and numerical performance of the sequential version of the sparse algorithms are evaluated over the frequency range of interest in a three-dimensional aeroacoustics application. Results show that the solver solutions are accurate using a discretization of 12 points per wavelength. Results also show that the first assembly algorithm is impractical for high-frequency noise calculations. The second and third assembly algorithms have nearly equal performance at low values of source frequencies, but at higher values of source frequencies the third algorithm saves CPU time and RAM. The CPU time and the RAM required by the second and third assembly algorithms are two orders of magnitude smaller than that required by the sparse equation solver. A sequential version of these sparse algorithms can, therefore, be conveniently incorporated into a substructuring for domain decomposition formulation to achieve parallel computation, where different substructures are handles by different parallel processors.
Matrix-Inversion-Free Compressed Sensing With Variable Orthogonal Multi-Matching Pursuit Based on Prior Information for ECG Signals.

PubMed

Cheng, Yih-Chun; Tsai, Pei-Yun; Huang, Ming-Hao

2016-05-19

Low-complexity compressed sensing (CS) techniques for monitoring electrocardiogram (ECG) signals in wireless body sensor network (WBSN) are presented. The prior probability of ECG sparsity in the wavelet domain is first exploited. Then, variable orthogonal multi-matching pursuit (vOMMP) algorithm that consists of two phases is proposed. In the first phase, orthogonal matching pursuit (OMP) algorithm is adopted to effectively augment the support set with reliable indices and in the second phase, the orthogonal multi-matching pursuit (OMMP) is employed to rescue the missing indices. The reconstruction performance is thus enhanced with the prior information and the vOMMP algorithm. Furthermore, the computation-intensive pseudo-inverse operation is simplified by the matrix-inversion-free (MIF) technique based on QR decomposition. The vOMMP-MIF CS decoder is then implemented in 90 nm CMOS technology. The QR decomposition is accomplished by two systolic arrays working in parallel. The implementation supports three settings for obtaining 40, 44, and 48 coefficients in the sparse vector. From the measurement result, the power consumption is 11.7 mW at 0.9 V and 12 MHz. Compared to prior chip implementations, our design shows good hardware efficiency and is suitable for low-energy applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Shim, Yunsic; Amar, Jacques G.

While temperature-accelerated dynamics (TAD) is a powerful method for carrying out non-equilibrium simulations of systems over extended time scales, the computational cost of serial TAD increases approximately as N{sup 3} where N is the number of atoms. In addition, although a parallel TAD method based on domain decomposition [Y. Shim et al., Phys. Rev. B 76, 205439 (2007)] has been shown to provide significantly improved scaling, the dynamics in such an approach is only approximate while the size of activated events is limited by the spatial decomposition size. Accordingly, it is of interest to develop methods to improve the scalingmore » of serial TAD. As a first step in understanding the factors which determine the scaling behavior, we first present results for the overall scaling of serial TAD and its components, which were obtained from simulations of Ag/Ag(100) growth and Ag/Ag(100) annealing, and compare with theoretical predictions. We then discuss two methods based on localization which may be used to address two of the primary “bottlenecks” to the scaling of serial TAD with system size. By implementing both of these methods, we find that for intermediate system-sizes, the scaling is improved by almost a factor of N{sup 1/2}. Some additional possible methods to improve the scaling of TAD are also discussed.« less

Improved scaling of temperature-accelerated dynamics using localization

NASA Astrophysics Data System (ADS)

Shim, Yunsic; Amar, Jacques G.

2016-07-01

While temperature-accelerated dynamics (TAD) is a powerful method for carrying out non-equilibrium simulations of systems over extended time scales, the computational cost of serial TAD increases approximately as N3 where N is the number of atoms. In addition, although a parallel TAD method based on domain decomposition [Y. Shim et al., Phys. Rev. B 76, 205439 (2007)] has been shown to provide significantly improved scaling, the dynamics in such an approach is only approximate while the size of activated events is limited by the spatial decomposition size. Accordingly, it is of interest to develop methods to improve the scaling of serial TAD. As a first step in understanding the factors which determine the scaling behavior, we first present results for the overall scaling of serial TAD and its components, which were obtained from simulations of Ag/Ag(100) growth and Ag/Ag(100) annealing, and compare with theoretical predictions. We then discuss two methods based on localization which may be used to address two of the primary "bottlenecks" to the scaling of serial TAD with system size. By implementing both of these methods, we find that for intermediate system-sizes, the scaling is improved by almost a factor of N1/2. Some additional possible methods to improve the scaling of TAD are also discussed.
A General-purpose Framework for Parallel Processing of Large-scale LiDAR Data

NASA Astrophysics Data System (ADS)

Li, Z.; Hodgson, M.; Li, W.

2016-12-01

Light detection and ranging (LiDAR) technologies have proven efficiency to quickly obtain very detailed Earth surface data for a large spatial extent. Such data is important for scientific discoveries such as Earth and ecological sciences and natural disasters and environmental applications. However, handling LiDAR data poses grand geoprocessing challenges due to data intensity and computational intensity. Previous studies received notable success on parallel processing of LiDAR data to these challenges. However, these studies either relied on high performance computers and specialized hardware (GPUs) or focused mostly on finding customized solutions for some specific algorithms. We developed a general-purpose scalable framework coupled with sophisticated data decomposition and parallelization strategy to efficiently handle big LiDAR data. Specifically, 1) a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerable Hadoop distributed file system, 2) two spatial decomposition techniques are developed to enable efficient parallelization of different types of LiDAR processing tasks, and 3) by coupling existing LiDAR processing tools with Hadoop, this framework is able to conduct a variety of LiDAR data processing tasks in parallel in a highly scalable distributed computing environment. The performance and scalability of the framework is evaluated with a series of experiments conducted on a real LiDAR dataset using a proof-of-concept prototype system. The results show that the proposed framework 1) is able to handle massive LiDAR data more efficiently than standalone tools; and 2) provides almost linear scalability in terms of either increased workload (data volume) or increased computing nodes with both spatial decomposition strategies. We believe that the proposed framework provides valuable references on developing a collaborative cyberinfrastructure for processing big earth science data in a highly scalable environment.
Performance of the Wavelet Decomposition on Massively Parallel Architectures

NASA Technical Reports Server (NTRS)

El-Ghazawi, Tarek A.; LeMoigne, Jacqueline; Zukor, Dorothy (Technical Monitor)

2001-01-01

Traditionally, Fourier Transforms have been utilized for performing signal analysis and representation. But although it is straightforward to reconstruct a signal from its Fourier transform, no local description of the signal is included in its Fourier representation. To alleviate this problem, Windowed Fourier transforms and then wavelet transforms have been introduced, and it has been proven that wavelets give a better localization than traditional Fourier transforms, as well as a better division of the time- or space-frequency plane than Windowed Fourier transforms. Because of these properties and after the development of several fast algorithms for computing the wavelet representation of any signal, in particular the Multi-Resolution Analysis (MRA) developed by Mallat, wavelet transforms have increasingly been applied to signal analysis problems, especially real-life problems, in which speed is critical. In this paper we present and compare efficient wavelet decomposition algorithms on different parallel architectures. We report and analyze experimental measurements, using NASA remotely sensed images. Results show that our algorithms achieve significant performance gains on current high performance parallel systems, and meet scientific applications and multimedia requirements. The extensive performance measurements collected over a number of high-performance computer systems have revealed important architectural characteristics of these systems, in relation to the processing demands of the wavelet decomposition of digital images.
Aerodynamic Shape Optimization of Supersonic Aircraft Configurations via an Adjoint Formulation on Parallel Computers

NASA Technical Reports Server (NTRS)

Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony

1996-01-01

This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics methods (CFD). In our earlier studies, the serial implementation of this design method was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations. In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that this basic methodology could be ported to distributed memory parallel computing architectures. In this paper, our concern will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
FaCSI: A block parallel preconditioner for fluid-structure interaction in hemodynamics

NASA Astrophysics Data System (ADS)

Deparis, Simone; Forti, Davide; Grandperrin, Gwenol; Quarteroni, Alfio

2016-12-01

Modeling Fluid-Structure Interaction (FSI) in the vascular system is mandatory to reliably compute mechanical indicators in vessels undergoing large deformations. In order to cope with the computational complexity of the coupled 3D FSI problem after discretizations in space and time, a parallel solution is often mandatory. In this paper we propose a new block parallel preconditioner for the coupled linearized FSI system obtained after space and time discretization. We name it FaCSI to indicate that it exploits the Factorized form of the linearized FSI matrix, the use of static Condensation to formally eliminate the interface degrees of freedom of the fluid equations, and the use of a SIMPLE preconditioner for saddle-point problems. FaCSI is built upon a block Gauss-Seidel factorization of the FSI Jacobian matrix and it uses ad-hoc preconditioners for each physical component of the coupled problem, namely the fluid, the structure and the geometry. In the fluid subproblem, after operating static condensation of the interface fluid variables, we use a SIMPLE preconditioner on the reduced fluid matrix. Moreover, to efficiently deal with a large number of processes, FaCSI exploits efficient single field preconditioners, e.g., based on domain decomposition or the multigrid method. We measure the parallel performances of FaCSI on a benchmark cylindrical geometry and on a problem of physiological interest, namely the blood flow through a patient-specific femoropopliteal bypass. We analyze the dependence of the number of linear solver iterations on the cores count (scalability of the preconditioner) and on the mesh size (optimality).
Novel scheme for rapid parallel parameter estimation of gravitational waves from compact binary coalescences

NASA Astrophysics Data System (ADS)

Pankow, C.; Brady, P.; Ochsner, E.; O'Shaughnessy, R.

2015-07-01

We introduce a highly parallelizable architecture for estimating parameters of compact binary coalescence using gravitational-wave data and waveform models. Using a spherical harmonic mode decomposition, the waveform is expressed as a sum over modes that depend on the intrinsic parameters (e.g., masses) with coefficients that depend on the observer dependent extrinsic parameters (e.g., distance, sky position). The data is then prefiltered against those modes, at fixed intrinsic parameters, enabling efficiently evaluation of the likelihood for generic source positions and orientations, independent of waveform length or generation time. We efficiently parallelize our intrinsic space calculation by integrating over all extrinsic parameters using a Monte Carlo integration strategy. Since the waveform generation and prefiltering happens only once, the cost of integration dominates the procedure. Also, we operate hierarchically, using information from existing gravitational-wave searches to identify the regions of parameter space to emphasize in our sampling. As proof of concept and verification of the result, we have implemented this algorithm using standard time-domain waveforms, processing each event in less than one hour on recent computing hardware. For most events we evaluate the marginalized likelihood (evidence) with statistical errors of ≲5 %, and even smaller in many cases. With a bounded runtime independent of the waveform model starting frequency, a nearly unchanged strategy could estimate neutron star (NS)-NS parameters in the 2018 advanced LIGO era. Our algorithm is usable with any noise curve and existing time-domain model at any mass, including some waveforms which are computationally costly to evolve.
Simulation of hypersonic rarefied flows with the immersed-boundary method

NASA Astrophysics Data System (ADS)

Bruno, D.; De Palma, P.; de Tullio, M. D.

2011-05-01

This paper provides a validation of an immersed boundary method for computing hypersonic rarefied gas flows. The method is based on the solution of the Navier-Stokes equation and is validated versus numerical results obtained by the DSMC approach. The Navier-Stokes solver employs a flexible local grid refinement technique and is implemented on parallel machines using a domain-decomposition approach. Thanks to the efficient grid generation process, based on the ray-tracing technique, and the use of the METIS software, it is possible to obtain the partitioned grids to be assigned to each processor with a minimal effort by the user. This allows one to by-pass the expensive (in terms of time and human resources) classical generation process of a body fitted grid. First-order slip-velocity boundary conditions are employed and tested for taking into account rarefied gas effects.
A DAFT DL_POLY distributed memory adaptation of the Smoothed Particle Mesh Ewald method

NASA Astrophysics Data System (ADS)

Bush, I. J.; Todorov, I. T.; Smith, W.

2006-09-01

The Smoothed Particle Mesh Ewald method [U. Essmann, L. Perera, M.L. Berkowtz, T. Darden, H. Lee, L.G. Pedersen, J. Chem. Phys. 103 (1995) 8577] for calculating long ranged forces in molecular simulation has been adapted for the parallel molecular dynamics code DL_POLY_3 [I.T. Todorov, W. Smith, Philos. Trans. Roy. Soc. London 362 (2004) 1835], making use of a novel 3D Fast Fourier Transform (DAFT) [I.J. Bush, The Daresbury Advanced Fourier transform, Daresbury Laboratory, 1999] that perfectly matches the Domain Decomposition (DD) parallelisation strategy [W. Smith, Comput. Phys. Comm. 62 (1991) 229; M.R.S. Pinches, D. Tildesley, W. Smith, Mol. Sim. 6 (1991) 51; D. Rapaport, Comput. Phys. Comm. 62 (1991) 217] of the DL_POLY_3 code. In this article we describe software adaptations undertaken to import this functionality and provide a review of its performance.
On iterative processes in the Krylov-Sonneveld subspaces

NASA Astrophysics Data System (ADS)

Ilin, Valery P.

2016-10-01

The iterative Induced Dimension Reduction (IDR) methods are considered for solving large systems of linear algebraic equations (SLAEs) with nonsingular nonsymmetric matrices. These approaches are investigated by many authors and are charachterized sometimes as the alternative to the classical processes of Krylov type. The key moments of the IDR algorithms consist in the construction of the embedded Sonneveld subspaces, which have the decreasing dimensions and use the orthogonalization to some fixed subspace. Other independent approaches for research and optimization of the iterations are based on the augmented and modified Krylov subspaces by using the aggregation and deflation procedures with present various low rank approximations of the original matrices. The goal of this paper is to show, that IDR method in Sonneveld subspaces present an original interpretation of the modified algorithms in the Krylov subspaces. In particular, such description is given for the multi-preconditioned semi-conjugate direction methods which are actual for the parallel algebraic domain decomposition approaches.
Preparation and transport properties of superconducting layers in the Ca-Sr-Bi-Cu-O system

NASA Astrophysics Data System (ADS)

Klee, M.; Stollman, G. M.; Stotz, S.; de Vries, J. W. C.

1988-08-01

Superconducting layers in the CaSrBiCuO system are prepared by thermal decomposition of metal carboxylates using a spin-coating and a dip-coating method onto ceramic MgO substrates. The samples consist of a tetragonal calcium-strontium-bismuth-cuprate and two bismuth-free calcium-strontium-cuprates. A step in the resistance versus temperature curve is observed which, together with the influence of magnetic fields, is interpreted as typical for a granular superconductor. The analysis shows that the critical current density is determined by domains of the order of some unit cells. The strong dependence of the superconducting transition on the orientation of an applied magnetic field is probably caused by the anisotropic layer structure. The coherence length perpendicular to the c-axis of the material is estimated to be ξab(0) = 4.0 nm and parallel to the c-axis ξc(0) = 0.6 nm.
Numeric Modified Adomian Decomposition Method for Power System Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dimitrovski, Aleksandar D; Simunovic, Srdjan; Pannala, Sreekanth

This paper investigates the applicability of numeric Wazwaz El Sayed modified Adomian Decomposition Method (WES-ADM) for time domain simulation of power systems. WESADM is a numerical method based on a modified Adomian decomposition (ADM) technique. WES-ADM is a numerical approximation method for the solution of nonlinear ordinary differential equations. The non-linear terms in the differential equations are approximated using Adomian polynomials. In this paper WES-ADM is applied to time domain simulations of multimachine power systems. WECC 3-generator, 9-bus system and IEEE 10-generator, 39-bus system have been used to test the applicability of the approach. Several fault scenarios have been tested.more » It has been found that the proposed approach is faster than the trapezoidal method with comparable accuracy.« less
Solution of the within-group multidimensional discrete ordinates transport equations on massively parallel architectures

NASA Astrophysics Data System (ADS)

Zerr, Robert Joseph

2011-12-01

The integral transport matrix method (ITMM) has been used as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells and between the cells and boundary surfaces. The main goals of this work were to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and performance of the developed methods for increasing number of processes. This project compares the effectiveness of the ITMM with the SI scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. The primary parallel solution method involves a decomposition of the domain into smaller spatial sub-domains, each with their own transport matrices, and coupled together via interface boundary angular fluxes. Each sub-domain has its own set of ITMM operators and represents an independent transport problem. Multiple iterative parallel solution methods have investigated, including parallel block Jacobi (PBJ), parallel red/black Gauss-Seidel (PGS), and parallel GMRES (PGMRES). The fastest observed parallel solution method, PGS, was used in a weak scaling comparison with the PARTISN code. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method without acceleration/preconditioning is not competitive for any problem parameters considered. The best comparisons occur for problems that are difficult for SI DSA, namely highly scattering and optically thick. SI DSA execution time curves are generally steeper than the PGS ones. However, until further testing is performed it cannot be concluded that SI DSA does not outperform the ITMM with PGS even on several thousand or tens of thousands of processors. The PGS method does outperform SI DSA for the periodic heterogeneous layers (PHL) configuration problems. Although this demonstrates a relative strength/weakness between the two methods, the practicality of these problems is much less, further limiting instances where it would be beneficial to select ITMM over SI DSA. The results strongly indicate a need for a robust, stable, and efficient acceleration method (or preconditioner for PGMRES). The spatial multigrid (SMG) method is currently incomplete in that it does not work for all cases considered and does not effectively improve the convergence rate for all values of scattering ratio c or cell dimension h. Nevertheless, it does display the desired trend for highly scattering, optically thin problems. That is, it tends to lower the rate of growth of number of iterations with increasing number of processes, P, while not increasing the number of additional operations per iteration to the extent that the total execution time of the rapidly converging accelerated iterations exceeds that of the slower unaccelerated iterations. A predictive parallel performance model has been developed for the PBJ method. Timing tests were performed such that trend lines could be fitted to the data for the different components and used to estimate the execution times. Applied to the weak scaling results, the model notably underestimates construction time, but combined with a slight overestimation in iterative solution time, the model predicts total execution time very well for large P. It also does a decent job with the strong scaling results, closely predicting the construction time and time per iteration, especially as P increases. Although not shown to be competitive up to 1,024 processing elements with the current state of the art, the parallelized ITMM exhibits promising scaling trends. Ultimately, compared to the KBA method, the parallelized ITMM may be found to be a very attractive option for transport calculations spatially decomposed over several tens of thousands of processes. Acceleration/preconditioning of the parallelized ITMM once developed will improve the convergence rate and improve its competitiveness. (Abstract shortened by UMI.)
A New Coarsening Operator for the Optimal Preconditioning of the Dual and Primal Domain Decomposition Methods: Application to Problems with Severe Coefficient Jumps

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Rixen, Daniel

1996-01-01

We present an optimal preconditioning algorithm that is equally applicable to the dual (FETI) and primal (Balancing) Schur complement domain decomposition methods, and which successfully addresses the problems of subdomain heterogeneities including the effects of large jumps of coefficients. The proposed preconditioner is derived from energy principles and embeds a new coarsening operator that propagates the error globally and accelerates convergence. The resulting iterative solver is illustrated with the solution of highly heterogeneous elasticity problems.
Detailed analysis of the effects of stencil spatial variations with arbitrary high-order finite-difference Maxwell solver

DOE PAGES

Vincenti, H.; Vay, J. -L.

2015-11-22

Due to discretization effects and truncation to finite domains, many electromagnetic simulations present non-physical modifications of Maxwell's equations in space that may generate spurious signals affecting the overall accuracy of the result. Such modifications for instance occur when Perfectly Matched Layers (PMLs) are used at simulation domain boundaries to simulate open media. Another example is the use of arbitrary order Maxwell solver with domain decomposition technique that may under some condition involve stencil truncations at subdomain boundaries, resulting in small spurious errors that do eventually build up. In each case, a careful evaluation of the characteristics and magnitude of themore » errors resulting from these approximations, and their impact at any frequency and angle, requires detailed analytical and numerical studies. To this end, we present a general analytical approach that enables the evaluation of numerical discretization errors of fully three-dimensional arbitrary order finite-difference Maxwell solver, with arbitrary modification of the local stencil in the simulation domain. The analytical model is validated against simulations of domain decomposition technique and PMLs, when these are used with very high-order Maxwell solver, as well as in the infinite order limit of pseudo-spectral solvers. Results confirm that the new analytical approach enables exact predictions in each case. It also confirms that the domain decomposition technique can be used with very high-order Maxwell solver and a reasonably low number of guard cells with negligible effects on the whole accuracy of the simulation.« less
Finite Element Methods for real-time Haptic Feedback of Soft-Tissue Models in Virtual Reality Simulators

NASA Technical Reports Server (NTRS)

Frank, Andreas O.; Twombly, I. Alexander; Barth, Timothy J.; Smith, Jeffrey D.; Dalton, Bonnie P. (Technical Monitor)

2001-01-01

We have applied the linear elastic finite element method to compute haptic force feedback and domain deformations of soft tissue models for use in virtual reality simulators. Our results show that, for virtual object models of high-resolution 3D data (>10,000 nodes), haptic real time computations (>500 Hz) are not currently possible using traditional methods. Current research efforts are focused in the following areas: 1) efficient implementation of fully adaptive multi-resolution methods and 2) multi-resolution methods with specialized basis functions to capture the singularity at the haptic interface (point loading). To achieve real time computations, we propose parallel processing of a Jacobi preconditioned conjugate gradient method applied to a reduced system of equations resulting from surface domain decomposition. This can effectively be achieved using reconfigurable computing systems such as field programmable gate arrays (FPGA), thereby providing a flexible solution that allows for new FPGA implementations as improved algorithms become available. The resulting soft tissue simulation system would meet NASA Virtual Glovebox requirements and, at the same time, provide a generalized simulation engine for any immersive environment application, such as biomedical/surgical procedures or interactive scientific applications.
A general CFD framework for fault-resilient simulations based on multi-resolution information fusion

NASA Astrophysics Data System (ADS)

Lee, Seungjoon; Kevrekidis, Ioannis G.; Karniadakis, George Em

2017-10-01

We develop a general CFD framework for multi-resolution simulations to target multiscale problems but also resilience in exascale simulations, where faulty processors may lead to gappy, in space-time, simulated fields. We combine approximation theory and domain decomposition together with statistical learning techniques, e.g. coKriging, to estimate boundary conditions and minimize communications by performing independent parallel runs. To demonstrate this new simulation approach, we consider two benchmark problems. First, we solve the heat equation (a) on a small number of spatial "patches" distributed across the domain, simulated by finite differences at fine resolution and (b) on the entire domain simulated at very low resolution, thus fusing multi-resolution models to obtain the final answer. Second, we simulate the flow in a lid-driven cavity in an analogous fashion, by fusing finite difference solutions obtained with fine and low resolution assuming gappy data sets. We investigate the influence of various parameters for this framework, including the correlation kernel, the size of a buffer employed in estimating boundary conditions, the coarseness of the resolution of auxiliary data, and the communication frequency across different patches in fusing the information at different resolution levels. In addition to its robustness and resilience, the new framework can be employed to generalize previous multiscale approaches involving heterogeneous discretizations or even fundamentally different flow descriptions, e.g. in continuum-atomistic simulations.
Parallel processing for pitch splitting decomposition

NASA Astrophysics Data System (ADS)

Barnes, Levi; Li, Yong; Wadkins, David; Biederman, Steve; Miloslavsky, Alex; Cork, Chris

2009-10-01

Decomposition of an input pattern in preparation for a double patterning process is an inherently global problem in which the influence of a local decomposition decision can be felt across an entire pattern. In spite of this, a large portion of the work can be massively distributed. Here, we discuss the advantages of geometric distribution for polygon operations with limited range of influence. Further, we have found that even the naturally global "coloring" step can, in large part, be handled in a geometrically local manner. In some practical cases, up to 70% of the work can be distributed geometrically. We also describe the methods for partitioning the problem into local pieces and present scaling data up to 100 CPUs. These techniques reduce DPT decomposition runtime by orders of magnitude.
Using Multiple Grids To Compute Flows

NASA Technical Reports Server (NTRS)

Rai, Man Mohan

1991-01-01

Paper discusses decomposition of global grids into multiple patched and/or overlaid local grids in computations of fluid flow. Such "domain decomposition" particularly useful in computation of flows about complicated bodies moving relative to each other; for example, flows associated with rotors and stators in turbomachinery and rotors and fuselages in helicopters.
Parallel adaptive wavelet collocation method for PDEs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nejadmalayeri, Alireza, E-mail: Alireza.Nejadmalayeri@gmail.com; Vezolainen, Alexei, E-mail: Alexei.Vezolainen@Colorado.edu; Brown-Dymkoski, Eric, E-mail: Eric.Browndymkoski@Colorado.edu

2015-10-01

A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allowsmore » fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.« less
Empirical projection-based basis-component decomposition method

NASA Astrophysics Data System (ADS)

Brendel, Bernhard; Roessl, Ewald; Schlomka, Jens-Peter; Proksa, Roland

2009-02-01

Advances in the development of semiconductor based, photon-counting x-ray detectors stimulate research in the domain of energy-resolving pre-clinical and clinical computed tomography (CT). For counting detectors acquiring x-ray attenuation in at least three different energy windows, an extended basis component decomposition can be performed in which in addition to the conventional approach of Alvarez and Macovski a third basis component is introduced, e.g., a gadolinium based CT contrast material. After the decomposition of the measured projection data into the basis component projections, conventional filtered-backprojection reconstruction is performed to obtain the basis-component images. In recent work, this basis component decomposition was obtained by maximizing the likelihood-function of the measurements. This procedure is time consuming and often unstable for excessively noisy data or low intrinsic energy resolution of the detector. Therefore, alternative procedures are of interest. Here, we introduce a generalization of the idea of empirical dual-energy processing published by Stenner et al. to multi-energy, photon-counting CT raw data. Instead of working in the image-domain, we use prior spectral knowledge about the acquisition system (tube spectra, bin sensitivities) to parameterize the line-integrals of the basis component decomposition directly in the projection domain. We compare this empirical approach with the maximum-likelihood (ML) approach considering image noise and image bias (artifacts) and see that only moderate noise increase is to be expected for small bias in the empirical approach. Given the drastic reduction of pre-processing time, the empirical approach is considered a viable alternative to the ML approach.

Signal processing method and system for noise removal and signal extraction

DOEpatents

Fu, Chi Yung; Petrich, Loren

2009-04-14

A signal processing method and system combining smooth level wavelet pre-processing together with artificial neural networks all in the wavelet domain for signal denoising and extraction. Upon receiving a signal corrupted with noise, an n-level decomposition of the signal is performed using a discrete wavelet transform to produce a smooth component and a rough component for each decomposition level. The n.sup.th level smooth component is then inputted into a corresponding neural network pre-trained to filter out noise in that component by pattern recognition in the wavelet domain. Additional rough components, beginning at the highest level, may also be retained and inputted into corresponding neural networks pre-trained to filter out noise in those components also by pattern recognition in the wavelet domain. In any case, an inverse discrete wavelet transform is performed on the combined output from all the neural networks to recover a clean signal back in the time domain.
Information Processing Research

DTIC Science & Technology

1992-01-03

structure of instances. Opal provides special graphical objects called "Ag- greGadgets" which are used to hold a collection of other objects (either...available in classes of expert systems tasks, re- late this to the structure of parallel production systems, and incorporate parallel-decomposition...Anantharaman et al. 88]. We designed a new pawn structure algorithm and upgraded the king-safety pattern recog- nizers, which contributed significantly
Implementation of a Parallel Kalman Filter for Stratospheric Chemical Tracer Assimilation

NASA Technical Reports Server (NTRS)

Chang, Lang-Ping; Lyster, Peter M.; Menard, R.; Cohn, S. E.

1998-01-01

A Kalman filter for the assimilation of long-lived atmospheric chemical constituents has been developed for two-dimensional transport models on isentropic surfaces over the globe. An important attribute of the Kalman filter is that it calculates error covariances of the constituent fields using the tracer dynamics. Consequently, the current Kalman-filter assimilation is a five-dimensional problem (coordinates of two points and time), and it can only be handled on computers with large memory and high floating point speed. In this paper, an implementation of the Kalman filter for distributed-memory, message-passing parallel computers is discussed. Two approaches were studied: an operator decomposition and a covariance decomposition. The latter was found to be more scalable than the former, and it possesses the property that the dynamical model does not need to be parallelized, which is of considerable practical advantage. This code is currently used to assimilate constituent data retrieved by limb sounders on the Upper Atmosphere Research Satellite. Tests of the code examined the variance transport and observability properties. Aspects of the parallel implementation, some timing results, and a brief discussion of the physical results will be presented.
Reaction behaviors of decomposition of monocrotophos in aqueous solution by UV and UV/O processes.

PubMed

Ku, Y; Wang, W; Shen, Y S

2000-02-01

The decomposition of monocrotophos (cis-3-dimethoxyphosphinyloxy-N-methyl-crotonamide) in aqueous solution by UV and UV/O(3) processes was studied. The experiments were carried out under various solution pH values to investigate the decomposition efficiencies of the reactant and organic intermediates in order to determine the completeness of decomposition. The photolytic decomposition rate of monocrotophos was increased with increasing solution pH because the solution pH affects the distribution and light absorbance of monocrotophos species. The combination of O(3) with UV light apparently promoted the decomposition and mineralization of monocrotophos in aqueous solution. For the UV/O(3) process, the breakage of the >C=C< bond of monocrotophos by ozone molecules was found to occur first, followed by mineralization by hydroxyl radicals to generate CO(3)(2-), PO4(3-), and NO(3)(-) anions in sequence. The quasi-global kinetics based on a simplified consecutive-parallel reaction scheme was developed to describe the temporal behavior of monocrotophos decomposition in aqueous solution by the UV/O(3) process.
On the parallel solution of parabolic equations

NASA Technical Reports Server (NTRS)

Gallopoulos, E.; Saad, Youcef

1989-01-01

Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on using polynomial approximation to the exponential. It does not require solving any linear systems and is highly parallelizable. The two other methods proposed are based on Pade and Chebyshev approximations to the matrix exponential. The parallelization of these methods is achieved by using partial fraction decomposition techniques to solve the resulting systems and thus offers the potential for increased time parallelism in time dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.
Multi-variants synthesis of Petri nets for FPGA devices

NASA Astrophysics Data System (ADS)

Bukowiec, Arkadiusz; Doligalski, Michał

2015-09-01

There is presented new method of synthesis of application specific logic controllers for FPGA devices. The specification of control algorithm is made with use of control interpreted Petri net (PT type). It allows specifying parallel processes in easy way. The Petri net is decomposed into state-machine type subnets. In this case, each subnet represents one parallel process. For this purpose there are applied algorithms of coloring of Petri nets. There are presented two approaches of such decomposition: with doublers of macroplaces or with one global wait place. Next, subnets are implemented into two-level logic circuit of the controller. The levels of logic circuit are obtained as a result of its architectural decomposition. The first level combinational circuit is responsible for generation of next places and second level decoder is responsible for generation output symbols. There are worked out two variants of such circuits: with one shared operational memory or with many flexible distributed memories as a decoder. Variants of Petri net decomposition and structures of logic circuits can be combined together without any restrictions. It leads to existence of four variants of multi-variants synthesis.
Parallel text rendering by a PostScript interpreter

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kritskii, S.P.; Zastavnoi, B.A.

1994-11-01

The most radical method of increasing the performance of devices controlled by PostScript interpreters may be the use of multiprocessor controllers. This paper presents a method for parallelizing the operation of a PostScript interpreter for rendering text. The proposed method is based on decomposition of the outlines of letters into horizontal strips covering equal areas. The subroutines thus obtained are distributed to the processors in a network and then filled in by conventional sequential algorithms. A special algorithm has been developed for dividing the outlines of characters into subroutines so that each may be colored independently of the others. Themore » algorithm uses special estimates for estimating the correct partition so that the corresponding outlines are divided into horizontal strips. A method is presented for finding such estimates. Two different processing approaches are presented. In the first, one of the processors performs the decomposition of the outlines and distributes the strips to the remaining processors, which are responsible for the rendering. In the second approach, the decomposition process is itself distributed among the processors in the network.« less
A Novel Multilevel-SVD Method to Improve Multistep Ahead Forecasting in Traffic Accidents Domain.

PubMed

Barba, Lida; Rodríguez, Nibaldo

2017-01-01

Here is proposed a novel method for decomposing a nonstationary time series in components of low and high frequency. The method is based on Multilevel Singular Value Decomposition (MSVD) of a Hankel matrix. The decomposition is used to improve the forecasting accuracy of Multiple Input Multiple Output (MIMO) linear and nonlinear models. Three time series coming from traffic accidents domain are used. They represent the number of persons with injuries in traffic accidents of Santiago, Chile. The data were continuously collected by the Chilean Police and were weekly sampled from 2000:1 to 2014:12. The performance of MSVD is compared with the decomposition in components of low and high frequency of a commonly accepted method based on Stationary Wavelet Transform (SWT). SWT in conjunction with the Autoregressive model (SWT + MIMO-AR) and SWT in conjunction with an Autoregressive Neural Network (SWT + MIMO-ANN) were evaluated. The empirical results have shown that the best accuracy was achieved by the forecasting model based on the proposed decomposition method MSVD, in comparison with the forecasting models based on SWT.
A Novel Multilevel-SVD Method to Improve Multistep Ahead Forecasting in Traffic Accidents Domain

PubMed Central

Rodríguez, Nibaldo

2017-01-01

Here is proposed a novel method for decomposing a nonstationary time series in components of low and high frequency. The method is based on Multilevel Singular Value Decomposition (MSVD) of a Hankel matrix. The decomposition is used to improve the forecasting accuracy of Multiple Input Multiple Output (MIMO) linear and nonlinear models. Three time series coming from traffic accidents domain are used. They represent the number of persons with injuries in traffic accidents of Santiago, Chile. The data were continuously collected by the Chilean Police and were weekly sampled from 2000:1 to 2014:12. The performance of MSVD is compared with the decomposition in components of low and high frequency of a commonly accepted method based on Stationary Wavelet Transform (SWT). SWT in conjunction with the Autoregressive model (SWT + MIMO-AR) and SWT in conjunction with an Autoregressive Neural Network (SWT + MIMO-ANN) were evaluated. The empirical results have shown that the best accuracy was achieved by the forecasting model based on the proposed decomposition method MSVD, in comparison with the forecasting models based on SWT. PMID:28261267
First Applications of the New Parallel Krylov Solver for MODFLOW on a National and Global Scale

NASA Astrophysics Data System (ADS)

Verkaik, J.; Hughes, J. D.; Sutanudjaja, E.; van Walsum, P.

2016-12-01

Integrated high-resolution hydrologic models are increasingly being used for evaluating water management measures at field scale. Their drawbacks are large memory requirements and long run times. Examples of such models are The Netherlands Hydrological Instrument (NHI) model and the PCRaster Global Water Balance (PCR-GLOBWB) model. Typical simulation periods are 30-100 years with daily timesteps. The NHI model predicts water demands in periods of drought, supporting operational and long-term water-supply decisions. The NHI is a state-of-the-art coupling of several models: a 7-layer MODFLOW groundwater model ( 6.5M 250m cells), a MetaSWAP model for the unsaturated zone (Richards emulator of 0.5M cells), and a surface water model (MOZART-DM). The PCR-GLOBWB model provides a grid-based representation of global terrestrial hydrology and this work uses the version that includes a 2-layer MODFLOW groundwater model ( 4.5M 10km cells). The Parallel Krylov Solver (PKS) speeds up computation by both distributed memory parallelization (Message Passing Interface) and shared memory parallelization (Open Multi-Processing). PKS includes conjugate gradient, bi-conjugate gradient stabilized, and generalized minimal residual linear accelerators that use an overlapping additive Schwarz domain decomposition preconditioner. PKS can be used for both structured and unstructured grids and has been fully integrated in MODFLOW-USG using METIS partitioning and in iMODFLOW using RCB partitioning. iMODFLOW is an accelerated version of MODFLOW-2005 that is implicitly and online coupled to MetaSWAP. Results for benchmarks carried out on the Cartesius Dutch supercomputer (https://userinfo.surfsara.nl/systems/cartesius) for the PCRGLOB-WB model and on a 2x16 core Windows machine for the NHI model show speedups up to 10-20 and 5-10, respectively.
Scalable direct Vlasov solver with discontinuous Galerkin method on unstructured mesh.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Xu, J.; Ostroumov, P. N.; Mustapha, B.

2010-12-01

This paper presents the development of parallel direct Vlasov solvers with discontinuous Galerkin (DG) method for beam and plasma simulations in four dimensions. Both physical and velocity spaces are in two dimesions (2P2V) with unstructured mesh. Contrary to the standard particle-in-cell (PIC) approach for kinetic space plasma simulations, i.e., solving Vlasov-Maxwell equations, direct method has been used in this paper. There are several benefits to solving a Vlasov equation directly, such as avoiding noise associated with a finite number of particles and the capability to capture fine structure in the plasma. The most challanging part of a direct Vlasov solvermore » comes from higher dimensions, as the computational cost increases as N{sup 2d}, where d is the dimension of the physical space. Recently, due to the fast development of supercomputers, the possibility has become more realistic. Many efforts have been made to solve Vlasov equations in low dimensions before; now more interest has focused on higher dimensions. Different numerical methods have been tried so far, such as the finite difference method, Fourier Spectral method, finite volume method, and spectral element method. This paper is based on our previous efforts to use the DG method. The DG method has been proven to be very successful in solving Maxwell equations, and this paper is our first effort in applying the DG method to Vlasov equations. DG has shown several advantages, such as local mass matrix, strong stability, and easy parallelization. These are particularly suitable for Vlasov equations. Domain decomposition in high dimensions has been used for parallelization; these include a highly scalable parallel two-dimensional Poisson solver. Benchmark results have been shown and simulation results will be reported.« less
Advancing MODFLOW Applying the Derived Vector Space Method

NASA Astrophysics Data System (ADS)

Herrera, G. S.; Herrera, I.; Lemus-García, M.; Hernandez-Garcia, G. D.

2015-12-01

The most effective domain decomposition methods (DDM) are non-overlapping DDMs. Recently a new approach, the DVS-framework, based on an innovative discretization method that uses a non-overlapping system of nodes (the derived-nodes), was introduced and developed by I. Herrera et al. [1, 2]. Using the DVS-approach a group of four algorithms, referred to as the 'DVS-algorithms', which fulfill the DDM-paradigm (i.e. the solution of global problems is obtained by resolution of local problems exclusively) has been derived. Such procedures are applicable to any boundary-value problem, or system of such equations, for which a standard discretization method is available and then software with a high degree of parallelization can be constructed. In a parallel talk, in this AGU Fall Meeting, Ismael Herrera will introduce the general DVS methodology. The application of the DVS-algorithms has been demonstrated in the solution of several boundary values problems of interest in Geophysics. Numerical examples for a single-equation, for the cases of symmetric, non-symmetric and indefinite problems were demonstrated before [1,2]. For these problems DVS-algorithms exhibited significantly improved numerical performance with respect to standard versions of DDM algorithms. In view of these results our research group is in the process of applying the DVS method to a widely used simulator for the first time, here we present the advances of the application of this method for the parallelization of MODFLOW. Efficiency results for a group of tests will be presented. References [1] I. Herrera, L.M. de la Cruz and A. Rosas-Medina. Non overlapping discretization methods for partial differential equations, Numer Meth Part D E, (2013). [2] Herrera, I., & Contreras Iván "An Innovative Tool for Effectively Applying Highly Parallelized Software To Problems of Elasticity". Geofísica Internacional, 2015 (In press)
A simple hyperbolic model for communication in parallel processing environments

NASA Technical Reports Server (NTRS)

Stoica, Ion; Sultan, Florin; Keyes, David

1994-01-01

We introduce a model for communication costs in parallel processing environments called the 'hyperbolic model,' which generalizes two-parameter dedicated-link models in an analytically simple way. Dedicated interprocessor links parameterized by a latency and a transfer rate that are independent of load are assumed by many existing communication models; such models are unrealistic for workstation networks. The communication system is modeled as a directed communication graph in which terminal nodes represent the application processes that initiate the sending and receiving of the information and in which internal nodes, called communication blocks (CBs), reflect the layered structure of the underlying communication architecture. The direction of graph edges specifies the flow of the information carried through messages. Each CB is characterized by a two-parameter hyperbolic function of the message size that represents the service time needed for processing the message. The parameters are evaluated in the limits of very large and very small messages. Rules are given for reducing a communication graph consisting of many to an equivalent two-parameter form, while maintaining an approximation for the service time that is exact in both large and small limits. The model is validated on a dedicated Ethernet network of workstations by experiments with communication subprograms arising in scientific applications, for which a tight fit of the model predictions with actual measurements of the communication and synchronization time between end processes is demonstrated. The model is then used to evaluate the performance of two simple parallel scientific applications from partial differential equations: domain decomposition and time-parallel multigrid. In an appropriate limit, we also show the compatibility of the hyperbolic model with the recently proposed LogP model.
Aerodynamic Shape Optimization of Supersonic Aircraft Configurations via an Adjoint Formulation on Parallel Computers

NASA Technical Reports Server (NTRS)

Reuther, James; Alonso, Juan Jose; Rimlinger, Mark J.; Jameson, Antony

1996-01-01

This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. The design process is greatly accelerated through the use of both control theory and a parallel implementation on distributed memory computers. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods (13, 12, 44, 38). The resulting problem is then implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) Standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on higher order computational fluid dynamics methods (CFD). In our earlier studies, the serial implementation of this design method (19, 20, 21, 23, 39, 25, 40, 41, 42, 43, 9) was shown to be effective for the optimization of airfoils, wings, wing-bodies, and complex aircraft configurations using both the potential equation and the Euler equations (39, 25). In our most recent paper, the Euler method was extended to treat complete aircraft configurations via a new multiblock implementation. Furthermore, during the same conference, we also presented preliminary results demonstrating that the basic methodology could be ported to distributed memory parallel computing architectures [241. In this paper, our concem will be to demonstrate that the combined power of these new technologies can be used routinely in an industrial design environment by applying it to the case study of the design of typical supersonic transport configurations. A particular difficulty of this test case is posed by the propulsion/airframe integration.
A multithreaded and GPU-optimized compact finite difference algorithm for turbulent mixing at high Schmidt number using petascale computing

NASA Astrophysics Data System (ADS)

Clay, M. P.; Yeung, P. K.; Buaria, D.; Gotoh, T.

2017-11-01

Turbulent mixing at high Schmidt number is a multiscale problem which places demanding requirements on direct numerical simulations to resolve fluctuations down the to Batchelor scale. We use a dual-grid, dual-scheme and dual-communicator approach where velocity and scalar fields are computed by separate groups of parallel processes, the latter using a combined compact finite difference (CCD) scheme on finer grid with a static 3-D domain decomposition free of the communication overhead of memory transposes. A high degree of scalability is achieved for a 81923 scalar field at Schmidt number 512 in turbulence with a modest inertial range, by overlapping communication with computation whenever possible. On the Cray XE6 partition of Blue Waters, use of a dedicated thread for communication combined with OpenMP locks and nested parallelism reduces CCD timings by 34% compared to an MPI baseline. The code has been further optimized for the 27-petaflops Cray XK7 machine Titan using GPUs as accelerators with the latest OpenMP 4.5 directives, giving 2.7X speedup compared to CPU-only execution at the largest problem size. Supported by NSF Grant ACI-1036170, the NCSA Blue Waters Project with subaward via UIUC, and a DOE INCITE allocation at ORNL.
Scalable Metropolis Monte Carlo for simulation of hard shapes

NASA Astrophysics Data System (ADS)

Anderson, Joshua A.; Eric Irrgang, M.; Glotzer, Sharon C.

2016-07-01

We design and implement a scalable hard particle Monte Carlo simulation toolkit (HPMC), and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, and optimize inner loops with SIMD vector intrinsics on the CPU. Our GPU kernel proposes many trial moves in parallel on a checkerboard and uses a block-level queue to redistribute work among threads and avoid divergence. HPMC supports a wide variety of shape classes, including spheres/disks, unions of spheres, convex polygons, convex spheropolygons, concave polygons, ellipsoids/ellipses, convex polyhedra, convex spheropolyhedra, spheres cut by planes, and concave polyhedra. NVT and NPT ensembles can be run in 2D or 3D triclinic boxes. Additional integration schemes permit Frenkel-Ladd free energy computations and implicit depletant simulations. In a benchmark system of a fluid of 4096 pentagons, HPMC performs 10 million sweeps in 10 min on 96 CPU cores on XSEDE Comet. The same simulation would take 7.6 h in serial. HPMC also scales to large system sizes, and the same benchmark with 16.8 million particles runs in 1.4 h on 2048 GPUs on OLCF Titan.
ICASE Computer Science Program

NASA Technical Reports Server (NTRS)

1985-01-01

The Institute for Computer Applications in Science and Engineering computer science program is discussed in outline form. Information is given on such topics as problem decomposition, algorithm development, programming languages, and parallel architectures.
A divide-conquer-recombine algorithmic paradigm for large spatiotemporal quantum molecular dynamics simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shimojo, Fuyuki; Hattori, Shinnosuke; Department of Physics, Kumamoto University, Kumamoto 860-8555

We introduce an extension of the divide-and-conquer (DC) algorithmic paradigm called divide-conquer-recombine (DCR) to perform large quantum molecular dynamics (QMD) simulations on massively parallel supercomputers, in which interatomic forces are computed quantum mechanically in the framework of density functional theory (DFT). In DCR, the DC phase constructs globally informed, overlapping local-domain solutions, which in the recombine phase are synthesized into a global solution encompassing large spatiotemporal scales. For the DC phase, we design a lean divide-and-conquer (LDC) DFT algorithm, which significantly reduces the prefactor of the O(N) computational cost for N electrons by applying a density-adaptive boundary condition at themore » peripheries of the DC domains. Our globally scalable and locally efficient solver is based on a hybrid real-reciprocal space approach that combines: (1) a highly scalable real-space multigrid to represent the global charge density; and (2) a numerically efficient plane-wave basis for local electronic wave functions and charge density within each domain. Hybrid space-band decomposition is used to implement the LDC-DFT algorithm on parallel computers. A benchmark test on an IBM Blue Gene/Q computer exhibits an isogranular parallel efficiency of 0.984 on 786 432 cores for a 50.3 × 10{sup 6}-atom SiC system. As a test of production runs, LDC-DFT-based QMD simulation involving 16 661 atoms is performed on the Blue Gene/Q to study on-demand production of hydrogen gas from water using LiAl alloy particles. As an example of the recombine phase, LDC-DFT electronic structures are used as a basis set to describe global photoexcitation dynamics with nonadiabatic QMD (NAQMD) and kinetic Monte Carlo (KMC) methods. The NAQMD simulations are based on the linear response time-dependent density functional theory to describe electronic excited states and a surface-hopping approach to describe transitions between the excited states. A series of techniques are employed for efficiently calculating the long-range exact exchange correction and excited-state forces. The NAQMD trajectories are analyzed to extract the rates of various excitonic processes, which are then used in KMC simulation to study the dynamics of the global exciton flow network. This has allowed the study of large-scale photoexcitation dynamics in 6400-atom amorphous molecular solid, reaching the experimental time scales.« less
Origin of the Chemical and Kinetic Stability of Graphene Oxide

PubMed Central

Zhou, Si; Bongiorno, Angelo

2013-01-01

At moderate temperatures (≤ 70°C), thermal reduction of graphene oxide is inefficient and after its synthesis the material enters in a metastable state. Here, first-principles and statistical calculations are used to investigate both the low-temperature processes leading to decomposition of graphene oxide and the role of ageing on the structure and stability of this material. Our study shows that the key factor underlying the stability of graphene oxide is the tendency of the oxygen functionalities to agglomerate and form highly oxidized domains surrounded by areas of pristine graphene. Within the agglomerates of functional groups, the primary decomposition reactions are hindered by both geometrical and energetic factors. The number of reacting sites is reduced by the occurrence of local order in the oxidized domains, and due to the close packing of the oxygen functionalities, the decomposition reactions become – on average – endothermic by more than 0.6 eV. PMID:23963517
Origin of the chemical and kinetic stability of graphene oxide.

PubMed

Zhou, Si; Bongiorno, Angelo

2013-01-01

At moderate temperatures (≤ 70°C), thermal reduction of graphene oxide is inefficient and after its synthesis the material enters in a metastable state. Here, first-principles and statistical calculations are used to investigate both the low-temperature processes leading to decomposition of graphene oxide and the role of ageing on the structure and stability of this material. Our study shows that the key factor underlying the stability of graphene oxide is the tendency of the oxygen functionalities to agglomerate and form highly oxidized domains surrounded by areas of pristine graphene. Within the agglomerates of functional groups, the primary decomposition reactions are hindered by both geometrical and energetic factors. The number of reacting sites is reduced by the occurrence of local order in the oxidized domains, and due to the close packing of the oxygen functionalities, the decomposition reactions become - on average - endothermic by more than 0.6 eV.

Parallel processing in finite element structural analysis

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1987-01-01

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Parallel eigenanalysis of finite element models in a completely connected architecture

NASA Technical Reports Server (NTRS)

Akl, F. A.; Morel, M. R.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is order of q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in parallel environment is analyzed.
A non-invasive implementation of a mixed domain decomposition method for frictional contact problems

NASA Astrophysics Data System (ADS)

Oumaziz, Paul; Gosselet, Pierre; Boucard, Pierre-Alain; Guinard, Stéphane

2017-11-01

A non-invasive implementation of the Latin domain decomposition method for frictional contact problems is described. The formulation implies to deal with mixed (Robin) conditions on the faces of the subdomains, which is not a classical feature of commercial software. Therefore we propose a new implementation of the linear stage of the Latin method with a non-local search direction built as the stiffness of a layer of elements on the interfaces. This choice enables us to implement the method within the open source software Code_Aster, and to derive 2D and 3D examples with similar performance as the standard Latin method.
Domain Decomposition Algorithms for First-Order System Least Squares Methods

NASA Technical Reports Server (NTRS)

Pavarino, Luca F.

1996-01-01

Least squares methods based on first-order systems have been recently proposed and analyzed for second-order elliptic equations and systems. They produce symmetric and positive definite discrete systems by using standard finite element spaces, which are not required to satisfy the inf-sup condition. In this paper, several domain decomposition algorithms for these first-order least squares methods are studied. Some representative overlapping and substructuring algorithms are considered in their additive and multiplicative variants. The theoretical and numerical results obtained show that the classical convergence bounds (on the iteration operator) for standard Galerkin discretizations are also valid for least squares methods.
Structural considerations for functional anti-EGFR × anti-CD3 bispecific diabodies in light of domain order and binding affinity.

PubMed

Asano, Ryutaro; Nagai, Keisuke; Makabe, Koki; Takahashi, Kento; Kumagai, Takashi; Kawaguchi, Hiroko; Ogata, Hiromi; Arai, Kyoko; Umetsu, Mitsuo; Kumagai, Izumi

2018-03-02

We previously reported a functional humanized bispecific diabody (bsDb) that targeted EGFR and CD3 (hEx3-Db) and enhancement of its cytotoxicity by rearranging the domain order in the V domain. Here, we further dissected the effect of domain order in bsDbs on their cross-linking ability and binding kinetics to elucidate general rules regarding the design of functional bsDbs. Using Ex3-Db as a model system, we first classified the four possible domain orders as anti-parallel (where both chimeric single-chain components are variable heavy domain (VH)-variable light domain (VL) or VL-VH order) and parallel types (both chimeric single-chain components are mixed with VH-VL and VL-VH order). Although anti-parallel Ex3-Dbs could cross-link the soluble target antigens, their cross-linking ability between soluble targets had no correlation with their growth inhibitory effects. In contrast, the binding affinity of one of the two constructs with a parallel-arrangement V domain was particularly low, and structural modeling supported this phenomenon. Similar results were observed with E2x3-Dbs, in which the V region of the anti-EGFR antibody clone in hEx3 was replaced with that of another anti-EGFR clone. Only anti-parallel types showed affinity-dependent cancer inhibitory effects in each molecule, and E2x3-LH (both components in VL-VH order) showed the most intense anti-tumor activity in vitro and in vivo . Our results showed that, in addition to rearranging the domain order of bsDbs, increasing their binding affinity may be an ideal strategy for enhancing the cytotoxicity of anti-parallel constructs and that E2x3-LH is particularly attractive as a candidate next-generation anti-cancer drug.
Parallel distributed, reciprocal Monte Carlo radiation in coupled, large eddy combustion simulations

NASA Astrophysics Data System (ADS)

Hunsaker, Isaac L.

Radiation is the dominant mode of heat transfer in high temperature combustion environments. Radiative heat transfer affects the gas and particle phases, including all the associated combustion chemistry. The radiative properties are in turn affected by the turbulent flow field. This bi-directional coupling of radiation turbulence interactions poses a major challenge in creating parallel-capable, high-fidelity combustion simulations. In this work, a new model was developed in which reciprocal monte carlo radiation was coupled with a turbulent, large-eddy simulation combustion model. A technique wherein domain patches are stitched together was implemented to allow for scalable parallelism. The combustion model runs in parallel on a decomposed domain. The radiation model runs in parallel on a recomposed domain. The recomposed domain is stored on each processor after information sharing of the decomposed domain is handled via the message passing interface. Verification and validation testing of the new radiation model were favorable. Strong scaling analyses were performed on the Ember cluster and the Titan cluster for the CPU-radiation model and GPU-radiation model, respectively. The model demonstrated strong scaling to over 1,700 and 16,000 processing cores on Ember and Titan, respectively.
Computational Challenges of 3D Radiative Transfer in Atmospheric Models

NASA Astrophysics Data System (ADS)

Jakub, Fabian; Bernhard, Mayer

2017-04-01

The computation of radiative heating and cooling rates is one of the most expensive components in todays atmospheric models. The high computational cost stems not only from the laborious integration over a wide range of the electromagnetic spectrum but also from the fact that solving the integro-differential radiative transfer equation for monochromatic light is already rather involved. This lead to the advent of numerous approximations and parameterizations to reduce the cost of the solver. One of the most prominent one is the so called independent pixel approximations (IPA) where horizontal energy transfer is neglected whatsoever and radiation may only propagate in the vertical direction (1D). Recent studies implicate that the IPA introduces significant errors in high resolution simulations and affects the evolution and development of convective systems. However, using fully 3D solvers such as for example MonteCarlo methods is not even on state of the art supercomputers feasible. The parallelization of atmospheric models is often realized by a horizontal domain decomposition, and hence, horizontal transfer of energy necessitates communication. E.g. a cloud's shadow at a low zenith angle will cast a long shadow and potentially needs to communication through a multitude of processors. Especially light in the solar spectral range may travel long distances through the atmosphere. Concerning highly parallel simulations, it is vital that 3D radiative transfer solvers put a special emphasis on parallel scalability. We will present an introduction to intricacies computing 3D radiative heating and cooling rates as well as report on the parallel performance of the TenStream solver. The TenStream is a 3D radiative transfer solver using the PETSc framework to iteratively solve a set of partial differential equation. We investigate two matrix preconditioners, (a) geometric algebraic multigrid preconditioning(MG+GAMG) and (b) block Jacobi incomplete LU (ILU) factorization. The TenStream solver is tested for up to 4096 cores and shows a parallel scaling efficiency of 80-90% on various supercomputers.
Linear scaling computation of the Fock matrix. VI. Data parallel computation of the exchange-correlation matrix

NASA Astrophysics Data System (ADS)

Gan, Chee Kwan; Challacombe, Matt

2003-05-01

Recently, early onset linear scaling computation of the exchange-correlation matrix has been achieved using hierarchical cubature [J. Chem. Phys. 113, 10037 (2000)]. Hierarchical cubature differs from other methods in that the integration grid is adaptive and purely Cartesian, which allows for a straightforward domain decomposition in parallel computations; the volume enclosing the entire grid may be simply divided into a number of nonoverlapping boxes. In our data parallel approach, each box requires only a fraction of the total density to perform the necessary numerical integrations due to the finite extent of Gaussian-orbital basis sets. This inherent data locality may be exploited to reduce communications between processors as well as to avoid memory and copy overheads associated with data replication. Although the hierarchical cubature grid is Cartesian, naive boxing leads to irregular work loads due to strong spatial variations of the grid and the electron density. In this paper we describe equal time partitioning, which employs time measurement of the smallest sub-volumes (corresponding to the primitive cubature rule) to load balance grid-work for the next self-consistent-field iteration. After start-up from a heuristic center of mass partitioning, equal time partitioning exploits smooth variation of the density and grid between iterations to achieve load balance. With the 3-21G basis set and a medium quality grid, equal time partitioning applied to taxol (62 heavy atoms) attained a speedup of 61 out of 64 processors, while for a 110 molecule water cluster at standard density it achieved a speedup of 113 out of 128. The efficiency of equal time partitioning applied to hierarchical cubature improves as the grid work per processor increases. With a fine grid and the 6-311G(df,p) basis set, calculations on the 26 atom molecule α-pinene achieved a parallel efficiency better than 99% with 64 processors. For more coarse grained calculations, superlinear speedups are found to result from reduced computational complexity associated with data parallelism.
Using CLIPS in the domain of knowledge-based massively parallel programming

NASA Technical Reports Server (NTRS)

Dvorak, Jiri J.

1994-01-01

The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency in respect of parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering are discussed.
Nonequilibrium adiabatic molecular dynamics simulations of methane clathrate hydrate decomposition

NASA Astrophysics Data System (ADS)

Alavi, Saman; Ripmeester, J. A.

2010-04-01

Nonequilibrium, constant energy, constant volume (NVE) molecular dynamics simulations are used to study the decomposition of methane clathrate hydrate in contact with water. Under adiabatic conditions, the rate of methane clathrate decomposition is affected by heat and mass transfer arising from the breakup of the clathrate hydrate framework and release of the methane gas at the solid-liquid interface and diffusion of methane through water. We observe that temperature gradients are established between the clathrate and solution phases as a result of the endothermic clathrate decomposition process and this factor must be considered when modeling the decomposition process. Additionally we observe that clathrate decomposition does not occur gradually with breakup of individual cages, but rather in a concerted fashion with rows of structure I cages parallel to the interface decomposing simultaneously. Due to the concerted breakup of layers of the hydrate, large amounts of methane gas are released near the surface which can form bubbles that will greatly affect the rate of mass transfer near the surface of the clathrate phase. The effects of these phenomena on the rate of methane hydrate decomposition are determined and implications on hydrate dissociation in natural methane hydrate reservoirs are discussed.
Nonequilibrium adiabatic molecular dynamics simulations of methane clathrate hydrate decomposition.

PubMed

Alavi, Saman; Ripmeester, J A

2010-04-14

Nonequilibrium, constant energy, constant volume (NVE) molecular dynamics simulations are used to study the decomposition of methane clathrate hydrate in contact with water. Under adiabatic conditions, the rate of methane clathrate decomposition is affected by heat and mass transfer arising from the breakup of the clathrate hydrate framework and release of the methane gas at the solid-liquid interface and diffusion of methane through water. We observe that temperature gradients are established between the clathrate and solution phases as a result of the endothermic clathrate decomposition process and this factor must be considered when modeling the decomposition process. Additionally we observe that clathrate decomposition does not occur gradually with breakup of individual cages, but rather in a concerted fashion with rows of structure I cages parallel to the interface decomposing simultaneously. Due to the concerted breakup of layers of the hydrate, large amounts of methane gas are released near the surface which can form bubbles that will greatly affect the rate of mass transfer near the surface of the clathrate phase. The effects of these phenomena on the rate of methane hydrate decomposition are determined and implications on hydrate dissociation in natural methane hydrate reservoirs are discussed.
Multigrid treatment of implicit continuum diffusion

NASA Astrophysics Data System (ADS)

Francisquez, Manaure; Zhu, Ben; Rogers, Barrett

2017-10-01

Implicit treatment of diffusive terms of various differential orders common in continuum mechanics modeling, such as computational fluid dynamics, is investigated with spectral and multigrid algorithms in non-periodic 2D domains. In doubly periodic time dependent problems these terms can be efficiently and implicitly handled by spectral methods, but in non-periodic systems solved with distributed memory parallel computing and 2D domain decomposition, this efficiency is lost for large numbers of processors. We built and present here a multigrid algorithm for these types of problems which outperforms a spectral solution that employs the highly optimized FFTW library. This multigrid algorithm is not only suitable for high performance computing but may also be able to efficiently treat implicit diffusion of arbitrary order by introducing auxiliary equations of lower order. We test these solvers for fourth and sixth order diffusion with idealized harmonic test functions as well as a turbulent 2D magnetohydrodynamic simulation. It is also shown that an anisotropic operator without cross-terms can improve model accuracy and speed, and we examine the impact that the various diffusion operators have on the energy, the enstrophy, and the qualitative aspect of a simulation. This work was supported by DOE-SC-0010508. This research used resources of the National Energy Research Scientific Computing Center (NERSC).
QR-decomposition based SENSE reconstruction using parallel architecture.

PubMed

Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

2018-04-01

Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.
A fast new algorithm for a robot neurocontroller using inverse QR decomposition

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morris, A.S.; Khemaissia, S.

2000-01-01

A new adaptive neural network controller for robots is presented. The controller is based on direct adaptive techniques. Unlike many neural network controllers in the literature, inverse dynamical model evaluation is not required. A numerically robust, computationally efficient processing scheme for neutral network weight estimation is described, namely, the inverse QR decomposition (INVQR). The inverse QR decomposition and a weighted recursive least-squares (WRLS) method for neural network weight estimation is derived using Cholesky factorization of the data matrix. The algorithm that performs the efficient INVQR of the underlying space-time data matrix may be implemented in parallel on a triangular array.more » Furthermore, its systolic architecture is well suited for VLSI implementation. Another important benefit is well suited for VLSI implementation. Another important benefit of the INVQR decomposition is that it solves directly for the time-recursive least-squares filter vector, while avoiding the sequential back-substitution step required by the QR decomposition approaches.« less
Simplified Parallel Domain Traversal

DOE Office of Scientific and Technical Information (OSTI.GOV)

Erickson III, David J

2011-01-01

Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep bymore » performing teleconnection analysis across ensemble runs of terascale atmospheric CO{sub 2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.« less
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, F.; Morel, M.

1989-01-01

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.
Application of CHAD hydrodynamics to shock-wave problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Trease, H.E.; O`Rourke, P.J.; Sahota, M.S.

1997-12-31

CHAD is the latest in a sequence of continually evolving computer codes written to effectively utilize massively parallel computer architectures and the latest grid generators for unstructured meshes. Its applications range from automotive design issues such as in-cylinder and manifold flows of internal combustion engines, vehicle aerodynamics, underhood cooling and passenger compartment heating, ventilation, and air conditioning to shock hydrodynamics and materials modeling. CHAD solves the full unsteady Navier-Stoke equations with the k-epsilon turbulence model in three space dimensions. The code has four major features that distinguish it from the earlier KIVA code, also developed at Los Alamos. First, itmore » is based on a node-centered, finite-volume method in which, like finite element methods, all fluid variables are located at computational nodes. The computational mesh efficiently and accurately handles all element shapes ranging from tetrahedra to hexahedra. Second, it is written in standard Fortran 90 and relies on automatic domain decomposition and a universal communication library written in standard C and MPI for unstructured grids to effectively exploit distributed-memory parallel architectures. Thus the code is fully portable to a variety of computing platforms such as uniprocessor workstations, symmetric multiprocessors, clusters of workstations, and massively parallel platforms. Third, CHAD utilizes a variable explicit/implicit upwind method for convection that improves computational efficiency in flows that have large velocity Courant number variations due to velocity of mesh size variations. Fourth, CHAD is designed to also simulate shock hydrodynamics involving multimaterial anisotropic behavior under high shear. The authors will discuss CHAD capabilities and show several sample calculations showing the strengths and weaknesses of CHAD.« less
RMG An Open Source Electronic Structure Code for Multi-Petaflops Calculations

NASA Astrophysics Data System (ADS)

Briggs, Emil; Lu, Wenchang; Hodak, Miroslav; Bernholc, Jerzy

RMG (Real-space Multigrid) is an open source, density functional theory code for quantum simulations of materials. It solves the Kohn-Sham equations on real-space grids, which allows for natural parallelization via domain decomposition. Either subspace or Davidson diagonalization, coupled with multigrid methods, are used to accelerate convergence. RMG is a cross platform open source package which has been used in the study of a wide range of systems, including semiconductors, biomolecules, and nanoscale electronic devices. It can optionally use GPU accelerators to improve performance on systems where they are available. The recently released versions (>2.0) support multiple GPU's per compute node, have improved performance and scalability, enhanced accuracy and support for additional hardware platforms. New versions of the code are regularly released at http://www.rmgdft.org. The releases include binaries for Linux, Windows and MacIntosh systems, automated builds for clusters using cmake, as well as versions adapted to the major supercomputing installations and platforms. Several recent, large-scale applications of RMG will be discussed.
An Object-Oriented Serial DSMC Simulation Package

NASA Astrophysics Data System (ADS)

Liu, Hongli; Cai, Chunpei

2011-05-01

A newly developed three-dimensional direct simulation Monte Carlo (DSMC) simulation package, named GRASP ("Generalized Rarefied gAs Simulation Package"), is reported in this paper. This package utilizes the concept of simulation engine, many C++ features and software design patterns. The package has an open architecture which can benefit further development and maintenance of the code. In order to reduce the engineering time for three-dimensional models, a hybrid grid scheme, combined with a flexible data structure compiled by C++ language, are implemented in this package. This scheme utilizes a local data structure based on the computational cell to achieve high performance on workstation processors. This data structure allows the DSMC algorithm to be very efficiently parallelized with domain decomposition and it provides much flexibility in terms of grid types. This package can utilize traditional structured, unstructured or hybrid grids within the framework of a single code to model arbitrarily complex geometries and to simulate rarefied gas flows. Benchmark test cases indicate that this package has satisfactory accuracy for complex rarefied gas flows.
Application of a Modular Particle-Continuum Method to Partially Rarefied, Hypersonic Flow

NASA Astrophysics Data System (ADS)

Deschenes, Timothy R.; Boyd, Iain D.

2011-05-01

The Modular Particle-Continuum (MPC) method is used to simulate partially-rarefied, hypersonic flow over a sting-mounted planetary probe configuration. This hybrid method uses computational fluid dynamics (CFD) to solve the Navier-Stokes equations in regions that are continuum, while using direct simulation Monte Carlo (DSMC) in portions of the flow that are rarefied. The MPC method uses state-based coupling to pass information between the two flow solvers and decouples both time-step and mesh densities required by each solver. It is parallelized for distributed memory systems using dynamic domain decomposition and internal energy modes can be consistently modeled to be out of equilibrium with the translational mode in both solvers. The MPC results are compared to both full DSMC and CFD predictions and available experimental measurements. By using DSMC in only regions where the flow is nonequilibrium, the MPC method is able to reproduce full DSMC results down to the level of velocity and rotational energy probability density functions while requiring a fraction of the computational time.

PpoC from Aspergillus nidulans is a fusion protein with only one active haem.

PubMed

Brodhun, Florian; Schneider, Stefan; Göbel, Cornelia; Hornung, Ellen; Feussner, Ivo

2010-01-15

In Aspergillus nidulans Ppos [psi (precocious sexual inducer)-producing oxygenases] are required for the production of so-called psi factors, compounds that control the balance between the sexual and asexual life cycle of the fungus. The genome of A. nidulans harbours three different ppo genes: ppoA, ppoB and ppoC. For all three enzymes two different haem-containing domains are predicted: a fatty acid haem peroxidase/dioxygenase domain in the N-terminal region and a P450 haem-thiolate domain in the C-terminal region. Whereas PpoA was shown to use both haem domains for its bifunctional catalytic activity (linoleic acid 8-dioxygenation and 8-hydroperoxide isomerization), we found that PpoC apparently only harbours a functional haem peroxidase/dioxygenase domain. Consequently, we observed that PpoC catalyses mainly the dioxygenation of linoleic acid (18:2Delta9Z,12Z), yielding 10-HPODE (10-hydroperoxyoctadecadienoic acid). No isomerase activity was detected. Additionally, 10-HPODE was converted at lower rates into 10-KODE (10-keto-octadecadienoic acid) and 10-HODE (10-hydroxyoctadecadienoic acid). In parallel, decomposition of 10-HPODE into 10-ODA (10-octadecynoic acid) and volatile C-8 alcohols that are, among other things, responsible for the characteristic mushroom flavour. Besides these principle differences we also found that PpoA and PpoC can convert 8-HPODE and 10-HPODE into the respective epoxy alcohols: 12,13-epoxy-8-HOME (where HOME is hydroxyoctadecenoic acid) and 12,13-epoxy-10-HOME. By using site-directed mutagenesis we demonstrated that both enzymes share a similar mechanism for the oxidation of 18:2Delta9Z,12Z; they both use a conserved tyrosine residue for catalysis and the directed oxygenation at the C-8 and C-10 is most likely controlled by conserved valine/leucine residues in the dioxygenase domain.
Efficient calculation of full waveform time domain inversion for electromagnetic problem using fictitious wave domain method and cascade decimation decomposition

NASA Astrophysics Data System (ADS)

Imamura, N.; Schultz, A.

2016-12-01

Recently, a full waveform time domain inverse solution has been developed for the magnetotelluric (MT) and controlled-source electromagnetic (CSEM) methods. The ultimate goal of this approach is to obtain a computationally tractable direct waveform joint inversion to solve simultaneously for source fields and earth conductivity structure in three and four dimensions. This is desirable on several grounds, including the improved spatial resolving power expected from use of a multitude of source illuminations, the ability to operate in areas of high levels of source signal spatial complexity, and non-stationarity. This goal would not be obtainable if one were to adopt the pure time domain solution for the inverse problem. This is particularly true for the case of MT surveys, since an enormous number of degrees of freedom are required to represent the observed MT waveforms across a large frequency bandwidth. This means that for the forward simulation, the smallest time steps should be finer than that required to represent the highest frequency, while the number of time steps should also cover the lowest frequency. This leads to a sensitivity matrix that is computationally burdensome to solve a model update. We have implemented a code that addresses this situation through the use of cascade decimation decomposition to reduce the size of the sensitivity matrix substantially, through quasi-equivalent time domain decomposition. We also use a fictitious wave domain method to speed up computation time of the forward simulation in the time domain. By combining these refinements, we have developed a full waveform joint source field/earth conductivity inverse modeling method. We found that cascade decimation speeds computations of the sensitivity matrices dramatically, keeping the solution close to that of the undecimated case. For example, for a model discretized into 2.6x105 cells, we obtain model updates in less than 1 hour on a 4U rack-mounted workgroup Linux server, which is a practical computational time for the inverse problem.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Niu, T; Dong, X; Petrongolo, M

Purpose: Dual energy CT (DECT) imaging plays an important role in advanced imaging applications due to its material decomposition capability. Direct decomposition via matrix inversion suffers from significant degradation of image signal-to-noise ratios, which reduces clinical value. Existing de-noising algorithms achieve suboptimal performance since they suppress image noise either before or after the decomposition and do not fully explore the noise statistical properties of the decomposition process. We propose an iterative image-domain decomposition method for noise suppression in DECT, using the full variance-covariance matrix of the decomposed images. Methods: The proposed algorithm is formulated in the form of least-square estimationmore » with smoothness regularization. It includes the inverse of the estimated variance-covariance matrix of the decomposed images as the penalty weight in the least-square term. Performance is evaluated using an evaluation phantom (Catphan 600) and an anthropomorphic head phantom. Results are compared to those generated using direct matrix inversion with no noise suppression, a de-noising method applied on the decomposed images, and an existing algorithm with similar formulation but with an edge-preserving regularization term. Results: On the Catphan phantom, our method retains the same spatial resolution as the CT images before decomposition while reducing the noise standard deviation of decomposed images by over 98%. The other methods either degrade spatial resolution or achieve less low-contrast detectability. Also, our method yields lower electron density measurement error than direct matrix inversion and reduces error variation by over 97%. On the head phantom, it reduces the noise standard deviation of decomposed images by over 97% without blurring the sinus structures. Conclusion: We propose an iterative image-domain decomposition method for DECT. The method combines noise suppression and material decomposition into an iterative process and achieves both goals simultaneously. The proposed algorithm shows superior performance on noise suppression with high image spatial resolution and low-contrast detectability. This work is supported by a Varian MRA grant.« less
Endothermic decompositions of inorganic monocrystalline thin plates. II. Displacement rate modulation of the reaction front

NASA Astrophysics Data System (ADS)

Bertrand, G.; Comperat, M.; Lallemant, M.

1980-09-01

Copper sulfate pentahydrate dehydration into trihydrate was investigated using monocrystalline platelets with (110) crystallographic orientation. Temperature and pressure conditions were selected so as to obtain elliptical trihydrate domains. The study deals with the evolution, vs time, of elliptical domain dimensions and the evolution, vs water vapor pressure, of the {D}/{d} ratio of ellipse axes and on the other hand of the interface displacement rate along a given direction. The phenomena observed are not basically different from those yielded by the overall kinetic study of the solid sample. Their magnitude, however, is modulated depending on displacement direction. The results are analyzed within the scope of our study of endothermic decomposition of solids.
Domain decomposition methods for nonconforming finite element spaces of Lagrange-type

NASA Technical Reports Server (NTRS)

Cowsar, Lawrence C.

1993-01-01

In this article, we consider the application of three popular domain decomposition methods to Lagrange-type nonconforming finite element discretizations of scalar, self-adjoint, second order elliptic equations. The additive Schwarz method of Dryja and Widlund, the vertex space method of Smith, and the balancing method of Mandel applied to nonconforming elements are shown to converge at a rate no worse than their applications to the standard conforming piecewise linear Galerkin discretization. Essentially, the theory for the nonconforming elements is inherited from the existing theory for the conforming elements with only modest modification by constructing an isomorphism between the nonconforming finite element space and a space of continuous piecewise linear functions.
Galerkin-collocation domain decomposition method for arbitrary binary black holes

NASA Astrophysics Data System (ADS)

Barreto, W.; Clemente, P. C. M.; de Oliveira, H. P.; Rodriguez-Mueller, B.

2018-05-01

We present a new computational framework for the Galerkin-collocation method for double domain in the context of ADM 3 +1 approach in numerical relativity. This work enables us to perform high resolution calculations for initial sets of two arbitrary black holes. We use the Bowen-York method for binary systems and the puncture method to solve the Hamiltonian constraint. The nonlinear numerical code solves the set of equations for the spectral modes using the standard Newton-Raphson method, LU decomposition and Gaussian quadratures. We show convergence of our code for the conformal factor and the ADM mass. Thus, we display features of the conformal factor for different masses, spins and linear momenta.
Segmented Domain Decomposition Multigrid For 3-D Turbomachinery Flows

NASA Technical Reports Server (NTRS)

Celestina, M. L.; Adamczyk, J. J.; Rubin, S. G.

2001-01-01

A Segmented Domain Decomposition Multigrid (SDDMG) procedure was developed for three-dimensional viscous flow problems as they apply to turbomachinery flows. The procedure divides the computational domain into a coarse mesh comprised of uniformly spaced cells. To resolve smaller length scales such as the viscous layer near a surface, segments of the coarse mesh are subdivided into a finer mesh. This is repeated until adequate resolution of the smallest relevant length scale is obtained. Multigrid is used to communicate information between the different grid levels. To test the procedure, simulation results will be presented for a compressor and turbine cascade. These simulations are intended to show the ability of the present method to generate grid independent solutions. Comparisons with data will also be presented. These comparisons will further demonstrate the usefulness of the present work for they allow an estimate of the accuracy of the flow modeling equations independent of error attributed to numerical discretization.
Scalable Domain Decomposed Monte Carlo Particle Transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

O'Brien, Matthew Joseph

2013-12-05

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation.
Eigensolution of finite element problems in a completely connected parallel architecture

NASA Technical Reports Server (NTRS)

Akl, Fred A.; Morel, Michael R.

1989-01-01

A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.
Parallel rendering

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.

1995-01-01

This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
A model for optimizing file access patterns using spatio-temporal parallelism

DOE Office of Scientific and Technical Information (OSTI.GOV)

Boonthanome, Nouanesengsy; Patchett, John; Geveci, Berk

2013-01-01

For many years now, I/O read time has been recognized as the primary bottleneck for parallel visualization and analysis of large-scale data. In this paper, we introduce a model that can estimate the read time for a file stored in a parallel filesystem when given the file access pattern. Read times ultimately depend on how the file is stored and the access pattern used to read the file. The file access pattern will be dictated by the type of parallel decomposition used. We employ spatio-temporal parallelism, which combines both spatial and temporal parallelism, to provide greater flexibility to possible filemore » access patterns. Using our model, we were able to configure the spatio-temporal parallelism to design optimized read access patterns that resulted in a speedup factor of approximately 400 over traditional file access patterns.« less
Parallelization of combinatorial search when solving knapsack optimization problem on computing systems based on multicore processors

NASA Astrophysics Data System (ADS)

Rahman, P. A.

2018-05-01

This scientific paper deals with the model of the knapsack optimization problem and method of its solving based on directed combinatorial search in the boolean space. The offered by the author specialized mathematical model of decomposition of the search-zone to the separate search-spheres and the algorithm of distribution of the search-spheres to the different cores of the multi-core processor are also discussed. The paper also provides an example of decomposition of the search-zone to the several search-spheres and distribution of the search-spheres to the different cores of the quad-core processor. Finally, an offered by the author formula for estimation of the theoretical maximum of the computational acceleration, which can be achieved due to the parallelization of the search-zone to the search-spheres on the unlimited number of the processor cores, is also given.
A Subspace Approach to the Structural Decomposition and Identification of Ankle Joint Dynamic Stiffness.

PubMed

Jalaleddini, Kian; Tehrani, Ehsan Sobhani; Kearney, Robert E

2017-06-01

The purpose of this paper is to present a structural decomposition subspace (SDSS) method for decomposition of the joint torque to intrinsic, reflexive, and voluntary torques and identification of joint dynamic stiffness. First, it formulates a novel state-space representation for the joint dynamic stiffness modeled by a parallel-cascade structure with a concise parameter set that provides a direct link between the state-space representation matrices and the parallel-cascade parameters. Second, it presents a subspace method for the identification of the new state-space model that involves two steps: 1) the decomposition of the intrinsic and reflex pathways and 2) the identification of an impulse response model of the intrinsic pathway and a Hammerstein model of the reflex pathway. Extensive simulation studies demonstrate that SDSS has significant performance advantages over some other methods. Thus, SDSS was more robust under high noise conditions, converging where others failed; it was more accurate, giving estimates with lower bias and random errors. The method also worked well in practice and yielded high-quality estimates of intrinsic and reflex stiffnesses when applied to experimental data at three muscle activation levels. The simulation and experimental results demonstrate that SDSS accurately decomposes the intrinsic and reflex torques and provides accurate estimates of physiologically meaningful parameters. SDSS will be a valuable tool for studying joint stiffness under functionally important conditions. It has important clinical implications for the diagnosis, assessment, objective quantification, and monitoring of neuromuscular diseases that change the muscle tone.
Algorithms for parallel flow solvers on message passing architectures

NASA Technical Reports Server (NTRS)

Vanderwijngaart, Rob F.

1995-01-01

The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those immediately adjacent to them, then the first processor in the pipeline will receive a computational load that is less than that of subsequent processors, magnifying the pipeline slowdown effect. Extra compensation is needed for grid boundary effects, even if all grid blocks are equally sized.
Portable parallel portfolio optimization in the Aurora Financial Management System

NASA Astrophysics Data System (ADS)

Laure, Erwin; Moritsch, Hans

2001-07-01

Financial planning problems are formulated as large scale, stochastic, multiperiod, tree structured optimization problems. An efficient technique for solving this kind of problems is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we elected the programming language Java for our implementation and used a high level Java based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset liability management, the Aurora Financial Management System.
Parallel Algorithms for Computational Models of Geophysical Systems

NASA Astrophysics Data System (ADS)

Carrillo Ledesma, A.; Herrera, I.; de la Cruz, L. M.; Hernández, G.; Grupo de Modelacion Matematica y Computacional

2013-05-01

Mathematical models of many systems of interest, including very important continuous systems of Earth Sciences and Engineering, lead to a great variety of partial differential equations (PDEs) whose solution methods are based on the computational processing of large-scale algebraic systems. Furthermore, the incredible expansion experienced by the existing computational hardware and software has made amenable to effective treatment problems of an ever increasing diversity and complexity, posed by scientific and engineering applications. Parallel computing is outstanding among the new computational tools and, in order to effectively use the most advanced computers available today, massively parallel software is required. Domain decomposition methods (DDMs) have been developed precisely for effectively treating PDEs in paralle. Ideally, the main objective of domain decomposition research is to produce algorithms capable of 'obtaining the global solution by exclusively solving local problems', but up-to-now this has only been an aspiration; that is, a strong desire for achieving such a property and so we call it 'the DDM-paradigm'. In recent times, numerically competitive DDM-algorithms are non-overlapping, preconditioned and necessarily incorporate constraints which pose an additional challenge for achieving the DDM-paradigm. Recently a group of four algorithms, referred to as the 'DVS-algorithms', which fulfill the DDM-paradigm, was developed. To derive them a new discretization method, which uses a non-overlapping system of nodes (the derived-nodes), was introduced. This discretization procedure can be applied to any boundary-value problem, or system of such equations. In turn, the resulting system of discrete equations can be treated using any available DDM-algorithm. In particular, two of the four DVS-algorithms mentioned above were obtained by application of the well-known and very effective algorithms BDDC and FETI-DP; these will be referred to as the DVS-BDDC and DVS-FETI-DP algorithms. The other two, which will be referred to as the DVS-PRIMAL and DVS-DUAL algorithms, were obtained by application of two new algorithms that had not been previously reported in the literature. As said before, the four DVS-algorithms constitute a group of preconditioned and constrained algorithms that, for the first time, fulfill the DDM-paradigm. Both, BDDC and FETI-DP, are very well-known; and both are highly efficient. Recently, it was established that these two methods are closely related and its numerical performance is quite similar. On the other hand, through numerical experiments, we have established that the numerical performances of each one of the members of DVS-algorithms group (DVS-BDDC, DVS-FETI-DP, DVS-PRIMAL and DVS-DUAL) are very similar too. Furthermore, we have carried out comparisons of the performances of the standard versions of BDDC and FETI-DP with DVS-BDDC and DVS-FETI-DP, and in all such numerical experiments the DVS algorithms have performed significantly better.
Optical ranked-order filtering using threshold decomposition

DOEpatents

Allebach, Jan P.; Ochoa, Ellen; Sweeney, Donald W.

1990-01-01

A hybrid optical/electronic system performs median filtering and related ranked-order operations using threshold decomposition to encode the image. Threshold decomposition transforms the nonlinear neighborhood ranking operation into a linear space-invariant filtering step followed by a point-to-point threshold comparison step. Spatial multiplexing allows parallel processing of all the threshold components as well as recombination by a second linear, space-invariant filtering step. An incoherent optical correlation system performs the linear filtering, using a magneto-optic spatial light modulator as the input device and a computer-generated hologram in the filter plane. Thresholding is done electronically. By adjusting the value of the threshold, the same architecture is used to perform median, minimum, and maximum filtering of images. A totally optical system is also disclosed.
Simulations of relativistic quantum plasmas using real-time lattice scalar QED

NASA Astrophysics Data System (ADS)

Shi, Yuan; Xiao, Jianyuan; Qin, Hong; Fisch, Nathaniel J.

2018-05-01

Real-time lattice quantum electrodynamics (QED) provides a unique tool for simulating plasmas in the strong-field regime, where collective plasma scales are not well separated from relativistic-quantum scales. As a toy model, we study scalar QED, which describes self-consistent interactions between charged bosons and electromagnetic fields. To solve this model on a computer, we first discretize the scalar-QED action on a lattice, in a way that respects geometric structures of exterior calculus and U(1)-gauge symmetry. The lattice scalar QED can then be solved, in the classical-statistics regime, by advancing an ensemble of statistically equivalent initial conditions in time, using classical field equations obtained by extremizing the discrete action. To demonstrate the capability of our numerical scheme, we apply it to two example problems. The first example is the propagation of linear waves, where we recover analytic wave dispersion relations using numerical spectrum. The second example is an intense laser interacting with a one-dimensional plasma slab, where we demonstrate natural transition from wakefield acceleration to pair production when the wave amplitude exceeds the Schwinger threshold. Our real-time lattice scheme is fully explicit and respects local conservation laws, making it reliable for long-time dynamics. The algorithm is readily parallelized using domain decomposition, and the ensemble may be computed using quantum parallelism in the future.
CoFlame: A refined and validated numerical algorithm for modeling sooting laminar coflow diffusion flames

NASA Astrophysics Data System (ADS)

Eaves, Nick A.; Zhang, Qingan; Liu, Fengshan; Guo, Hongsheng; Dworkin, Seth B.; Thomson, Murray J.

2016-10-01

Mitigation of soot emissions from combustion devices is a global concern. For example, recent EURO 6 regulations for vehicles have placed stringent limits on soot emissions. In order to allow design engineers to achieve the goal of reduced soot emissions, they must have the tools to so. Due to the complex nature of soot formation, which includes growth and oxidation, detailed numerical models are required to gain fundamental insights into the mechanisms of soot formation. A detailed description of the CoFlame FORTRAN code which models sooting laminar coflow diffusion flames is given. The code solves axial and radial velocity, temperature, species conservation, and soot aggregate and primary particle number density equations. The sectional particle dynamics model includes nucleation, PAH condensation and HACA surface growth, surface oxidation, coagulation, fragmentation, particle diffusion, and thermophoresis. The code utilizes a distributed memory parallelization scheme with strip-domain decomposition. The public release of the CoFlame code, which has been refined in terms of coding structure, to the research community accompanies this paper. CoFlame is validated against experimental data for reattachment length in an axi-symmetric pipe with a sudden expansion, and ethylene-air and methane-air diffusion flames for multiple soot morphological parameters and gas-phase species. Finally, the parallel performance and computational costs of the code is investigated.
The relativistic theory of the chemical shift

NASA Astrophysics Data System (ADS)

Pyper, N. C.

1983-04-01

A relativistic theory of the NMR chemical shift for a closed-shell system is presented. The final expression for the shielding, derived by, applying two Gordon decompositions to the Dirac current operator, closely parallels the Ramsey non-relativistic result.

Micropore extrusion-induced alignment transition from perpendicular to parallel of cylindrical domains in block copolymers.

PubMed

Qu, Ting; Zhao, Yongbin; Li, Zongbo; Wang, Pingping; Cao, Shubo; Xu, Yawei; Li, Yayuan; Chen, Aihua

2016-02-14

The orientation transition from perpendicular to parallel alignment of PEO cylindrical domains of PEO-b-PMA(Az) films has been demonstrated by extruding the block copolymer (BCP) solutions through a micropore of a plastic gastight syringe. The parallelized orientation of PEO domains induced by this micropore extrusion can be recovered to perpendicular alignment via ultrasonication of the extruded BCP solutions and subsequent annealing. A plausible mechanism is proposed in this study. The BCP films can be used as templates to prepare nanowire arrays with controlled layers, which has enormous potential application in the field of integrated circuits.
Development of parallel algorithms for electrical power management in space applications

NASA Technical Reports Server (NTRS)

Berry, Frederick C.

1989-01-01

The application of parallel techniques for electrical power system analysis is discussed. The Newton-Raphson method of load flow analysis was used along with the decomposition-coordination technique to perform load flow analysis. The decomposition-coordination technique enables tasks to be performed in parallel by partitioning the electrical power system into independent local problems. Each independent local problem represents a portion of the total electrical power system on which a loan flow analysis can be performed. The load flow analysis is performed on these partitioned elements by using the Newton-Raphson load flow method. These independent local problems will produce results for voltage and power which can then be passed to the coordinator portion of the solution procedure. The coordinator problem uses the results of the local problems to determine if any correction is needed on the local problems. The coordinator problem is also solved by an iterative method much like the local problem. The iterative method for the coordination problem will also be the Newton-Raphson method. Therefore, each iteration at the coordination level will result in new values for the local problems. The local problems will have to be solved again along with the coordinator problem until some convergence conditions are met.
High order parallel numerical schemes for solving incompressible flows

NASA Technical Reports Server (NTRS)

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.

1992-01-01

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
PEM-PCA: a parallel expectation-maximization PCA face recognition architecture.

PubMed

Rujirakul, Kanokmon; So-In, Chakchai; Arnonkijpanich, Banchar

2014-01-01

Principal component analysis or PCA has been traditionally used as one of the feature extraction techniques in face recognition systems yielding high accuracy when requiring a small number of features. However, the covariance matrix and eigenvalue decomposition stages cause high computational complexity, especially for a large database. Thus, this research presents an alternative approach utilizing an Expectation-Maximization algorithm to reduce the determinant matrix manipulation resulting in the reduction of the stages' complexity. To improve the computational time, a novel parallel architecture was employed to utilize the benefits of parallelization of matrix computation during feature extraction and classification stages including parallel preprocessing, and their combinations, so-called a Parallel Expectation-Maximization PCA architecture. Comparing to a traditional PCA and its derivatives, the results indicate lower complexity with an insignificant difference in recognition precision leading to high speed face recognition systems, that is, the speed-up over nine and three times over PCA and Parallel PCA.
Preprocessed cumulative reconstructor with domain decomposition: a fast wavefront reconstruction method for pyramid wavefront sensor.

PubMed

Shatokhina, Iuliia; Obereder, Andreas; Rosensteiner, Matthias; Ramlau, Ronny

2013-04-20

We present a fast method for the wavefront reconstruction from pyramid wavefront sensor (P-WFS) measurements. The method is based on an analytical relation between pyramid and Shack-Hartmann sensor (SH-WFS) data. The algorithm consists of two steps--a transformation of the P-WFS data to SH data, followed by the application of cumulative reconstructor with domain decomposition, a wavefront reconstructor from SH-WFS measurements. The closed loop simulations confirm that our method provides the same quality as the standard matrix vector multiplication method. A complexity analysis as well as speed tests confirm that the method is very fast. Thus, the method can be used on extremely large telescopes, e.g., for eXtreme adaptive optics systems.
Object detection with a multistatic array using singular value decomposition

DOEpatents

Hallquist, Aaron T.; Chambers, David H.

2014-07-01

A method and system for detecting the presence of subsurface objects within a medium is provided. In some embodiments, the detection system operates in a multistatic mode to collect radar return signals generated by an array of transceiver antenna pairs that is positioned across a surface and that travels down the surface. The detection system converts the return signals from a time domain to a frequency domain, resulting in frequency return signals. The detection system then performs a singular value decomposition for each frequency to identify singular values for each frequency. The detection system then detects the presence of a subsurface object based on a comparison of the identified singular values to expected singular values when no subsurface object is present.
Parallel Algorithms for Monte Carlo Particle Transport Simulation on Exascale Computing Architectures

NASA Astrophysics Data System (ADS)

Romano, Paul Kollath

Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is O( N ) whereas traditional fission bank algorithms are O(N) at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes---in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency. The model predictions were verified with measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs mit.edu)
DOE Office of Scientific and Technical Information (OSTI.GOV)

Meng, F.; Banks, J. W.; Henshaw, W. D.

We describe a new partitioned approach for solving conjugate heat transfer (CHT) problems where the governing temperature equations in different material domains are time-stepped in a implicit manner, but where the interface coupling is explicit. The new approach, called the CHAMP scheme (Conjugate Heat transfer Advanced Multi-domain Partitioned), is based on a discretization of the interface coupling conditions using a generalized Robin (mixed) condition. The weights in the Robin condition are determined from the optimization of a condition derived from a local stability analysis of the coupling scheme. The interface treatment combines ideas from optimized-Schwarz methods for domain-decomposition problems togethermore » with the interface jump conditions and additional compatibility jump conditions derived from the governing equations. For many problems (i.e. for a wide range of material properties, grid-spacings and time-steps) the CHAMP algorithm is stable and second-order accurate using no sub-time-step iterations (i.e. a single implicit solve of the temperature equation in each domain). In extreme cases (e.g. very fine grids with very large time-steps) it may be necessary to perform one or more sub-iterations. Each sub-iteration generally increases the range of stability substantially and thus one sub-iteration is likely sufficient for the vast majority of practical problems. The CHAMP algorithm is developed first for a model problem and analyzed using normal-mode the- ory. The theory provides a mechanism for choosing optimal parameters in the mixed interface condition. A comparison is made to the classical Dirichlet-Neumann (DN) method and, where applicable, to the optimized- Schwarz (OS) domain-decomposition method. For problems with different thermal conductivities and dif- fusivities, the CHAMP algorithm outperforms the DN scheme. For domain-decomposition problems with uniform conductivities and diffusivities, the CHAMP algorithm performs better than the typical OS scheme with one grid-cell overlap. Lastly, the CHAMP scheme is also developed for general curvilinear grids and CHT ex- amples are presented using composite overset grids that confirm the theory and demonstrate the effectiveness of the approach.« less
PCTDSE: A parallel Cartesian-grid-based TDSE solver for modeling laser-atom interactions

NASA Astrophysics Data System (ADS)

Fu, Yongsheng; Zeng, Jiaolong; Yuan, Jianmin

2017-01-01

We present a parallel Cartesian-grid-based time-dependent Schrödinger equation (TDSE) solver for modeling laser-atom interactions. It can simulate the single-electron dynamics of atoms in arbitrary time-dependent vector potentials. We use a split-operator method combined with fast Fourier transforms (FFT), on a three-dimensional (3D) Cartesian grid. Parallelization is realized using a 2D decomposition strategy based on the Message Passing Interface (MPI) library, which results in a good parallel scaling on modern supercomputers. We give simple applications for the hydrogen atom using the benchmark problems coming from the references and obtain repeatable results. The extensions to other laser-atom systems are straightforward with minimal modifications of the source code.
Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

NASA Astrophysics Data System (ADS)

Boyko, Oleksiy; Zheleznyak, Mark

2015-04-01

The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by everymore » processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.« less
Novel techniques for data decomposition and load balancing for parallel processing of vision systems: Implementation and evaluation using a motion estimation system

NASA Technical Reports Server (NTRS)

Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.

1989-01-01

Computer vision systems employ a sequence of vision algorithms in which the output of an algorithm is the input of the next algorithm in the sequence. Algorithms that constitute such systems exhibit vastly different computational characteristics, and therefore, require different data decomposition techniques and efficient load balancing techniques for parallel implementation. However, since the input data for a task is produced as the output data of the previous task, this information can be exploited to perform knowledge based data decomposition and load balancing. Presented here are algorithms for a motion estimation system. The motion estimation is based on the point correspondence between the involved images which are a sequence of stereo image pairs. Researchers propose algorithms to obtain point correspondences by matching feature points among stereo image pairs at any two consecutive time instants. Furthermore, the proposed algorithms employ non-iterative procedures, which results in saving considerable amounts of computation time. The system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from consecutive time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters.
Iterative image-domain decomposition for dual-energy CT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Niu, Tianye; Dong, Xue; Petrongolo, Michael

2014-04-15

Purpose: Dual energy CT (DECT) imaging plays an important role in advanced imaging applications due to its capability of material decomposition. Direct decomposition via matrix inversion suffers from significant degradation of image signal-to-noise ratios, which reduces clinical values of DECT. Existing denoising algorithms achieve suboptimal performance since they suppress image noise either before or after the decomposition and do not fully explore the noise statistical properties of the decomposition process. In this work, the authors propose an iterative image-domain decomposition method for noise suppression in DECT, using the full variance-covariance matrix of the decomposed images. Methods: The proposed algorithm ismore » formulated in the form of least-square estimation with smoothness regularization. Based on the design principles of a best linear unbiased estimator, the authors include the inverse of the estimated variance-covariance matrix of the decomposed images as the penalty weight in the least-square term. The regularization term enforces the image smoothness by calculating the square sum of neighboring pixel value differences. To retain the boundary sharpness of the decomposed images, the authors detect the edges in the CT images before decomposition. These edge pixels have small weights in the calculation of the regularization term. Distinct from the existing denoising algorithms applied on the images before or after decomposition, the method has an iterative process for noise suppression, with decomposition performed in each iteration. The authors implement the proposed algorithm using a standard conjugate gradient algorithm. The method performance is evaluated using an evaluation phantom (Catphan©600) and an anthropomorphic head phantom. The results are compared with those generated using direct matrix inversion with no noise suppression, a denoising method applied on the decomposed images, and an existing algorithm with similar formulation as the proposed method but with an edge-preserving regularization term. Results: On the Catphan phantom, the method maintains the same spatial resolution on the decomposed images as that of the CT images before decomposition (8 pairs/cm) while significantly reducing their noise standard deviation. Compared to that obtained by the direct matrix inversion, the noise standard deviation in the images decomposed by the proposed algorithm is reduced by over 98%. Without considering the noise correlation properties in the formulation, the denoising scheme degrades the spatial resolution to 6 pairs/cm for the same level of noise suppression. Compared to the edge-preserving algorithm, the method achieves better low-contrast detectability. A quantitative study is performed on the contrast-rod slice of Catphan phantom. The proposed method achieves lower electron density measurement error as compared to that by the direct matrix inversion, and significantly reduces the error variation by over 97%. On the head phantom, the method reduces the noise standard deviation of decomposed images by over 97% without blurring the sinus structures. Conclusions: The authors propose an iterative image-domain decomposition method for DECT. The method combines noise suppression and material decomposition into an iterative process and achieves both goals simultaneously. By exploring the full variance-covariance properties of the decomposed images and utilizing the edge predetection, the proposed algorithm shows superior performance on noise suppression with high image spatial resolution and low-contrast detectability.« less
A high performance linear equation solver on the VPP500 parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi

1994-12-31

This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of VPP500--(1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LIN-PACK Highly Parallel Computing benchmark.
Optical ranked-order filtering using threshold decomposition

DOEpatents

Allebach, J.P.; Ochoa, E.; Sweeney, D.W.

1987-10-09

A hybrid optical/electronic system performs median filtering and related ranked-order operations using threshold decomposition to encode the image. Threshold decomposition transforms the nonlinear neighborhood ranking operation into a linear space-invariant filtering step followed by a point-to-point threshold comparison step. Spatial multiplexing allows parallel processing of all the threshold components as well as recombination by a second linear, space-invariant filtering step. An incoherent optical correlation system performs the linear filtering, using a magneto-optic spatial light modulator as the input device and a computer-generated hologram in the filter plane. Thresholding is done electronically. By adjusting the value of the threshold, the same architecture is used to perform median, minimum, and maximum filtering of images. A totally optical system is also disclosed. 3 figs.
Distributed intelligence for supervisory control

NASA Technical Reports Server (NTRS)

Wolfe, W. J.; Raney, S. D.

1987-01-01

Supervisory control systems must deal with various types of intelligence distributed throughout the layers of control. Typical layers are real-time servo control, off-line planning and reasoning subsystems and finally, the human operator. Design methodologies must account for the fact that the majority of the intelligence will reside with the human operator. Hierarchical decompositions and feedback loops as conceptual building blocks that provide a common ground for man-machine interaction are discussed. Examples of types of parallelism and parallel implementation on several classes of computer architecture are also discussed.
Parallel algorithm for computation of second-order sequential best rotations

NASA Astrophysics Data System (ADS)

Redif, Soydan; Kasap, Server

2013-12-01

Algorithms for computing an approximate polynomial matrix eigenvalue decomposition of para-Hermitian systems have emerged as a powerful, generic signal processing tool. A technique that has shown much success in this regard is the sequential best rotation (SBR2) algorithm. Proposed is a scheme for parallelising SBR2 with a view to exploiting the modern architectural features and inherent parallelism of field-programmable gate array (FPGA) technology. Experiments show that the proposed scheme can achieve low execution times while requiring minimal FPGA resources.
Application of cross recurrence plot for identification of temperature fluctuations synchronization in parallel minichannels

NASA Astrophysics Data System (ADS)

Grzybowski, H.; Mosdorf, R.

2016-09-01

The temperature fluctuations occurring in flow boiling in parallel minichannels with diameter of 1 mm have been experimentally investigated and analysed. The wall temperature was recorded at each minichannel outlet by thermocouple with 0.08 mm diameter probe. The time series where recorded during dynamic two-phase flow instabilities which are accompanied by chaotic temperature fluctuations. Time series were denoised using wavelet decomposition and were analysed using cross recurrence plots (CRP) which enables the study of two time series synchronization.
Computational structures for robotic computations

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Chang, P. R.

1987-01-01

The computational problem of inverse kinematics and inverse dynamics of robot manipulators by taking advantage of parallelism and pipelining architectures is discussed. For the computation of inverse kinematic position solution, a maximum pipelined CORDIC architecture has been designed based on a functional decomposition of the closed-form joint equations. For the inverse dynamics computation, an efficient p-fold parallel algorithm to overcome the recurrence problem of the Newton-Euler equations of motion to achieve the time lower bound of O(log sub 2 n) has also been developed.
Queueing Network Models for Parallel Processing of Task Systems: an Operational Approach

NASA Technical Reports Server (NTRS)

Mak, Victor W. K.

1986-01-01

Computer performance modeling of possibly complex computations running on highly concurrent systems is considered. Earlier works in this area either dealt with a very simple program structure or resulted in methods with exponential complexity. An efficient procedure is developed to compute the performance measures for series-parallel-reducible task systems using queueing network models. The procedure is based on the concept of hierarchical decomposition and a new operational approach. Numerical results for three test cases are presented and compared to those of simulations.

Parallelization of TWOPORFLOW, a Cartesian Grid based Two-phase Porous Media Code for Transient Thermo-hydraulic Simulations

NASA Astrophysics Data System (ADS)

Trost, Nico; Jiménez, Javier; Imke, Uwe; Sanchez, Victor

2014-06-01

TWOPORFLOW is a thermo-hydraulic code based on a porous media approach to simulate single- and two-phase flow including boiling. It is under development at the Institute for Neutron Physics and Reactor Technology (INR) at KIT. The code features a 3D transient solution of the mass, momentum and energy conservation equations for two inter-penetrating fluids with a semi-implicit continuous Eulerian type solver. The application domain of TWOPORFLOW includes the flow in standard porous media and in structured porous media such as micro-channels and cores of nuclear power plants. In the latter case, the fluid domain is coupled to a fuel rod model, describing the heat flow inside the solid structure. In this work, detailed profiling tools have been utilized to determine the optimization potential of TWOPORFLOW. As a result, bottle-necks were identified and reduced in the most feasible way, leading for instance to an optimization of the water-steam property computation. Furthermore, an OpenMP implementation addressing the routines in charge of inter-phase momentum-, energy- and mass-coupling delivered good performance together with a high scalability on shared memory architectures. In contrast to that, the approach for distributed memory systems was to solve sub-problems resulting by the decomposition of the initial Cartesian geometry. Thread communication for the sub-problem boundary updates was accomplished by the Message Passing Interface (MPI) standard.
Whole-annulus aeroelasticity analysis of a 17-bladerow WRF compressor using an unstructured Navier Stokes solver

NASA Astrophysics Data System (ADS)

Wu, X.; Vahdati, M.; Sayma, A.; Imregun, M.

2005-03-01

This paper describes a large-scale aeroelasticity computation for an aero-engine core compressor. The computational domain includes all 17 bladerows, resulting in a mesh with over 68 million points. The Favre-averaged Navier Stokes equations are used to represent the flow in a non-linear time-accurate fashion on unstructured meshes of mixed elements. The structural model of the first two rotor bladerows is based on a standard finite element representation. The fluid mesh is moved at each time step according to the structural motion so that changes in blade aerodynamic damping and flow unsteadiness can be accommodated automatically. An efficient domain decomposition technique, where special care was taken to balance the memory requirement across processors, was developed as part of the work. The calculation was conducted in parallel mode on 128 CPUs of an SGI Origin 3000. Ten vibration cycles were obtained using over 2.2 CPU years, though the elapsed time was a week only. Steady-state flow measurements and predictions were found to be in good agreement. A comparison of the averaged unsteady flow and the steady-state flow revealed some discrepancies. It was concluded that, in due course, the methodology would be adopted by industry to perform routine numerical simulations of the unsteady flow through entire compressor assemblies with vibrating blades not only to minimise engine and rig tests but also to improve performance predictions.
Phase segregation in multiphase turbulent channel flow

NASA Astrophysics Data System (ADS)

Bianco, Federico; Soldati, Alfredo

2014-11-01

The phase segregation of a rapidly quenched mixture (namely spinodal decomposition) is numerically investigated. A phase field approach is considered. Direct numerical simulation of the coupled Navier-Stokes and Cahn-Hilliard equations is performed with spectral accuracy and focus has been put on domain growth scaling laws, in a wide range of regimes. The numerical method has been first validated against well known results of literature, then spinodal decomposition in a turbulent bounded flow (channel flow) has been considered. As for homogeneous isotropic case, turbulent fluctuations suppress the segregation process when surface tension at the interfaces is relatively low (namely low Weber number regimes). For these regimes, segregated domains size reaches a statistically steady state due to mixing and break-up phenomena. In contrast with homogenous and isotropic turbulence, the presence of mean shear, leads to a typical domain size that show a wall-distance dependence. Finally, preliminary results on the effects to the drag forces at the wall, due to phase segregation, have been discussed. Regione FVG, program PAR-FSC.
MARS-MD: rejection based image domain material decomposition

NASA Astrophysics Data System (ADS)

Bateman, C. J.; Knight, D.; Brandwacht, B.; McMahon, J.; Healy, J.; Panta, R.; Aamir, R.; Rajendran, K.; Moghiseh, M.; Ramyar, M.; Rundle, D.; Bennett, J.; de Ruiter, N.; Smithies, D.; Bell, S. T.; Doesburg, R.; Chernoglazov, A.; Mandalika, V. B. H.; Walsh, M.; Shamshad, M.; Anjomrouz, M.; Atharifard, A.; Vanden Broeke, L.; Bheesette, S.; Kirkbride, T.; Anderson, N. G.; Gieseg, S. P.; Woodfield, T.; Renaud, P. F.; Butler, A. P. H.; Butler, P. H.

2018-05-01

This paper outlines image domain material decomposition algorithms that have been routinely used in MARS spectral CT systems. These algorithms (known collectively as MARS-MD) are based on a pragmatic heuristic for solving the under-determined problem where there are more materials than energy bins. This heuristic contains three parts: (1) splitting the problem into a number of possible sub-problems, each containing fewer materials; (2) solving each sub-problem; and (3) applying rejection criteria to eliminate all but one sub-problem's solution. An advantage of this process is that different constraints can be applied to each sub-problem if necessary. In addition, the result of this process is that solutions will be sparse in the material domain, which reduces crossover of signal between material images. Two algorithms based on this process are presented: the Segmentation variant, which uses segmented material classes to define each sub-problem; and the Angular Rejection variant, which defines the rejection criteria using the angle between reconstructed attenuation vectors.
Multiscale infrared and visible image fusion using gradient domain guided image filtering

NASA Astrophysics Data System (ADS)

Zhu, Jin; Jin, Weiqi; Li, Li; Han, Zhenghao; Wang, Xia

2018-03-01

For better surveillance with infrared and visible imaging, a novel hybrid multiscale decomposition fusion method using gradient domain guided image filtering (HMSD-GDGF) is proposed in this study. In this method, hybrid multiscale decomposition with guided image filtering and gradient domain guided image filtering of source images are first applied before the weight maps of each scale are obtained using a saliency detection technology and filtering means with three different fusion rules at different scales. The three types of fusion rules are for small-scale detail level, large-scale detail level, and base level. Finally, the target becomes more salient and can be more easily detected in the fusion result, with the detail information of the scene being fully displayed. After analyzing the experimental comparisons with state-of-the-art fusion methods, the HMSD-GDGF method has obvious advantages in fidelity of salient information (including structural similarity, brightness, and contrast), preservation of edge features, and human visual perception. Therefore, visual effects can be improved by using the proposed HMSD-GDGF method.
Surface-Accelerated Decomposition of δ-HMX.

PubMed

Sharia, Onise; Tsyshevsky, Roman; Kuklja, Maija M

2013-03-07

Despite extensive efforts to study the explosive decomposition of HMX, a cyclic nitramine widely used as a solid fuel, explosive, and propellant, an understanding of the physicochemical processes, governing the sensitivity of condensed HMX to detonation initiation is not yet achieved. Experimental and theoretical explorations of the initiation of chemistry are equally challenging because of many complex parallel processes, including the β-δ phase transition and the decomposition from both phases. Among four known polymorphs, HMX is produced in the most stable β-phase, which transforms into the most reactive δ-phase under heat or pressure. In this study, the homolytic NO2 loss and HONO elimination precursor reactions of the gas-phase, ideal crystal, and the (100) surface of δ-HMX are explored by first principles modeling. Our calculations revealed that the high sensitivity of δ-HMX is attributed to interactions of surfaces and molecular dipole moments. While both decomposition reactions coexist, the exothermic HONO-isomer formation catalyzes the N-NO2 homolysis, leading to fast violent explosions.
Wakefield Simulation of CLIC PETS Structure Using Parallel 3D Finite Element Time-Domain Solver T3P

DOE Office of Scientific and Technical Information (OSTI.GOV)

Candel, A.; Kabel, A.; Lee, L.

In recent years, SLAC's Advanced Computations Department (ACD) has developed the parallel 3D Finite Element electromagnetic time-domain code T3P. Higher-order Finite Element methods on conformal unstructured meshes and massively parallel processing allow unprecedented simulation accuracy for wakefield computations and simulations of transient effects in realistic accelerator structures. Applications include simulation of wakefield damping in the Compact Linear Collider (CLIC) power extraction and transfer structure (PETS).
A New Parallel Boundary Condition for Turbulence Simulations in Stellarators

NASA Astrophysics Data System (ADS)

Martin, Mike F.; Landreman, Matt; Dorland, William; Xanthopoulos, Pavlos

2017-10-01

For gyrokinetic simulations of core turbulence, the ``twist-and-shift'' parallel boundary condition (Beer et al., PoP, 1995), which involves a shift in radial wavenumber proportional to the global shear and a quantization of the simulation domain's aspect ratio, is the standard choice. But as this condition was derived under the assumption of axisymmetry, ``twist-and-shift'' as it stands is formally incorrect for turbulence simulations in stellarators. Moreover, for low-shear stellarators like W7X and HSX, the use of a global shear in the traditional boundary condition places an inflexible constraint on the aspect ratio of the domain, requiring more grid points to fully resolve its extent. Here, we present a parallel boundary condition for ``stellarator-symmetric'' simulations that relies on the local shear along a field line. This boundary condition is similar to ``twist-and-shift'', but has an added flexibility in choosing the parallel length of the domain based on local shear consideration in order to optimize certain parameters such as the aspect ratio of the simulation domain.
F4 , E6 and G2 exceptional gauge groups in the vacuum domain structure model

NASA Astrophysics Data System (ADS)

Shahlaei, Amir; Rafibakhsh, Shahnoosh

2018-03-01

Using a vacuum domain structure model, we calculate trivial static potentials in various representations of F4 , E6, and G2 exceptional groups by means of the unit center element. Due to the absence of the nontrivial center elements, the potential of every representation is screened at far distances. However, the linear part is observed at intermediate quark separations and is investigated by the decomposition of the exceptional group to its maximal subgroups. Comparing the group factor of the supergroup with the corresponding one obtained from the nontrivial center elements of S U (3 ) subgroup shows that S U (3 ) is not the direct cause of temporary confinement in any of the exceptional groups. However, the trivial potential obtained from the group decomposition into the S U (3 ) subgroup is the same as the potential of the supergroup itself. In addition, any regular or singular decomposition into the S U (2 ) subgroup that produces the Cartan generator with the same elements as h1, in any exceptional group, leads to the linear intermediate potential of the exceptional gauge groups. The other S U (2 ) decompositions with the Cartan generator different from h1 are still able to describe the linear potential if the number of S U (2 ) nontrivial center elements that emerge in the decompositions is the same. As a result, it is the center vortices quantized in terms of nontrivial center elements of the S U (2 ) subgroup that give rise to the intermediate confinement in the static potentials.
Reduced Order Model Basis Vector Generation: Generates Basis Vectors fro ROMs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arrighi, Bill

2016-03-03

libROM is a library that implements order reduction via singular value decomposition (SVD) of sampled state vectors. It implements 2 parallel, incremental SVD algorithms and one serial, non-incremental algorithm. It also provides a mechanism for adaptive sampling of basis vectors.
Diderot: a Domain-Specific Language for Portable Parallel Scientific Visualization and Image Analysis.

PubMed

Kindlmann, Gordon; Chiw, Charisee; Seltzer, Nicholas; Samuels, Lamont; Reppy, John

2016-01-01

Many algorithms for scientific visualization and image analysis are rooted in the world of continuous scalar, vector, and tensor fields, but are programmed in low-level languages and libraries that obscure their mathematical foundations. Diderot is a parallel domain-specific language that is designed to bridge this semantic gap by providing the programmer with a high-level, mathematical programming notation that allows direct expression of mathematical concepts in code. Furthermore, Diderot provides parallel performance that takes advantage of modern multicore processors and GPUs. The high-level notation allows a concise and natural expression of the algorithms and the parallelism allows efficient execution on real-world datasets.
A domain decomposition approach to implementing fault slip in finite-element models of quasi-static and dynamic crustal deformation

USGS Publications Warehouse

Aagaard, Brad T.; Knepley, M.G.; Williams, C.A.

2013-01-01

We employ a domain decomposition approach with Lagrange multipliers to implement fault slip in a finite-element code, PyLith, for use in both quasi-static and dynamic crustal deformation applications. This integrated approach to solving both quasi-static and dynamic simulations leverages common finite-element data structures and implementations of various boundary conditions, discretization schemes, and bulk and fault rheologies. We have developed a custom preconditioner for the Lagrange multiplier portion of the system of equations that provides excellent scalability with problem size compared to conventional additive Schwarz methods. We demonstrate application of this approach using benchmarks for both quasi-static viscoelastic deformation and dynamic spontaneous rupture propagation that verify the numerical implementation in PyLith.
Application of modified Martinez-Silva algorithm in determination of net cover

NASA Astrophysics Data System (ADS)

Stefanowicz, Łukasz; Grobelna, Iwona

2016-12-01

In the article we present the idea of modifications of Martinez-Silva algorithm, which allows for determination of place invariants (p-invariants) of Petri net. Their generation time is important in the parallel decomposition of discrete systems described by Petri nets. Decomposition process is essential from the point of view of discrete system design, as it allows for separation of smaller sequential parts. The proposed modifications of Martinez-Silva method concern the net cover by p-invariants and are focused on two important issues: cyclic reduction of invariant matrix and cyclic checking of net cover.
3D Staggered-Grid Finite-Difference Simulation of Acoustic Waves in Turbulent Moving Media

NASA Astrophysics Data System (ADS)

Symons, N. P.; Aldridge, D. F.; Marlin, D.; Wilson, D. K.; Sullivan, P.; Ostashev, V.

2003-12-01

Acoustic wave propagation in a three-dimensional heterogeneous moving atmosphere is accurately simulated with a numerical algorithm recently developed under the DOD Common High Performance Computing Software Support Initiative (CHSSI). Sound waves within such a dynamic environment are mathematically described by a set of four, coupled, first-order partial differential equations governing small-amplitude fluctuations in pressure and particle velocity. The system is rigorously derived from fundamental principles of continuum mechanics, ideal-fluid constitutive relations, and reasonable assumptions that the ambient atmospheric motion is adiabatic and divergence-free. An explicit, time-domain, finite-difference (FD) numerical scheme is used to solve the system for both pressure and particle velocity wavefields. The atmosphere is characterized by 3D gridded models of sound speed, mass density, and the three components of the wind velocity vector. Dependent variables are stored on staggered spatial and temporal grids, and centered FD operators possess 2nd-order and 4th-order space/time accuracy. Accurate sound wave simulation is achieved provided grid intervals are chosen appropriately. The gridding must be fine enough to reduce numerical dispersion artifacts to an acceptable level and maintain stability. The algorithm is designed to execute on parallel computational platforms by utilizing a spatial domain-decomposition strategy. Currently, the algorithm has been validated on four different computational platforms, and parallel scalability of approximately 85% has been demonstrated. Comparisons with analytic solutions for uniform and vertically stratified wind models indicate that the FD algorithm generates accurate results with either a vanishing pressure or vanishing vertical-particle velocity boundary condition. Simulations are performed using a kinematic turbulence wind profile developed with the quasi-wavelet method. In addition, preliminary results are presented using high-resolution 3D dynamic turbulent flowfields generated by a large-eddy simulation model of a stably stratified planetary boundary layer. Sandia National Laboratories is a operated by Sandia Corporation, a Lockheed Martin Company, for the USDOE under contract 94-AL85000.
Sequential or parallel decomposed processing of two-digit numbers? Evidence from eye-tracking.

PubMed

Moeller, Korbinian; Fischer, Martin H; Nuerk, Hans-Christoph; Willmes, Klaus

2009-02-01

While reaction time data have shown that decomposed processing of two-digit numbers occurs, there is little evidence about how decomposed processing functions. Poltrock and Schwartz (1984) argued that multi-digit numbers are compared in a sequential digit-by-digit fashion starting at the leftmost digit pair. In contrast, Nuerk and Willmes (2005) favoured parallel processing of the digits constituting a number. These models (i.e., sequential decomposition, parallel decomposition) make different predictions regarding the fixation pattern in a two-digit number magnitude comparison task and can therefore be differentiated by eye fixation data. We tested these models by evaluating participants' eye fixation behaviour while selecting the larger of two numbers. The stimulus set consisted of within-decade comparisons (e.g., 53_57) and between-decade comparisons (e.g., 42_57). The between-decade comparisons were further divided into compatible and incompatible trials (cf. Nuerk, Weger, & Willmes, 2001) and trials with different decade and unit distances. The observed fixation pattern implies that the comparison of two-digit numbers is not executed by sequentially comparing decade and unit digits as proposed by Poltrock and Schwartz (1984) but rather in a decomposed but parallel fashion. Moreover, the present fixation data provide first evidence that digit processing in multi-digit numbers is not a pure bottom-up effect, but is also influenced by top-down factors. Finally, implications for multi-digit number processing beyond the range of two-digit numbers are discussed.
Open-Source Development of the Petascale Reactive Flow and Transport Code PFLOTRAN

NASA Astrophysics Data System (ADS)

Hammond, G. E.; Andre, B.; Bisht, G.; Johnson, T.; Karra, S.; Lichtner, P. C.; Mills, R. T.

2013-12-01

Open-source software development has become increasingly popular in recent years. Open-source encourages collaborative and transparent software development and promotes unlimited free redistribution of source code to the public. Open-source development is good for science as it reveals implementation details that are critical to scientific reproducibility, but generally excluded from journal publications. In addition, research funds that would have been spent on licensing fees can be redirected to code development that benefits more scientists. In 2006, the developers of PFLOTRAN open-sourced their code under the U.S. Department of Energy SciDAC-II program. Since that time, the code has gained popularity among code developers and users from around the world seeking to employ PFLOTRAN to simulate thermal, hydraulic, mechanical and biogeochemical processes in the Earth's surface/subsurface environment. PFLOTRAN is a massively-parallel subsurface reactive multiphase flow and transport simulator designed from the ground up to run efficiently on computing platforms ranging from the laptop to leadership-class supercomputers, all from a single code base. The code employs domain decomposition for parallelism and is founded upon the well-established and open-source parallel PETSc and HDF5 frameworks. PFLOTRAN leverages modern Fortran (i.e. Fortran 2003-2008) in its extensible object-oriented design. The use of this progressive, yet domain-friendly programming language has greatly facilitated collaboration in the code's software development. Over the past year, PFLOTRAN's top-level data structures were refactored as Fortran classes (i.e. extendible derived types) to improve the flexibility of the code, ease the addition of new process models, and enable coupling to external simulators. For instance, PFLOTRAN has been coupled to the parallel electrical resistivity tomography code E4D to enable hydrogeophysical inversion while the same code base can be used as a third-party library to provide hydrologic flow, energy transport, and biogeochemical capability to the community land model, CLM, part of the open-source community earth system model (CESM) for climate. In this presentation, the advantages and disadvantages of open source software development in support of geoscience research at government laboratories, universities, and the private sector are discussed. Since the code is open-source (i.e. it's transparent and readily available to competitors), the PFLOTRAN team's development strategy within a competitive research environment is presented. Finally, the developers discuss their approach to object-oriented programming and the leveraging of modern Fortran in support of collaborative geoscience research as the Fortran standard evolves among compiler vendors.
Phylogeny of the TRAF/MATH domain.

PubMed

Zapata, Juan M; Martínez-García, Vanesa; Lefebvre, Sophie

2007-01-01

The TNF-receptor associated factor (TRAF) domain (TD), also known as the meprin and TRAF-C homology (MATH) domain is a fold of seven anti-parallel p-helices that participates in protein-protein interactions. This fold is broadly represented among eukaryotes, where it is found associated with a discrete set of protein-domains. Virtually all protein families encompassing a TRAF/MATH domain seem to be involved in the regulation of protein processing and ubiquitination, strongly suggesting a parallel evolution of the TRAF/MATH domain and certain proteolysis pathways in eukaryotes. The restricted number of living organisms for which we have information of their genetic and protein make-up limits the scope and analysis of the MATH domain in evolution. However, the available information allows us to get a glimpse on the origins, distribution and evolution of the TRAF/MATH domain, which will be overviewed in this chapter.
Wavelet bases on the L-shaped domain

NASA Astrophysics Data System (ADS)

Jouini, Abdellatif; Lemarié-Rieusset, Pierre Gilles

2013-07-01

We present in this paper two elementary constructions of multiresolution analyses on the L-shaped domain D. In the first one, we shall describe a direct method to define an orthonormal multiresolution analysis. In the second one, we use the decomposition method for constructing a biorthogonal multiresolution analysis. These analyses are adapted for the study of the Sobolev spaces Hs(D)(s∈N).
A Hybrid Procedural/Deductive Executive for Autonomous Spacecraft

NASA Technical Reports Server (NTRS)

Pell, Barney; Gamble, Edward B.; Gat, Erann; Kessing, Ron; Kurien, James; Millar, William; Nayak, P. Pandurang; Plaunt, Christian; Williams, Brian C.; Lau, Sonie (Technical Monitor)

1998-01-01

The New Millennium Remote Agent (NMRA) will be the first AI system to control an actual spacecraft. The spacecraft domain places a strong premium on autonomy and requires dynamic recoveries and robust concurrent execution, all in the presence of tight real-time deadlines, changing goals, scarce resource constraints, and a wide variety of possible failures. To achieve this level of execution robustness, we have integrated a procedural executive based on generic procedures with a deductive model-based executive. A procedural executive provides sophisticated control constructs such as loops, parallel activity, locks, and synchronization which are used for robust schedule execution, hierarchical task decomposition, and routine configuration management. A deductive executive provides algorithms for sophisticated state inference and optimal failure recover), planning. The integrated executive enables designers to code knowledge via a combination of procedures and declarative models, yielding a rich modeling capability suitable to the challenges of real spacecraft control. The interface between the two executives ensures both that recovery sequences are smoothly merged into high-level schedule execution and that a high degree of reactivity is retained to effectively handle additional failures during recovery.
Polarizable Molecular Dynamics in a Polarizable Continuum Solvent

PubMed Central

Lipparini, Filippo; Lagardère, Louis; Raynaud, Christophe; Stamm, Benjamin; Cancès, Eric; Mennucci, Benedetta; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip

2015-01-01

We present for the first time scalable polarizable molecular dynamics (MD) simulations within a polarizable continuum solvent with molecular shape cavities and exact solution of the mutual polarization. The key ingredients are a very efficient algorithm for solving the equations associated with the polarizable continuum, in particular, the domain decomposition Conductor-like Screening Model (ddCOSMO), a rigorous coupling of the continuum with the polarizable force field achieved through a robust variational formulation and an effective strategy to solve the coupled equations. The coupling of ddCOSMO with non variational force fields, including AMOEBA, is also addressed. The MD simulations are feasible, for real life systems, on standard cluster nodes; a scalable parallel implementation allows for further speed up in the context of a newly developed module in Tinker, named Tinker-HP. NVE simulations are stable and long term energy conservation can be achieved. This paper is focused on the methodological developments, on the analysis of the algorithm and on the stability of the simulations; a proof-of-concept application is also presented to attest the possibilities of this newly developed technique. PMID:26516318

Application of the parametric proper generalized decomposition to the frequency-dependent calculation of the impedance of an AC line with rectangular conductors

NASA Astrophysics Data System (ADS)

Sancarlos-González, Abel; Pineda-Sanchez, Manuel; Puche-Panadero, Ruben; Sapena-Bano, Angel; Riera-Guasp, Martin; Martinez-Roman, Javier; Perez-Cruz, Juan; Roger-Folch, Jose

2017-12-01

AC lines of industrial busbar systems are usually built using conductors with rectangular cross sections, where each phase can have several parallel conductors to carry high currents. The current density in a rectangular conductor, under sinusoidal conditions, is not uniform. It depends on the frequency, on the conductor shape, and on the distance between conductors, due to the skin effect and to proximity effects. Contrary to circular conductors, there are not closed analytical formulas for obtaining the frequency-dependent impedance of conductors with rectangular cross-section. It is necessary to resort to numerical simulations to obtain the resistance and the inductance of the phases, one for each desired frequency and also for each distance between the phases' conductors. On the contrary, the use of the parametric proper generalized decomposition (PGD) allows to obtain the frequency-dependent impedance of an AC line for a wide range of frequencies and distances between the phases' conductors by solving a single simulation in a 4D domain (spatial coordinates x and y, the frequency and the separation between conductors). In this way, a general "virtual chart" solution is obtained, which contains the solution for any frequency and for any separation of the conductors, and stores it in a compact separated representations form, which can be easily embedded on a more general software for the design of electrical installations. The approach presented in this work for rectangular conductors can be easily extended to conductors with an arbitrary shape.
Implementation of High Time Delay Accuracy of Ultrasonic Phased Array Based on Interpolation CIC Filter

PubMed Central

Liu, Peilu; Li, Xinghua; Li, Haopeng; Su, Zhikun; Zhang, Hongxu

2017-01-01

In order to improve the accuracy of ultrasonic phased array focusing time delay, analyzing the original interpolation Cascade-Integrator-Comb (CIC) filter, an 8× interpolation CIC filter parallel algorithm was proposed, so that interpolation and multichannel decomposition can simultaneously process. Moreover, we summarized the general formula of arbitrary multiple interpolation CIC filter parallel algorithm and established an ultrasonic phased array focusing time delay system based on 8× interpolation CIC filter parallel algorithm. Improving the algorithmic structure, 12.5% of addition and 29.2% of multiplication was reduced, meanwhile the speed of computation is still very fast. Considering the existing problems of the CIC filter, we compensated the CIC filter; the compensated CIC filter’s pass band is flatter, the transition band becomes steep, and the stop band attenuation increases. Finally, we verified the feasibility of this algorithm on Field Programming Gate Array (FPGA). In the case of system clock is 125 MHz, after 8× interpolation filtering and decomposition, time delay accuracy of the defect echo becomes 1 ns. Simulation and experimental results both show that the algorithm we proposed has strong feasibility. Because of the fast calculation, small computational amount and high resolution, this algorithm is especially suitable for applications with high time delay accuracy and fast detection. PMID:29023385
Implementation of High Time Delay Accuracy of Ultrasonic Phased Array Based on Interpolation CIC Filter.

PubMed

Liu, Peilu; Li, Xinghua; Li, Haopeng; Su, Zhikun; Zhang, Hongxu

2017-10-12

In order to improve the accuracy of ultrasonic phased array focusing time delay, analyzing the original interpolation Cascade-Integrator-Comb (CIC) filter, an 8× interpolation CIC filter parallel algorithm was proposed, so that interpolation and multichannel decomposition can simultaneously process. Moreover, we summarized the general formula of arbitrary multiple interpolation CIC filter parallel algorithm and established an ultrasonic phased array focusing time delay system based on 8× interpolation CIC filter parallel algorithm. Improving the algorithmic structure, 12.5% of addition and 29.2% of multiplication was reduced, meanwhile the speed of computation is still very fast. Considering the existing problems of the CIC filter, we compensated the CIC filter; the compensated CIC filter's pass band is flatter, the transition band becomes steep, and the stop band attenuation increases. Finally, we verified the feasibility of this algorithm on Field Programming Gate Array (FPGA). In the case of system clock is 125 MHz, after 8× interpolation filtering and decomposition, time delay accuracy of the defect echo becomes 1 ns. Simulation and experimental results both show that the algorithm we proposed has strong feasibility. Because of the fast calculation, small computational amount and high resolution, this algorithm is especially suitable for applications with high time delay accuracy and fast detection.
Water-fat separation with parallel imaging based on BLADE.

PubMed

Weng, Dehe; Pan, Yanli; Zhong, Xiaodong; Zhuo, Yan

2013-06-01

Uniform suppression of fat signal is desired in clinical applications. Based on phase differences introduced by different chemical shift frequencies, Dixon method and its variations are used as alternatives of fat saturation methods, which are sensitive to B0 inhomogeneities. Iterative Decomposition of water and fat with Echo Asymmetry and Least squares estimation (IDEAL) separates water and fat images with flexible echo shifting. Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction (PROPELLER, alternatively termed as BLADE), in conjunction with IDEAL, yields Turboprop IDEAL (TP-IDEAL) and allows for decomposition of water and fat signal with motion correction. However, the flexibility of its parameter setting is limited, and the related phase correction is complicated. To address these problems, a novel method, BLADE-Dixon, is proposed in this study. This method used the same polarity readout gradients (fly-back gradients) to acquire in-phase and opposed-phases images, which led to less complicated phase correction and more flexible parameter setting compared to TP-IDEAL. Parallel imaging and undersampling were integrated to reduce scan time. Phantom, orbit, neck and knee images were acquired with BLADE-Dixon. Water-fat separation results were compared to those measured with conventional turbo spin echo (TSE) Dixon and TSE with fat saturation, respectively, to demonstrate the performance of BLADE-Dixon. Copyright © 2013 Elsevier Inc. All rights reserved.
Nonadiabatic electron response in the Hasegawa-Wakatani equations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Stoltzfus-Dueck, T.; Scott, B. D.; Krommes, J. A.

2013-08-15

Tokamak edge turbulence is strongly influenced by parallel electron physics, which relaxes density and potential fluctuations towards electron adiabatic response. Beginning with the paradigmatic Hasegawa-Wakatani equations (HWEs) for resistive tokamak edge turbulence, a unique decomposition of the electric potential (φ) into adiabatic (a) and nonadiabatic (b) portions is derived, based on the requirement that a neither drive nor respond to the parallel current j{sub ∥}. The form of the decomposition clarifies that, at perpendicular scales large relative to the sound radius, the electron adiabatic response controls the nonzonal φ, not the fluctuating density n. Simple energy balance arguments allow onemore » to rigorously bound the ratio of rms nonzonal nonadiabatic fluctuations (b(tilde sign)) relative to adiabatic ones (ã). The role of the vorticity nonlinearity in transferring energy between adiabatic and nonadiabatic fluctuations aids intuitive understanding of self-sustained turbulence in the HWEs. When the normalized parallel resistivity is weak, b(tilde sign) becomes effectively slaved, allowing the reduction to an approximate one-field model that remains valid for strong turbulence. In addition to guiding physical intuition, the one-field reduction should greatly ease further analytical manipulations. Direct numerical simulation of the 2D HWEs confirms the convergence of the asymptotic formula for b(tilde sign)« less
Decomposition of timed automata for solving scheduling problems

NASA Astrophysics Data System (ADS)

Nishi, Tatsushi; Wakatake, Masato

2014-03-01

A decomposition algorithm for scheduling problems based on timed automata (TA) model is proposed. The problem is represented as an optimal state transition problem for TA. The model comprises of the parallel composition of submodels such as jobs and resources. The procedure of the proposed methodology can be divided into two steps. The first step is to decompose the TA model into several submodels by using decomposable condition. The second step is to combine individual solution of subproblems for the decomposed submodels by the penalty function method. A feasible solution for the entire model is derived through the iterated computation of solving the subproblem for each submodel. The proposed methodology is applied to solve flowshop and jobshop scheduling problems. Computational experiments demonstrate the effectiveness of the proposed algorithm compared with a conventional TA scheduling algorithm without decomposition.
Adaptive Fourier decomposition based R-peak detection for noisy ECG Signals.

PubMed

Ze Wang; Chi Man Wong; Feng Wan

2017-07-01

An adaptive Fourier decomposition (AFD) based R-peak detection method is proposed for noisy ECG signals. Although lots of QRS detection methods have been proposed in literature, most detection methods require high signal quality. The proposed method extracts the R waves from the energy domain using the AFD and determines the R-peak locations based on the key decomposition parameters, achieving the denoising and the R-peak detection at the same time. Validated by clinical ECG signals in the MIT-BIH Arrhythmia Database, the proposed method shows better performance than the Pan-Tompkin (PT) algorithm in both situations of a native PT and the PT with a denoising process.
Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines

PubMed Central

2010-01-01

Background Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles. Results In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure. Conclusions We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows. PMID:21034480
Reactive Goal Decomposition Hierarchies for On-Board Autonomy

NASA Astrophysics Data System (ADS)

Hartmann, L.

2002-01-01

As our experience grows, space missions and systems are expected to address ever more complex and demanding requirements with fewer resources (e.g., mass, power, budget). One approach to accommodating these higher expectations is to increase the level of autonomy to improve the capabilities and robustness of on- board systems and to simplify operations. The goal decomposition hierarchies described here provide a simple but powerful form of goal-directed behavior that is relatively easy to implement for space systems. A goal corresponds to a state or condition that an operator of the space system would like to bring about. In the system described here goals are decomposed into simpler subgoals until the subgoals are simple enough to execute directly. For each goal there is an activation condition and a set of decompositions. The decompositions correspond to different ways of achieving the higher level goal. Each decomposition contains a gating condition and a set of subgoals to be "executed" sequentially or in parallel. The gating conditions are evaluated in order and for the first one that is true, the corresponding decomposition is executed in order to achieve the higher level goal. The activation condition specifies global conditions (i.e., for all decompositions of the goal) that need to hold in order for the goal to be achieved. In real-time, parameters and state information are passed between goals and subgoals in the decomposition; a termination indication (success, failure, degree) is passed up when a decomposition finishes executing. The lowest level decompositions include servo control loops and finite state machines for generating control signals and sequencing i/o. Semaphores and shared memory are used to synchronize and coordinate decompositions that execute in parallel. The goal decomposition hierarchy is reactive in that the generated behavior is sensitive to the real-time state of the system and the environment. That is, the system is able to react to state and environment and in general can terminate the execution of a decomposition and attempt a new decomposition at any level in the hierarchy. This goal decomposition system is suitable for workstation, microprocessor and fpga implementation and thus is able to support the full range of prototyping activities, from mission design in the laboratory to development of the fpga firmware for the flight system. This approach is based on previous artificial intelligence work including (1) Brooks' subsumption architecture for robot control, (2) Firby's Reactive Action Package System (RAPS) for mediating between high level automated planning and low level execution and (3) hierarchical task networks for automated planning. Reactive goal decomposition hierarchies can be used for a wide variety of on-board autonomy applications including automating low level operation sequences (such as scheduling prerequisite operations, e.g., heaters, warm-up periods, monitoring power constraints), coordinating multiple spacecraft as in formation flying and constellations, robot manipulator operations, rendez-vous, docking, servicing, assembly, on-orbit maintenance, planetary rover operations, solar system and interstellar probes, intelligent science data gathering and disaster early warning. Goal decomposition hierarchies can support high level fault tolerance. Given models of on-board resources and goals to accomplish, the decomposition hierarchy could allocate resources to goals taking into account existing faults and in real-time reallocating resources as new faults arise. Resources to be modeled include memory (e.g., ROM, FPGA configuration memory, processor memory, payload instrument memory), processors, on-board and interspacecraft network nodes and links, sensors, actuators (e.g., attitude determination and control, guidance and navigation) and payload instruments. A goal decomposition hierarchy could be defined to map mission goals and tasks to available on-board resources. As faults occur and are detected the resource allocation is modified to avoid using the faulty resource. Goal decomposition hierarchies can implement variable autonomy (in which the operator chooses to command the system at a high or low level, mixed initiative planning (in which the system is able to interact with the operator, e.g, to request operator intervention when a working envelope is exceeded) and distributed control (in which, for example, multiple spacecraft cooperate to accomplish a task without a fixed master). The full paper will describe in greater detail how goal decompositions work, how they can be implemented, techniques for implementing a candidate application and the current state of the fpga implementation.
An efficient finite element method for simulation of droplet spreading on a topologically rough surface

NASA Astrophysics Data System (ADS)

Luo, Li; Wang, Xiao-Ping; Cai, Xiao-Chuan

2017-11-01

We study numerically the dynamics of a three-dimensional droplet spreading on a rough solid surface using a phase-field model consisting of the coupled Cahn-Hilliard and Navier-Stokes equations with a generalized Navier boundary condition (GNBC). An efficient finite element method on unstructured meshes is introduced to cope with the complex geometry of the solid surfaces. We extend the GNBC to surfaces with complex geometry by including its weak form along different normal and tangential directions in the finite element formulation. The semi-implicit time discretization scheme results in a decoupled system for the phase function, the velocity, and the pressure. In addition, a mass compensation algorithm is introduced to preserve the mass of the droplet. To efficiently solve the decoupled systems, we present a highly parallel solution strategy based on domain decomposition techniques. We validate the newly developed solution method through extensive numerical experiments, particularly for those phenomena that can not be achieved by two-dimensional simulations. On a surface with circular posts, we study how wettability of the rough surface depends on the geometry of the posts. The contact line motion for a droplet spreading over some periodic rough surfaces are also efficiently computed. Moreover, we study the spreading process of an impacting droplet on a microstructured surface, a qualitative agreement is achieved between the numerical and experimental results. The parallel performance suggests that the proposed solution algorithm is scalable with over 4,000 processors cores with tens of millions of unknowns.
Reliable and Efficient Parallel Processing Algorithms and Architectures for Modern Signal Processing. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Liu, Kuojuey Ray

1990-01-01

Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
Dissolved organic matter release in overlying water and bacterial community shifts in biofilm during the decomposition of Myriophyllum verticillatum.

PubMed

Zhang, Lisha; Zhang, Songhe; Lv, Xiaoyang; Qiu, Zheng; Zhang, Ziqiu; Yan, Liying

2018-08-15

This study investigated the alterations in biomass, nutrients and dissolved organic matter concentration in overlying water and determined the bacterial 16S rRNA gene in biofilms attached to plant residual during the decomposition of Myriophyllum verticillatum. The 55-day decomposition experimental results show that plant decay process can be well described by the exponential model, with the average decomposition rate of 0.037d -1 . Total organic carbon, total nitrogen, and organic nitrogen concentrations increased significantly in overlying water during decomposition compared to control within 35d. Results from excitation emission matrix-parallel factor analysis showed humic acid-like and tyrosine acid-like substances might originate from plant degradation processes. Tyrosine acid-like substances had an obvious correlation to organic nitrogen and total nitrogen (p<0.01). Decomposition rates were positively related to pH, total organic carbon, oxidation-reduction potential and dissolved oxygen but negatively related to temperature in overlying water. Microbe densities attached to plant residues increased with decomposition process. The most dominant phylum was Bacteroidetes (>46%) at 7d, Chlorobi (20%-44%) or Proteobacteria (25%-34%) at 21d and Chlorobi (>40%) at 55d. In microbes attached to plant residues, sugar- and polysaccharides-degrading genus including Bacteroides, Blvii28, Fibrobacter, and Treponema dominated at 7d while Chlorobaculum, Rhodobacter, Methanobacterium, Thiobaca, Methanospirillum and Methanosarcina at 21d and 55d. These results gain the insight into the dissolved organic matter release and bacterial community shifts during submerged macrophytes decomposition. Copyright © 2018 Elsevier B.V. All rights reserved.
Leaf and root litter decomposition is discontinued at high altitude tropical montane rainforests contributing to carbon sequestration.

PubMed

Marian, Franca; Sandmann, Dorothee; Krashevska, Valentyna; Maraun, Mark; Scheu, Stefan

2017-08-01

We investigated how altitude affects the decomposition of leaf and root litter in the Andean tropical montane rainforest of southern Ecuador, that is, through changes in the litter quality between altitudes or other site-specific differences in microenvironmental conditions. Leaf litter from three abundant tree species and roots of different diameter from sites at 1,000, 2,000, and 3,000 m were placed in litterbags and incubated for 6, 12, 24, 36, and 48 months. Environmental conditions at the three altitudes and the sampling time were the main factors driving litter decomposition, while origin, and therefore quality of the litter, was of minor importance. At 2,000 and 3,000 m decomposition of litter declined for 12 months reaching a limit value of ~50% of initial and not decomposing further for about 24 months. After 36 months, decomposition commenced at low rates resulting in an average of 37.9% and 44.4% of initial remaining after 48 months. In contrast, at 1,000 m decomposition continued for 48 months until only 10.9% of the initial litter mass remained. Changes in decomposition rates were paralleled by changes in microorganisms with microbial biomass decreasing after 24 months at 2,000 and 3,000 m, while varying little at 1,000 m. The results show that, irrespective of litter origin (1,000, 2,000, 3,000 m) and type (leaves, roots), unfavorable microenvironmental conditions at high altitudes inhibit decomposition processes resulting in the sequestration of carbon in thick organic layers.
Proceedings for the ICASE Workshop on Heterogeneous Boundary Conditions

NASA Technical Reports Server (NTRS)

Perkins, A. Louise; Scroggs, Jeffrey S.

1991-01-01

Domain Decomposition is a complex problem with many interesting aspects. The choice of decomposition can be made based on many different criteria, and the choice of interface of internal boundary conditions are numerous. The various regions under study may have different dynamical balances, indicating that different physical processes are dominating the flow in these regions. This conference was called in recognition of the need to more clearly define the nature of these complex problems. This proceedings is a collection of the presentations and the discussion groups.
TESS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dmitriy Morozov, Tom Peterka

2014-07-29

Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets. As the scale of simulations and observations surpasses billions of particles, a distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this software is a distributed-memory parallel Delaunay and Voronoi tessellation algorithm based on existing serial computational geometry libraries that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include the addition of periodic and wall boundary conditions.
Progressive Vector Quantization on a massively parallel SIMD machine with application to multispectral image data

NASA Technical Reports Server (NTRS)

Manohar, Mareboyana; Tilton, James C.

1994-01-01

A progressive vector quantization (VQ) compression approach is discussed which decomposes image data into a number of levels using full search VQ. The final level is losslessly compressed, enabling lossless reconstruction. The computational difficulties are addressed by implementation on a massively parallel SIMD machine. We demonstrate progressive VQ on multispectral imagery obtained from the Advanced Very High Resolution Radiometer instrument and other Earth observation image data, and investigate the trade-offs in selecting the number of decomposition levels and codebook training method.
Long-term litter decomposition controlled by manganese redox cycling

PubMed Central

Keiluweit, Marco; Nico, Peter; Harmon, Mark E.; Mao, Jingdong; Pett-Ridge, Jennifer; Kleber, Markus

2015-01-01

Litter decomposition is a keystone ecosystem process impacting nutrient cycling and productivity, soil properties, and the terrestrial carbon (C) balance, but the factors regulating decomposition rate are still poorly understood. Traditional models assume that the rate is controlled by litter quality, relying on parameters such as lignin content as predictors. However, a strong correlation has been observed between the manganese (Mn) content of litter and decomposition rates across a variety of forest ecosystems. Here, we show that long-term litter decomposition in forest ecosystems is tightly coupled to Mn redox cycling. Over 7 years of litter decomposition, microbial transformation of litter was paralleled by variations in Mn oxidation state and concentration. A detailed chemical imaging analysis of the litter revealed that fungi recruit and redistribute unreactive Mn2+ provided by fresh plant litter to produce oxidative Mn3+ species at sites of active decay, with Mn eventually accumulating as insoluble Mn3+/4+ oxides. Formation of reactive Mn3+ species coincided with the generation of aromatic oxidation products, providing direct proof of the previously posited role of Mn3+-based oxidizers in the breakdown of litter. Our results suggest that the litter-decomposing machinery at our coniferous forest site depends on the ability of plants and microbes to supply, accumulate, and regenerate short-lived Mn3+ species in the litter layer. This observation indicates that biogeochemical constraints on bioavailability, mobility, and reactivity of Mn in the plant–soil system may have a profound impact on litter decomposition rates. PMID:26372954
Long-term litter decomposition controlled by manganese redox cycling.

PubMed

Keiluweit, Marco; Nico, Peter; Harmon, Mark E; Mao, Jingdong; Pett-Ridge, Jennifer; Kleber, Markus

2015-09-22

Litter decomposition is a keystone ecosystem process impacting nutrient cycling and productivity, soil properties, and the terrestrial carbon (C) balance, but the factors regulating decomposition rate are still poorly understood. Traditional models assume that the rate is controlled by litter quality, relying on parameters such as lignin content as predictors. However, a strong correlation has been observed between the manganese (Mn) content of litter and decomposition rates across a variety of forest ecosystems. Here, we show that long-term litter decomposition in forest ecosystems is tightly coupled to Mn redox cycling. Over 7 years of litter decomposition, microbial transformation of litter was paralleled by variations in Mn oxidation state and concentration. A detailed chemical imaging analysis of the litter revealed that fungi recruit and redistribute unreactive Mn(2+) provided by fresh plant litter to produce oxidative Mn(3+) species at sites of active decay, with Mn eventually accumulating as insoluble Mn(3+/4+) oxides. Formation of reactive Mn(3+) species coincided with the generation of aromatic oxidation products, providing direct proof of the previously posited role of Mn(3+)-based oxidizers in the breakdown of litter. Our results suggest that the litter-decomposing machinery at our coniferous forest site depends on the ability of plants and microbes to supply, accumulate, and regenerate short-lived Mn(3+) species in the litter layer. This observation indicates that biogeochemical constraints on bioavailability, mobility, and reactivity of Mn in the plant-soil system may have a profound impact on litter decomposition rates.
Adaptive Identification by Systolic Arrays.

DTIC Science & Technology

1987-12-01

BIBLIOGRIAPHY Anton , Howard, Elementary Linear Algebra , John Wiley & Sons, 19S4. Cristi, Roberto, A Parallel Structure Jor Adaptive Pole Placement...10 11. SYSTEM IDENTIFICATION M*YETHODS ....................... 12 A. LINEAR SYSTEM MODELING ......................... 12 B. SOLUTION OF SYSTEMS OF... LINEAR EQUATIONS ......... 13 C. QR DECOMPOSITION ................................ 14 D. RECURSIVE LEAST SQUARES ......................... 16 E. BLOCK
Adaptive multigrid domain decomposition solutions for viscous interacting flows

NASA Technical Reports Server (NTRS)

Rubin, Stanley G.; Srinivasan, Kumar

1992-01-01

Several viscous incompressible flows with strong pressure interaction and/or axial flow reversal are considered with an adaptive multigrid domain decomposition procedure. Specific examples include the triple deck structure surrounding the trailing edge of a flat plate, the flow recirculation in a trough geometry, and the flow in a rearward facing step channel. For the latter case, there are multiple recirculation zones, of different character, for laminar and turbulent flow conditions. A pressure-based form of flux-vector splitting is applied to the Navier-Stokes equations, which are represented by an implicit lowest-order reduced Navier-Stokes (RNS) system and a purely diffusive, higher-order, deferred-corrector. A trapezoidal or box-like form of discretization insures that all mass conservation properties are satisfied at interfacial and outflow boundaries, even for this primitive-variable, non-staggered grid computation.

A stable and accurate partitioned algorithm for conjugate heat transfer

NASA Astrophysics Data System (ADS)

Meng, F.; Banks, J. W.; Henshaw, W. D.; Schwendeman, D. W.

2017-09-01

We describe a new partitioned approach for solving conjugate heat transfer (CHT) problems where the governing temperature equations in different material domains are time-stepped in an implicit manner, but where the interface coupling is explicit. The new approach, called the CHAMP scheme (Conjugate Heat transfer Advanced Multi-domain Partitioned), is based on a discretization of the interface coupling conditions using a generalized Robin (mixed) condition. The weights in the Robin condition are determined from the optimization of a condition derived from a local stability analysis of the coupling scheme. The interface treatment combines ideas from optimized-Schwarz methods for domain-decomposition problems together with the interface jump conditions and additional compatibility jump conditions derived from the governing equations. For many problems (i.e. for a wide range of material properties, grid-spacings and time-steps) the CHAMP algorithm is stable and second-order accurate using no sub-time-step iterations (i.e. a single implicit solve of the temperature equation in each domain). In extreme cases (e.g. very fine grids with very large time-steps) it may be necessary to perform one or more sub-iterations. Each sub-iteration generally increases the range of stability substantially and thus one sub-iteration is likely sufficient for the vast majority of practical problems. The CHAMP algorithm is developed first for a model problem and analyzed using normal-mode theory. The theory provides a mechanism for choosing optimal parameters in the mixed interface condition. A comparison is made to the classical Dirichlet-Neumann (DN) method and, where applicable, to the optimized-Schwarz (OS) domain-decomposition method. For problems with different thermal conductivities and diffusivities, the CHAMP algorithm outperforms the DN scheme. For domain-decomposition problems with uniform conductivities and diffusivities, the CHAMP algorithm performs better than the typical OS scheme with one grid-cell overlap. The CHAMP scheme is also developed for general curvilinear grids and CHT examples are presented using composite overset grids that confirm the theory and demonstrate the effectiveness of the approach.
A stable and accurate partitioned algorithm for conjugate heat transfer

DOE PAGES

Meng, F.; Banks, J. W.; Henshaw, W. D.; ...

2017-04-25

We describe a new partitioned approach for solving conjugate heat transfer (CHT) problems where the governing temperature equations in different material domains are time-stepped in a implicit manner, but where the interface coupling is explicit. The new approach, called the CHAMP scheme (Conjugate Heat transfer Advanced Multi-domain Partitioned), is based on a discretization of the interface coupling conditions using a generalized Robin (mixed) condition. The weights in the Robin condition are determined from the optimization of a condition derived from a local stability analysis of the coupling scheme. The interface treatment combines ideas from optimized-Schwarz methods for domain-decomposition problems togethermore » with the interface jump conditions and additional compatibility jump conditions derived from the governing equations. For many problems (i.e. for a wide range of material properties, grid-spacings and time-steps) the CHAMP algorithm is stable and second-order accurate using no sub-time-step iterations (i.e. a single implicit solve of the temperature equation in each domain). In extreme cases (e.g. very fine grids with very large time-steps) it may be necessary to perform one or more sub-iterations. Each sub-iteration generally increases the range of stability substantially and thus one sub-iteration is likely sufficient for the vast majority of practical problems. The CHAMP algorithm is developed first for a model problem and analyzed using normal-mode the- ory. The theory provides a mechanism for choosing optimal parameters in the mixed interface condition. A comparison is made to the classical Dirichlet-Neumann (DN) method and, where applicable, to the optimized- Schwarz (OS) domain-decomposition method. For problems with different thermal conductivities and dif- fusivities, the CHAMP algorithm outperforms the DN scheme. For domain-decomposition problems with uniform conductivities and diffusivities, the CHAMP algorithm performs better than the typical OS scheme with one grid-cell overlap. Lastly, the CHAMP scheme is also developed for general curvilinear grids and CHT ex- amples are presented using composite overset grids that confirm the theory and demonstrate the effectiveness of the approach.« less
A WPS Based Architecture for Climate Data Analytic Services (CDAS) at NASA

NASA Astrophysics Data System (ADS)

Maxwell, T. P.; McInerney, M.; Duffy, D.; Carriere, L.; Potter, G. L.; Doutriaux, C.

2015-12-01

Faced with unprecedented growth in the Big Data domain of climate science, NASA has developed the Climate Data Analytic Services (CDAS) framework. This framework enables scientists to execute trusted and tested analysis operations in a high performance environment close to the massive data stores at NASA. The data is accessed in standard (NetCDF, HDF, etc.) formats in a POSIX file system and processed using trusted climate data analysis tools (ESMF, CDAT, NCO, etc.). The framework is structured as a set of interacting modules allowing maximal flexibility in deployment choices. The current set of module managers include: Staging Manager: Runs the computation locally on the WPS server or remotely using tools such as celery or SLURM. Compute Engine Manager: Runs the computation serially or distributed over nodes using a parallelization framework such as celery or spark. Decomposition Manger: Manages strategies for distributing the data over nodes. Data Manager: Handles the import of domain data from long term storage and manages the in-memory and disk-based caching architectures. Kernel manager: A kernel is an encapsulated computational unit which executes a processor's compute task. Each kernel is implemented in python exploiting existing analysis packages (e.g. CDAT) and is compatible with all CDAS compute engines and decompositions. CDAS services are accessed via a WPS API being developed in collaboration with the ESGF Compute Working Team to support server-side analytics for ESGF. The API can be executed using either direct web service calls, a python script or application, or a javascript-based web application. Client packages in python or javascript contain everything needed to make CDAS requests. The CDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale data sets, where the data resides, to ultimately produce societal benefits. It is is currently deployed at NASA in support of the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single advanced data analytics platform. This service permits decision makers to investigate climate changes around the globe, inspect model trends, compare multiple reanalysis datasets, and variability.
Crystal structures of carbonates up to Mbar pressures determined by single crystal synchrotron radiation diffraction

NASA Astrophysics Data System (ADS)

Merlini, M.

2013-12-01

The recent improvements at synchrotron beamlines, currently allow single crystal diffraction experiments at extreme pressures and temperatures [1,2] on very small single crystal domains. We successfully applied such technique to determine the crystal structure adopted by carbonates at mantle pressures. The knowledge of carbon-bearing phases is in fact fundamental for any quantitative modelling of global carbon cycle. The major technical difficulty arises after first order transitions or decomposition reactions, since original crystal (apx. 10x10x5 μm3) is transformed in much smaller crystalline domains often with random orientation. The use of 3D reciprocal space visualization software and the improved resolution of new generation flat panel detectors, however, allow both identification and integration of each single crystal domain, with suitable accuracy for ab-initio structure solution, performed with direct and charge-flipping methods and successive structure refinements. The results obtained on carbonates, indicate two major crystal-chemistry trends established at high pressures. The CO32- units, planar and parallel in ambient pressure calcite and dolomite structures, becomes non parallel in calcite- and dolomite-II and III phases, allowing more flexibility in the structures with possibility to accommodate strain arising from different cation sizes (Ca and Mg in particular). Dolomite-III is therefore also observed to be thermodynamically stable at lower mantle pressures and temperatures, differently from dolomite, which undergoes decomposition into pure end-members in upper mantle. At higher pressure, towards Mbar (lowermost mantle and D'' region) in agreement with theoretical calculations [3,4] and other experimental results [5], carbon coordination transform into 4-fold CO4 units, with different polymerisation in the structure depending on carbonate composition. The second important crystal chemistry feature detected is related to Fe2+ in Fe-bearing magnesite, which spontaneously oxidises at HP/HT, forming Fe3+ carbonates, Fe3+ oxides and reduced carbon (diamonds). Single crystal diffraction approach allowed full structure determination of these phases, yielding to the discovery of few unpredicted structures, such as Mg2Fe2C4O13 and Fe13O19, which can be well reproduced in different experiments. Mg2Fe2C4O13 carbonate present truncated chain C4O13 groups, and Fe13O19 oxide, whose stoichiometry is intermediate between magnetite and hematite, is a one-layer structure, with features encountered in superconducting materials. The results fully support the ideas of unexpected complexities in the mineralogy of the lowermost mantle, and single crystal technique, once properly optimized in ad-hoc synchrotron beamlines, is fundamental for extracting accurate structural information, otherwise rarely accessible with other experimental techniques. References: [1] Merlini M., Hanfland M. (2013). Single crystal diffraction at Mbar conditions by synchrotron radiation. High Pressure Research, in press. [2] Dubrovinsky et al., (2010). High Pressure Research, 30, 620-633. [3] Arapan et al. (1997). Phys. Rev. Lett., 98, 268501. [4] Oganov et al. (2008) EPSL, 273, 38-47. [5] Boulard et al. (2011) PNAS, 108, 5184-5187.
Local energy decomposition analysis of hydrogen-bonded dimers within a domain-based pair natural orbital coupled cluster study.

PubMed

Altun, Ahmet; Neese, Frank; Bistoni, Giovanni

2018-01-01

The local energy decomposition (LED) analysis allows for a decomposition of the accurate domain-based local pair natural orbital CCSD(T) [DLPNO-CCSD(T)] energy into physically meaningful contributions including geometric and electronic preparation, electrostatic interaction, interfragment exchange, dynamic charge polarization, and London dispersion terms. Herein, this technique is employed in the study of hydrogen-bonding interactions in a series of conformers of water and hydrogen fluoride dimers. Initially, DLPNO-CCSD(T) dissociation energies for the most stable conformers are computed and compared with available experimental data. Afterwards, the decay of the LED terms with the intermolecular distance ( r ) is discussed and results are compared with the ones obtained from the popular symmetry adapted perturbation theory (SAPT). It is found that, as expected, electrostatic contributions slowly decay for increasing r and dominate the interaction energies in the long range. London dispersion contributions decay as expected, as r -6 . They significantly affect the depths of the potential wells. The interfragment exchange provides a further stabilizing contribution that decays exponentially with the intermolecular distance. This information is used to rationalize the trend of stability of various conformers of the water and hydrogen fluoride dimers.
A Fast Solver for Implicit Integration of the Vlasov--Poisson System in the Eulerian Framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garrett, C. Kristopher; Hauck, Cory D.

In this paper, we present a domain decomposition algorithm to accelerate the solution of Eulerian-type discretizations of the linear, steady-state Vlasov equation. The steady-state solver then forms a key component in the implementation of fully implicit or nearly fully implicit temporal integrators for the nonlinear Vlasov--Poisson system. The solver relies on a particular decomposition of phase space that enables the use of sweeping techniques commonly used in radiation transport applications. The original linear system for the phase space unknowns is then replaced by a smaller linear system involving only unknowns on the boundary between subdomains, which can then be solvedmore » efficiently with Krylov methods such as GMRES. Steady-state solves are combined to form an implicit Runge--Kutta time integrator, and the Vlasov equation is coupled self-consistently to the Poisson equation via a linearized procedure or a nonlinear fixed-point method for the electric field. Finally, numerical results for standard test problems demonstrate the efficiency of the domain decomposition approach when compared to the direct application of an iterative solver to the original linear system.« less
A Fast Solver for Implicit Integration of the Vlasov--Poisson System in the Eulerian Framework

DOE PAGES

Garrett, C. Kristopher; Hauck, Cory D.

2018-04-05

In this paper, we present a domain decomposition algorithm to accelerate the solution of Eulerian-type discretizations of the linear, steady-state Vlasov equation. The steady-state solver then forms a key component in the implementation of fully implicit or nearly fully implicit temporal integrators for the nonlinear Vlasov--Poisson system. The solver relies on a particular decomposition of phase space that enables the use of sweeping techniques commonly used in radiation transport applications. The original linear system for the phase space unknowns is then replaced by a smaller linear system involving only unknowns on the boundary between subdomains, which can then be solvedmore » efficiently with Krylov methods such as GMRES. Steady-state solves are combined to form an implicit Runge--Kutta time integrator, and the Vlasov equation is coupled self-consistently to the Poisson equation via a linearized procedure or a nonlinear fixed-point method for the electric field. Finally, numerical results for standard test problems demonstrate the efficiency of the domain decomposition approach when compared to the direct application of an iterative solver to the original linear system.« less
Genten: Software for Generalized Tensor Decompositions v. 1.0.0

DOE Office of Scientific and Technical Information (OSTI.GOV)

Phipps, Eric T.; Kolda, Tamara G.; Dunlavy, Daniel

Tensors, or multidimensional arrays, are a powerful mathematical means of describing multiway data. This software provides computational means for decomposing or approximating a given tensor in terms of smaller tensors of lower dimension, focusing on decomposition of large, sparse tensors. These techniques have applications in many scientific areas, including signal processing, linear algebra, computer vision, numerical analysis, data mining, graph analysis, neuroscience and more. The software is designed to take advantage of parallelism present emerging computer architectures such has multi-core CPUs, many-core accelerators such as the Intel Xeon Phi, and computation-oriented GPUs to enable efficient processing of large tensors.
Application of advanced multidisciplinary analysis and optimization methods to vehicle design synthesis

NASA Technical Reports Server (NTRS)

Consoli, Robert David; Sobieszczanski-Sobieski, Jaroslaw

1990-01-01

Advanced multidisciplinary analysis and optimization methods, namely system sensitivity analysis and non-hierarchical system decomposition, are applied to reduce the cost and improve the visibility of an automated vehicle design synthesis process. This process is inherently complex due to the large number of functional disciplines and associated interdisciplinary couplings. Recent developments in system sensitivity analysis as applied to complex non-hierarchic multidisciplinary design optimization problems enable the decomposition of these complex interactions into sub-processes that can be evaluated in parallel. The application of these techniques results in significant cost, accuracy, and visibility benefits for the entire design synthesis process.
On a concurrent element-by-element preconditioned conjugate gradient algorithm for multiple load cases

NASA Technical Reports Server (NTRS)

Watson, Brian; Kamat, M. P.

1990-01-01

Element-by-element preconditioned conjugate gradient (EBE-PCG) algorithms have been advocated for use in parallel/vector processing environments as being superior to the conventional LDL(exp T) decomposition algorithm for single load cases. Although there may be some advantages in using such algorithms for a single load case, when it comes to situations involving multiple load cases, the LDL(exp T) decomposition algorithm would appear to be decidedly more cost-effective. The authors have outlined an EBE-PCG algorithm suitable for multiple load cases and compared its effectiveness to the highly efficient LDL(exp T) decomposition scheme. The proposed algorithm offers almost no advantages over the LDL(exp T) algorithm for the linear problems investigated on the Alliant FX/8. However, there may be some merit in the algorithm in solving nonlinear problems with load incrementation, but that remains to be investigated.
Anti-parallel polarization switching in a triglycine sulfate organic ferroelectric insulator: The role of surface charges

NASA Astrophysics Data System (ADS)

Ma, He; Wu, Zhuangchun; Peng, Dongwen; Wang, Yaojin; Wang, Yiping; Yang, Ying; Yuan, Guoliang

2018-04-01

Four consecutive ferroelectric polarization switchings and an abnormal ring-like domain pattern can be introduced by a single tip bias of a piezoresponse force microscope in the (010) triglycine sulfate (TGS) crystal. The external electric field anti-parallel to the original polarization induces the first polarization switching; however, the surface charges of TGS can move toward the tip location and induce the second polarization switching once the tip bias is removed. The two switchings allow a ring-like pattern composed of the central domain with downward polarization and the outer domain with upward polarization. Once the two domains disappear gradually as a result of depolarization, the other two polarization switchings occur one by one at the TGS where the tip contacts. However, the backswitching phenomenon does not occur when the external electric field is parallel to the original polarization. These results can be explained according to the surface charges instead of the charges injected inside.
Randomized Dynamic Mode Decomposition

NASA Astrophysics Data System (ADS)

Erichson, N. Benjamin; Brunton, Steven L.; Kutz, J. Nathan

2017-11-01

The dynamic mode decomposition (DMD) is an equation-free, data-driven matrix decomposition that is capable of providing accurate reconstructions of spatio-temporal coherent structures arising in dynamical systems. We present randomized algorithms to compute the near-optimal low-rank dynamic mode decomposition for massive datasets. Randomized algorithms are simple, accurate and able to ease the computational challenges arising with `big data'. Moreover, randomized algorithms are amenable to modern parallel and distributed computing. The idea is to derive a smaller matrix from the high-dimensional input data matrix using randomness as a computational strategy. Then, the dynamic modes and eigenvalues are accurately learned from this smaller representation of the data, whereby the approximation quality can be controlled via oversampling and power iterations. Here, we present randomized DMD algorithms that are categorized by how many passes the algorithm takes through the data. Specifically, the single-pass randomized DMD does not require data to be stored for subsequent passes. Thus, it is possible to approximately decompose massive fluid flows (stored out of core memory, or not stored at all) using single-pass algorithms, which is infeasible with traditional DMD algorithms.
Domain Adaptation of Translation Models for Multilingual Applications

DTIC Science & Technology

2009-04-01

expansion effect that corpus (or dictionary ) based trans- lation introduces - however, this effect is maintained even with monolingual query expansion [12...every day; bilingual web pages are harvested as parallel corpora as the quantity of non-English data on the web increases; online dictionaries of...approach is to customize translation models to a domain, by automatically selecting the resources ( dictionaries , parallel corpora) that are best for
Intercomparison of Multiscale Modeling Approaches in Simulating Subsurface Flow and Transport

NASA Astrophysics Data System (ADS)

Yang, X.; Mehmani, Y.; Barajas-Solano, D. A.; Song, H. S.; Balhoff, M.; Tartakovsky, A. M.; Scheibe, T. D.

2016-12-01

Hybrid multiscale simulations that couple models across scales are critical to advance predictions of the larger system behavior using understanding of fundamental processes. In the current study, three hybrid multiscale methods are intercompared: multiscale loose-coupling method, multiscale finite volume (MsFV) method and multiscale mortar method. The loose-coupling method enables a parallel workflow structure based on the Swift scripting environment that manages the complex process of executing coupled micro- and macro-scale models without being intrusive to the at-scale simulators. The MsFV method applies microscale and macroscale models over overlapping subdomains of the modeling domain and enforces continuity of concentration and transport fluxes between models via restriction and prolongation operators. The mortar method is a non-overlapping domain decomposition approach capable of coupling all permutations of pore- and continuum-scale models with each other. In doing so, Lagrange multipliers are used at interfaces shared between the subdomains so as to establish continuity of species/fluid mass flux. Subdomain computations can be performed either concurrently or non-concurrently depending on the algorithm used. All the above methods have been proven to be accurate and efficient in studying flow and transport in porous media. However, there has not been any field-scale applications and benchmarking among various hybrid multiscale approaches. To address this challenge, we apply all three hybrid multiscale methods to simulate water flow and transport in a conceptualized 2D modeling domain of the hyporheic zone, where strong interactions between groundwater and surface water exist across multiple scales. In all three multiscale methods, fine-scale simulations are applied to a thin layer of riverbed alluvial sediments while the macroscopic simulations are used for the larger subsurface aquifer domain. Different numerical coupling methods are then applied between scales and inter-compared. Comparisons are drawn in terms of velocity distributions, solute transport behavior, algorithm-induced numerical error and computing cost. The intercomparison work provides support for confidence in a variety of hybrid multiscale methods and motivates further development and applications.
Parallel Implementation of Triangular Cellular Automata for Computing Two-Dimensional Elastodynamic Response on Arbitrary Domains

NASA Astrophysics Data System (ADS)

Leamy, Michael J.; Springer, Adam C.

In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Heterodyne frequency-domain multispectral diffuse optical tomography of breast cancer in the parallel-plane transmission geometry

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ban, H. Y.; Kavuri, V. C., E-mail: venk@physics.up

Purpose: The authors introduce a state-of-the-art all-optical clinical diffuse optical tomography (DOT) imaging instrument which collects spatially dense, multispectral, frequency-domain breast data in the parallel-plate geometry. Methods: The instrument utilizes a CCD-based heterodyne detection scheme that permits massively parallel detection of diffuse photon density wave amplitude and phase for a large number of source–detector pairs (10{sup 6}). The stand-alone clinical DOT instrument thus offers high spatial resolution with reduced crosstalk between absorption and scattering. Other novel features include a fringe profilometry system for breast boundary segmentation, real-time data normalization, and a patient bed design which permits both axial and sagittalmore » breast measurements. Results: The authors validated the instrument using tissue simulating phantoms with two different chromophore-containing targets and one scattering target. The authors also demonstrated the instrument in a case study breast cancer patient; the reconstructed 3D image of endogenous chromophores and scattering gave tumor localization in agreement with MRI. Conclusions: Imaging with a novel parallel-plate DOT breast imager that employs highly parallel, high-resolution CCD detection in the frequency-domain was demonstrated.« less
Progressive Precision Surface Design

DOE Office of Scientific and Technical Information (OSTI.GOV)

Duchaineau, M; Joy, KJ

2002-01-11

We introduce a novel wavelet decomposition algorithm that makes a number of powerful new surface design operations practical. Wavelets, and hierarchical representations generally, have held promise to facilitate a variety of design tasks in a unified way by approximating results very precisely, thus avoiding a proliferation of undergirding mathematical representations. However, traditional wavelet decomposition is defined from fine to coarse resolution, thus limiting its efficiency for highly precise surface manipulation when attempting to create new non-local editing methods. Our key contribution is the progressive wavelet decomposition algorithm, a general-purpose coarse-to-fine method for hierarchical fitting, based in this paper on anmore » underlying multiresolution representation called dyadic splines. The algorithm requests input via a generic interval query mechanism, allowing a wide variety of non-local operations to be quickly implemented. The algorithm performs work proportionate to the tiny compressed output size, rather than to some arbitrarily high resolution that would otherwise be required, thus increasing performance by several orders of magnitude. We describe several design operations that are made tractable because of the progressive decomposition. Free-form pasting is a generalization of the traditional control-mesh edit, but for which the shape of the change is completely general and where the shape can be placed using a free-form deformation within the surface domain. Smoothing and roughening operations are enhanced so that an arbitrary loop in the domain specifies the area of effect. Finally, the sculpting effect of moving a tool shape along a path is simulated.« less
Néel walls between tailored parallel-stripe domains in IrMn/CoFe exchange bias layers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ueltzhöffer, Timo, E-mail: timo.ueltzhoeffer@physik.uni-kassel.de; Schmidt, Christoph; Ehresmann, Arno

Tailored parallel-stripe magnetic domains with antiparallel magnetizations in adjacent domains along the long stripe axis have been fabricated in an IrMn/CoFe Exchange Bias thin film system by 10 keV He{sup +}-ion bombardment induced magnetic patterning. Domain walls between these domains are of Néel type and asymmetric as they separate domains of different anisotropies. X-ray magnetic circular dichroism asymmetry images were obtained by x-ray photoelectron emission microscopy at the Co/Fe L{sub 3} edges at the synchrotron radiation source BESSY II. They revealed Néel-wall tail widths of 1 μm in agreement with the results of a model that was modified in order to describemore » such walls. Similarly obtained domain core widths show a discrepancy to values estimated from the model, but could be explained by experimental broadening. The rotation senses in adjacent walls were determined, yielding unwinding domain walls with non-interacting walls in this layer system.« less
A general framework of noise suppression in material decomposition for dual-energy CT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Petrongolo, Michael; Dong, Xue; Zhu, Lei, E-mail: leizhu@gatech.edu

Purpose: As a general problem of dual-energy CT (DECT), noise amplification in material decomposition severely reduces the signal-to-noise ratio on the decomposed images compared to that on the original CT images. In this work, the authors propose a general framework of noise suppression in material decomposition for DECT. The method is based on an iterative algorithm recently developed in their group for image-domain decomposition of DECT, with an extension to include nonlinear decomposition models. The generalized framework of iterative DECT decomposition enables beam-hardening correction with simultaneous noise suppression, which improves the clinical benefits of DECT. Methods: The authors propose tomore » suppress noise on the decomposed images of DECT using convex optimization, which is formulated in the form of least-squares estimation with smoothness regularization. Based on the design principles of a best linear unbiased estimator, the authors include the inverse of the estimated variance–covariance matrix of the decomposed images as the penalty weight in the least-squares term. Analytical formulas are derived to compute the variance–covariance matrix for decomposed images with general-form numerical or analytical decomposition. As a demonstration, the authors implement the proposed algorithm on phantom data using an empirical polynomial function of decomposition measured on a calibration scan. The polynomial coefficients are determined from the projection data acquired on a wedge phantom, and the signal decomposition is performed in the projection domain. Results: On the Catphan{sup ®}600 phantom, the proposed noise suppression method reduces the average noise standard deviation of basis material images by one to two orders of magnitude, with a superior performance on spatial resolution as shown in comparisons of line-pair images and modulation transfer function measurements. On the synthesized monoenergetic CT images, the noise standard deviation is reduced by a factor of 2–3. By using nonlinear decomposition on projections, the authors’ method effectively suppresses the streaking artifacts of beam hardening and obtains more uniform images than their previous approach based on a linear model. Similar performance of noise suppression is observed in the results of an anthropomorphic head phantom and a pediatric chest phantom generated by the proposed method. With beam-hardening correction enabled by their approach, the image spatial nonuniformity on the head phantom is reduced from around 10% on the original CT images to 4.9% on the synthesized monoenergetic CT image. On the pediatric chest phantom, their method suppresses image noise standard deviation by a factor of around 7.5, and compared with linear decomposition, it reduces the estimation error of electron densities from 33.3% to 8.6%. Conclusions: The authors propose a general framework of noise suppression in material decomposition for DECT. Phantom studies have shown the proposed method improves the image uniformity and the accuracy of electron density measurements by effective beam-hardening correction and reduces noise level without noticeable resolution loss.« less
Comparison of a 3-D GPU-Assisted Maxwell Code and Ray Tracing for Reflectometry on ITER

NASA Astrophysics Data System (ADS)

Gady, Sarah; Kubota, Shigeyuki; Johnson, Irena

2015-11-01

Electromagnetic wave propagation and scattering in magnetized plasmas are important diagnostics for high temperature plasmas. 1-D and 2-D full-wave codes are standard tools for measurements of the electron density profile and fluctuations; however, ray tracing results have shown that beam propagation in tokamak plasmas is inherently a 3-D problem. The GPU-Assisted Maxwell Code utilizes the FDTD (Finite-Difference Time-Domain) method for solving the Maxwell equations with the cold plasma approximation in a 3-D geometry. Parallel processing with GPGPU (General-Purpose computing on Graphics Processing Units) is used to accelerate the computation. Previously, we reported on initial comparisons of the code results to 1-D numerical and analytical solutions, where the size of the computational grid was limited by the on-board memory of the GPU. In the current study, this limitation is overcome by using domain decomposition and an additional GPU. As a practical application, this code is used to study the current design of the ITER Low Field Side Reflectometer (LSFR) for the Equatorial Port Plug 11 (EPP11). A detailed examination of Gaussian beam propagation in the ITER edge plasma will be presented, as well as comparisons with ray tracing. This work was made possible by funding from the Department of Energy for the Summer Undergraduate Laboratory Internship (SULI) program. This work is supported by the US DOE Contract No.DE-AC02-09CH11466 and DE-FG02-99-ER54527.

Simultaneous F 0-F 1 modifications of Arabic for the improvement of natural-sounding

NASA Astrophysics Data System (ADS)

Ykhlef, F.; Bensebti, M.

2013-03-01

Pitch (F 0) modification is one of the most important problems in the area of speech synthesis. Several techniques have been developed in the literature to achieve this goal. The main restrictions of these techniques are in the modification range and the synthesised speech quality, intelligibility and naturalness. The control of formants in a spoken language can significantly improve the naturalness of the synthesised speech. This improvement is mainly dependent on the control of the first formant (F 1). Inspired by this observation, this article proposes a new approach that modifies both F 0 and F 1 of Arabic voiced sounds in order to improve the naturalness of the pitch shifted speech. The developed strategy takes a parallel processing approach, in which the analysis segments are decomposed into sub-bands in the wavelet domain, modified in the desired sub-band by using a resampling technique and reconstructed without affecting the remained sub-bands. Pitch marking and voicing detection are performed in the frequency decomposition step based on the comparison of the multi-level approximation and detail signals. The performance of the proposed technique is evaluated by listening tests and compared to the pitch synchronous overlap and add (PSOLA) technique in the third approximation level. Experimental results have shown that the manipulation in the wavelet domain of F 0 in conjunction with F 1 guarantees natural-sounding of the synthesised speech compared to the classical pitch modification technique. This improvement was appropriate for high pitch modifications.
MO-FG-204-03: Using Edge-Preserving Algorithm for Significantly Improved Image-Domain Material Decomposition in Dual Energy CT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhao, W; Niu, T; Xing, L

2015-06-15

Purpose: To significantly improve dual energy CT (DECT) imaging by establishing a new theoretical framework of image-domain material decomposition with incorporation of edge-preserving techniques. Methods: The proposed algorithm, HYPR-NLM, combines the edge-preserving non-local mean filter (NLM) with the HYPR-LR (Local HighlY constrained backPRojection Reconstruction) framework. Image denoising using HYPR-LR framework depends on the noise level of the composite image which is the average of the different energy images. For DECT, the composite image is the average of high- and low-energy images. To further reduce noise, one may want to increase the window size of the filter of the HYPR-LR, leadingmore » resolution degradation. By incorporating the NLM filtering and the HYPR-LR framework, HYPR-NLM reduces the boost material decomposition noise using energy information redundancies as well as the non-local mean. We demonstrate the noise reduction and resolution preservation of the algorithm with both iodine concentration numerical phantom and clinical patient data by comparing the HYPR-NLM algorithm to the direct matrix inversion, HYPR-LR and iterative image-domain material decomposition (Iter-DECT). Results: The results show iterative material decomposition method reduces noise to the lowest level and provides improved DECT images. HYPR-NLM significantly reduces noise while preserving the accuracy of quantitative measurement and resolution. For the iodine concentration numerical phantom, the averaged noise levels are about 2.0, 0.7, 0.2 and 0.4 for direct inversion, HYPR-LR, Iter- DECT and HYPR-NLM, respectively. For the patient data, the noise levels of the water images are about 0.36, 0.16, 0.12 and 0.13 for direct inversion, HYPR-LR, Iter-DECT and HYPR-NLM, respectively. Difference images of both HYPR-LR and Iter-DECT show edge effect, while no significant edge effect is shown for HYPR-NLM, suggesting spatial resolution is well preserved for HYPR-NLM. Conclusion: HYPR-NLM provides an effective way to reduce the generic magnified image noise of dual–energy material decomposition while preserving resolution. This work is supported in part by NIH grants 7R01HL111141 and 1R01-EB016777. This work is also supported by the Natural Science Foundation of China (NSFC Grant No. 81201091), Fundamental Research Funds for the Central Universities in China, and Fund Project for Excellent Abroad Scholar Personnel in Science and Technology.« less
How localized is ``local?'' Efficiency vs. accuracy of O(N) domain decomposition in local orbital based all-electron electronic structure theory

NASA Astrophysics Data System (ADS)

Havu, Vile; Blum, Volker; Scheffler, Matthias

2007-03-01

Numeric atom-centered local orbitals (NAO) are efficient basis sets for all-electron electronic structure theory. The locality of NAO's can be exploited to render (in principle) all operations of the self-consistency cycle O(N). This is straightforward for 3D integrals using domain decomposition into spatially close subsets of integration points, enabling critical computational savings that are effective from ˜tens of atoms (no significant overhead for smaller systems) and make large systems (100s of atoms) computationally feasible. Using a new all-electron NAO-based code,^1 we investigate the quantitative impact of exploiting this locality on two distinct classes of systems: Large light-element molecules [Alanine-based polypeptide chains (Ala)n], and compact transition metal clusters. Strict NAO locality is achieved by imposing a cutoff potential with an onset radius rc, and exploited by appropriately shaped integration domains (subsets of integration points). Conventional tight rc<= 3å have no measurable accuracy impact in (Ala)n, but introduce inaccuracies of 20-30 meV/atom in Cun. The domain shape impacts the computational effort by only 10-20 % for reasonable rc. ^1 V. Blum, R. Gehrke, P. Havu, V. Havu, M. Scheffler, The FHI Ab Initio Molecular Simulations (aims) Project, Fritz-Haber-Institut, Berlin (2006).
Self-force via m-mode regularization and 2+1D evolution. II. Scalar-field implementation on Kerr spacetime

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dolan, Sam R.; Barack, Leor; Wardell, Barry

2011-10-15

This is the second in a series of papers aimed at developing a practical time-domain method for self-force calculations in Kerr spacetime. The key elements of the method are (i) removal of a singular part of the perturbation field with a suitable analytic 'puncture' based on the Detweiler-Whiting decomposition, (ii) decomposition of the perturbation equations in azimuthal (m-)modes, taking advantage of the axial symmetry of the Kerr background, (iii) numerical evolution of the individual m-modes in 2+1 dimensions with a finite-difference scheme, and (iv) reconstruction of the physical self-force from the mode sum. Here we report an implementation of themore » method to compute the scalar-field self-force along circular equatorial geodesic orbits around a Kerr black hole. This constitutes a first time-domain computation of the self-force in Kerr geometry. Our time-domain code reproduces the results of a recent frequency-domain calculation by Warburton and Barack, but has the added advantage of being readily adaptable to include the backreaction from the self-force in a self-consistent manner. In a forthcoming paper--the third in the series--we apply our method to the gravitational self-force (in the Lorenz gauge).« less
Parallel Anisotropic Tetrahedral Adaptation

NASA Technical Reports Server (NTRS)

Park, Michael A.; Darmofal, David L.

2008-01-01

An adaptive method that robustly produces high aspect ratio tetrahedra to a general 3D metric specification without introducing hybrid semi-structured regions is presented. The elemental operators and higher-level logic is described with their respective domain-decomposed parallelizations. An anisotropic tetrahedral grid adaptation scheme is demonstrated for 1000-1 stretching for a simple cube geometry. This form of adaptation is applicable to more complex domain boundaries via a cut-cell approach as demonstrated by a parallel 3D supersonic simulation of a complex fighter aircraft. To avoid the assumptions and approximations required to form a metric to specify adaptation, an approach is introduced that directly evaluates interpolation error. The grid is adapted to reduce and equidistribute this interpolation error calculation without the use of an intervening anisotropic metric. Direct interpolation error adaptation is illustrated for 1D and 3D domains.
Parallelizing serial code for a distributed processing environment with an application to high frequency electromagnetic scattering

NASA Astrophysics Data System (ADS)

Work, Paul R.

1991-12-01

This thesis investigates the parallelization of existing serial programs in computational electromagnetics for use in a parallel environment. Existing algorithms for calculating the radar cross section of an object are covered, and a ray-tracing code is chosen for implementation on a parallel machine. Current parallel architectures are introduced and a suitable parallel machine is selected for the implementation of the chosen ray-tracing algorithm. The standard techniques for the parallelization of serial codes are discussed, including load balancing and decomposition considerations, and appropriate methods for the parallelization effort are selected. A load balancing algorithm is modified to increase the efficiency of the application, and a high level design of the structure of the serial program is presented. A detailed design of the modifications for the parallel implementation is also included, with both the high level and the detailed design specified in a high level design language called UNITY. The correctness of the design is proven using UNITY and standard logic operations. The theoretical and empirical results show that it is possible to achieve an efficient parallel application for a serial computational electromagnetic program where the characteristics of the algorithm and the target architecture critically influence the development of such an implementation.
The Data Transfer Kit: A geometric rendezvous-based tool for multiphysics data transfer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Slattery, S. R.; Wilson, P. P. H.; Pawlowski, R. P.

2013-07-01

The Data Transfer Kit (DTK) is a software library designed to provide parallel data transfer services for arbitrary physics components based on the concept of geometric rendezvous. The rendezvous algorithm provides a means to geometrically correlate two geometric domains that may be arbitrarily decomposed in a parallel simulation. By repartitioning both domains such that they have the same geometric domain on each parallel process, efficient and load balanced search operations and data transfer can be performed at a desirable algorithmic time complexity with low communication overhead relative to other types of mapping algorithms. With the increased development efforts in multiphysicsmore » simulation and other multiple mesh and geometry problems, generating parallel topology maps for transferring fields and other data between geometric domains is a common operation. The algorithms used to generate parallel topology maps based on the concept of geometric rendezvous as implemented in DTK are described with an example using a conjugate heat transfer calculation and thermal coupling with a neutronics code. In addition, we provide the results of initial scaling studies performed on the Jaguar Cray XK6 system at Oak Ridge National Laboratory for a worse-case-scenario problem in terms of algorithmic complexity that shows good scaling on 0(1 x 104) cores for topology map generation and excellent scaling on 0(1 x 105) cores for the data transfer operation with meshes of O(1 x 109) elements. (authors)« less
The Effect of Semantic Transparency on the Processing of Morphologically Derived Words: Evidence from Decision Latencies and Event-Related Potentials

ERIC Educational Resources Information Center

Jared, Debra; Jouravlev, Olessia; Joanisse, Marc F.

2017-01-01

Decomposition theories of morphological processing in visual word recognition posit an early morpho-orthographic parser that is blind to semantic information, whereas parallel distributed processing (PDP) theories assume that the transparency of orthographic-semantic relationships influences processing from the beginning. To test these…
nem_spread Ver. 5.10

DOE Office of Scientific and Technical Information (OSTI.GOV)

HENNIGAN, GARY; SHADID, JOHN; SJAARDEMA, GREGORY

2009-06-08

Nem_spread reads it's input command file (default name nem_spread.inp), takes the named ExodusII geometry definition and spreads out the geometry (and optionally results) contained in that file out to a parallel disk system. The decomposition is taken from a scalar Nemesis load balance file generated by the companion utility nem_slice.
A cuboctahedral platinum (Pt79) nanocluster enclosed by well defined facets favours di-sigma adsorption and improves the reaction kinetics for methanol fuel cells.

PubMed

Mahata, Arup; Choudhuri, Indrani; Pathak, Biswarup

2015-08-28

The methanol dehydrogenation steps are studied very systematically on the (111) facet of a cuboctahedral platinum (Pt79) nanocluster enclosed by well-defined facets. The various intermediates formed during the methanol decompositions are adsorbed at the edge and bridge site of the facet either vertically (through C- and O-centres) or in parallel. The di-sigma adsorption (in parallel) on the (111) facet of the nanocluster is the most stable structure for most of the intermediates and such binding improves the interaction between the substrate and the nanocluster and thus the catalytic activity. The reaction thermodynamics, activation barrier, and temperature dependent reaction rates are calculated for all the successive methanol dehydrogenation steps to understand the methanol decomposition mechanism, and these values are compared with previous studies to understand the catalytic activity of the nanocluster. We find the catalytic activity of the nanocluster is excellent while comparing with any previous reports and the methanol dehydrogenation thermodynamics and kinetics are best when the intermediates are adsorbed in a di-sigma manner.
TMVOC-MP: a parallel numerical simulator for Three-PhaseNon-isothermal Flows of Multicomponent Hydrocarbon Mixtures inporous/fractured media

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Keni; Yamamoto, Hajime; Pruess, Karsten

2008-02-15

TMVOC-MP is a massively parallel version of the TMVOC code (Pruess and Battistelli, 2002), a numerical simulator for three-phase non-isothermal flow of water, gas, and a multicomponent mixture of volatile organic chemicals (VOCs) in multidimensional heterogeneous porous/fractured media. TMVOC-MP was developed by introducing massively parallel computing techniques into TMVOC. It retains the physical process model of TMVOC, designed for applications to contamination problems that involve hydrocarbon fuels or organic solvents in saturated and unsaturated zones. TMVOC-MP can model contaminant behavior under 'natural' environmental conditions, as well as for engineered systems, such as soil vapor extraction, groundwater pumping, or steam-assisted sourcemore » remediation. With its sophisticated parallel computing techniques, TMVOC-MP can handle much larger problems than TMVOC, and can be much more computationally efficient. TMVOC-MP models multiphase fluid systems containing variable proportions of water, non-condensible gases (NCGs), and water-soluble volatile organic chemicals (VOCs). The user can specify the number and nature of NCGs and VOCs. There are no intrinsic limitations to the number of NCGs or VOCs, although the arrays for fluid components are currently dimensioned as 20, accommodating water plus 19 components that may be either NCGs or VOCs. Among them, NCG arrays are dimensioned as 10. The user may select NCGs from a data bank provided in the software. The currently available choices include O{sub 2}, N{sub 2}, CO{sub 2}, CH{sub 4}, ethane, ethylene, acetylene, and air (a pseudo-component treated with properties averaged from N{sub 2} and O{sub 2}). Thermophysical property data of VOCs can be selected from a chemical data bank, included with TMVOC-MP, that provides parameters for 26 commonly encountered chemicals. Users also can input their own data for other fluids. The fluid components may partition (volatilize and/or dissolve) among gas, aqueous, and NAPL phases. Any combination of the three phases may present, and phases may appear and disappear in the course of a simulation. In addition, VOCs may be adsorbed by the porous medium, and may biodegrade according to a simple half-life model. Detailed discussion of physical processes, assumptions, and fluid properties used in TMVOC-MP can be found in the TMVOC user's guide (Pruess and Battistelli, 2002). TMVOC-MP was developed based on the parallel framework of the TOUGH2-MP code (Zhang et al. 2001, Wu et al. 2002). It uses the MPI (Message Passing Forum, 1994) for parallel implementation. A domain decomposition approach is adopted for the parallelization. The code partitions a simulation domain, defined by an unstructured grid, using partitioning algorithm from the METIS software package (Karypsis and Kumar, 1998). In parallel simulation, each processor is in charge of one part of the simulation domain for assembling mass and energy balance equations, solving linear equation systems, updating thermophysical properties, and performing other local computations. The local linear-equation systems are solved in parallel by multiple processors with the Aztec linear solver package (Tuminaro et al., 1999). Although each processor solves the linearized equations of subdomains independently, the entire linear equation system is solved together by all processors collaboratively via communication between neighboring processors during each iteration. Detailed discussion of the prototype of the data-exchange scheme can be found in Elmroth et al. (2001). In addition, FORTRAN 90 features are introduced to TMVOC-MP, such as dynamic memory allocation, array operation, matrix manipulation, and replacing 'common blocks' (used in the original TMVOC) with modules. All new subroutines are written in FORTRAN 90. Program units imported from the original TMVOC remain in standard FORTRAN 77. This report provides a quick starting guide for using the TMVOC-MP program. We suppose that the users have basic knowledge of using the original TMVOC code. The users can find the detailed technical description of the physical processes modeled, and the mathematical and numerical methods in the user's guide for TMVOC (Pruess and Battistelli, 2002).« less
An experiment in hurricane track prediction using parallel computing methods

NASA Technical Reports Server (NTRS)

Song, Chang G.; Jwo, Jung-Sing; Lakshmivarahan, S.; Dhall, S. K.; Lewis, John M.; Velden, Christopher S.

1994-01-01

The barotropic model is used to explore the advantages of parallel processing in deterministic forecasting. We apply this model to the track forecasting of hurricane Elena (1985). In this particular application, solutions to systems of elliptic equations are the essence of the computational mechanics. One set of equations is associated with the decomposition of the wind into irrotational and nondivergent components - this determines the initial nondivergent state. Another set is associated with recovery of the streamfunction from the forecasted vorticity. We demonstrate that direct parallel methods based on accelerated block cyclic reduction (BCR) significantly reduce the computational time required to solve the elliptic equations germane to this decomposition and forecast problem. A 72-h track prediction was made using incremental time steps of 16 min on a network of 3000 grid points nominally separated by 100 km. The prediction took 30 sec on the 8-processor Alliant FX/8 computer. This was a speed-up of 3.7 when compared to the one-processor version. The 72-h prediction of Elena's track was made as the storm moved toward Florida's west coast. Approximately 200 km west of Tampa Bay, Elena executed a dramatic recurvature that ultimately changed its course toward the northwest. Although the barotropic track forecast was unable to capture the hurricane's tight cycloidal looping maneuver, the subsequent northwesterly movement was accurately forecasted as was the location and timing of landfall near Mobile Bay.
Parallel Architecture, Parallel Acquisition Cross-Linguistic Evidence from Nominal and Verbal Domains

ERIC Educational Resources Information Center

Sutton, Brett R.

2017-01-01

This dissertation explores parallels between Complementizer Phrase (CP) and Determiner Phrase (DP) semantics, syntax, and morphology--including similarities in case-assignment, subject-verb and possessor-possessum agreement, subject and possessor semantics, and overall syntactic structure--in first language acquisition. Applying theoretical…
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.

PubMed

Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

PubMed Central

Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230
Multitrace/singletrace formulations and Domain Decomposition Methods for the solution of Helmholtz transmission problems for bounded composite scatterers

NASA Astrophysics Data System (ADS)

Jerez-Hanckes, Carlos; Pérez-Arancibia, Carlos; Turc, Catalin

2017-12-01

We present Nyström discretizations of multitrace/singletrace formulations and non-overlapping Domain Decomposition Methods (DDM) for the solution of Helmholtz transmission problems for bounded composite scatterers with piecewise constant material properties. We investigate the performance of DDM with both classical Robin and optimized transmission boundary conditions. The optimized transmission boundary conditions incorporate square root Fourier multiplier approximations of Dirichlet to Neumann operators. While the multitrace/singletrace formulations as well as the DDM that use classical Robin transmission conditions are not particularly well suited for Krylov subspace iterative solutions of high-contrast high-frequency Helmholtz transmission problems, we provide ample numerical evidence that DDM with optimized transmission conditions constitute efficient computational alternatives for these type of applications. In the case of large numbers of subdomains with different material properties, we show that the associated DDM linear system can be efficiently solved via hierarchical Schur complements elimination.
Basis adaptation and domain decomposition for steady partial differential equations with random coefficients

DOE PAGES

Tipireddy, R.; Stinis, P.; Tartakovsky, A. M.

2017-09-04

In this paper, we present a novel approach for solving steady-state stochastic partial differential equations (PDEs) with high-dimensional random parameter space. The proposed approach combines spatial domain decomposition with basis adaptation for each subdomain. The basis adaptation is used to address the curse of dimensionality by constructing an accurate low-dimensional representation of the stochastic PDE solution (probability density function and/or its leading statistical moments) in each subdomain. Restricting the basis adaptation to a specific subdomain affords finding a locally accurate solution. Then, the solutions from all of the subdomains are stitched together to provide a global solution. We support ourmore » construction with numerical experiments for a steady-state diffusion equation with a random spatially dependent coefficient. Lastly, our results show that highly accurate global solutions can be obtained with significantly reduced computational costs.« less
Basis adaptation and domain decomposition for steady-state partial differential equations with random coefficients

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tipireddy, R.; Stinis, P.; Tartakovsky, A. M.

We present a novel approach for solving steady-state stochastic partial differential equations (PDEs) with high-dimensional random parameter space. The proposed approach combines spatial domain decomposition with basis adaptation for each subdomain. The basis adaptation is used to address the curse of dimensionality by constructing an accurate low-dimensional representation of the stochastic PDE solution (probability density function and/or its leading statistical moments) in each subdomain. Restricting the basis adaptation to a specific subdomain affords finding a locally accurate solution. Then, the solutions from all of the subdomains are stitched together to provide a global solution. We support our construction with numericalmore » experiments for a steady-state diffusion equation with a random spatially dependent coefficient. Our results show that highly accurate global solutions can be obtained with significantly reduced computational costs.« less
A time domain frequency-selective multivariate Granger causality approach.

PubMed

Leistritz, Lutz; Witte, Herbert

2016-08-01

The investigation of effective connectivity is one of the major topics in computational neuroscience to understand the interaction between spatially distributed neuronal units of the brain. Thus, a wide variety of methods has been developed during the last decades to investigate functional and effective connectivity in multivariate systems. Their spectrum ranges from model-based to model-free approaches with a clear separation into time and frequency range methods. We present in this simulation study a novel time domain approach based on Granger's principle of predictability, which allows frequency-selective considerations of directed interactions. It is based on a comparison of prediction errors of multivariate autoregressive models fitted to systematically modified time series. These modifications are based on signal decompositions, which enable a targeted cancellation of specific signal components with specific spectral properties. Depending on the embedded signal decomposition method, a frequency-selective or data-driven signal-adaptive Granger Causality Index may be derived.
Coupling lattice Boltzmann and continuum equations for flow and reactive transport in porous media.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Coon, Ethan; Porter, Mark L.; Kang, Qinjun

2012-06-18

In spatially and temporally localized instances, capturing sub-reservoir scale information is necessary. Capturing sub-reservoir scale information everywhere is neither necessary, nor computationally possible. The lattice Boltzmann Method for solving pore-scale systems. At the pore-scale, LBM provides an extremely scalable, efficient way of solving Navier-Stokes equations on complex geometries. Coupling pore-scale and continuum scale systems via domain decomposition. By leveraging the interpolations implied by pore-scale and continuum scale discretizations, overlapping Schwartz domain decomposition is used to ensure continuity of pressure and flux. This approach is demonstrated on a fractured medium, in which Navier-Stokes equations are solved within the fracture while Darcy'smore » equation is solved away from the fracture Coupling reactive transport to pore-scale flow simulators allows hybrid approaches to be extended to solve multi-scale reactive transport.« less

Message-passing-interface-based parallel FDTD investigation on the EM scattering from a 1-D rough sea surface using uniaxial perfectly matched layer absorbing boundary.

PubMed

Li, J; Guo, L-X; Zeng, H; Han, X-B

2009-06-01

A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
A parallel graded-mesh FDTD algorithm for human-antenna interaction problems.

PubMed

Catarinucci, Luca; Tarricone, Luciano

2009-01-01

The finite difference time domain method (FDTD) is frequently used for the numerical solution of a wide variety of electromagnetic (EM) problems and, among them, those concerning human exposure to EM fields. In many practical cases related to the assessment of occupational EM exposure, large simulation domains are modeled and high space resolution adopted, so that strong memory and central processing unit power requirements have to be satisfied. To better afford the computational effort, the use of parallel computing is a winning approach; alternatively, subgridding techniques are often implemented. However, the simultaneous use of subgridding schemes and parallel algorithms is very new. In this paper, an easy-to-implement and highly-efficient parallel graded-mesh (GM) FDTD scheme is proposed and applied to human-antenna interaction problems, demonstrating its appropriateness in dealing with complex occupational tasks and showing its capability to guarantee the advantages of a traditional subgridding technique without affecting the parallel FDTD performance.
Regional-scale calculation of the LS factor using parallel processing

NASA Astrophysics Data System (ADS)

Liu, Kai; Tang, Guoan; Jiang, Ling; Zhu, A.-Xing; Yang, Jianyi; Song, Xiaodong

2015-05-01

With the increase of data resolution and the increasing application of USLE over large areas, the existing serial implementation of algorithms for computing the LS factor is becoming a bottleneck. In this paper, a parallel processing model based on message passing interface (MPI) is presented for the calculation of the LS factor, so that massive datasets at a regional scale can be processed efficiently. The parallel model contains algorithms for calculating flow direction, flow accumulation, drainage network, slope, slope length and the LS factor. According to the existence of data dependence, the algorithms are divided into local algorithms and global algorithms. Parallel strategy are designed according to the algorithm characters including the decomposition method for maintaining the integrity of the results, optimized workflow for reducing the time taken for exporting the unnecessary intermediate data and a buffer-communication-computation strategy for improving the communication efficiency. Experiments on a multi-node system show that the proposed parallel model allows efficient calculation of the LS factor at a regional scale with a massive dataset.
Parallel Reconstruction Using Null Operations (PRUNO)

PubMed Central

Zhang, Jian; Liu, Chunlei; Moseley, Michael E.

2011-01-01

A novel iterative k-space data-driven technique, namely Parallel Reconstruction Using Null Operations (PRUNO), is presented for parallel imaging reconstruction. In PRUNO, both data calibration and image reconstruction are formulated into linear algebra problems based on a generalized system model. An optimal data calibration strategy is demonstrated by using Singular Value Decomposition (SVD). And an iterative conjugate- gradient approach is proposed to efficiently solve missing k-space samples during reconstruction. With its generalized formulation and precise mathematical model, PRUNO reconstruction yields good accuracy, flexibility, stability. Both computer simulation and in vivo studies have shown that PRUNO produces much better reconstruction quality than autocalibrating partially parallel acquisition (GRAPPA), especially under high accelerating rates. With the aid of PRUO reconstruction, ultra high accelerating parallel imaging can be performed with decent image quality. For example, we have done successful PRUNO reconstruction at a reduction factor of 6 (effective factor of 4.44) with 8 coils and only a few autocalibration signal (ACS) lines. PMID:21604290
Long-term litter decomposition controlled by manganese redox cycling

DOE PAGES

Keiluweit, Marco; Nico, Peter S.; Harmon, Mark; ...

2015-09-08

Litter decomposition is a keystone ecosystem process impacting nutrient cycling and productivity, soil properties, and the terrestrial carbon (C) balance, but the factors regulating decomposition rate are still poorly understood. Traditional models assume that the rate is controlled by litter quality, relying on parameters such as lignin content as predictors. However, a strong correlation has been observed between the manganese (Mn) content of litter and decomposition rates across a variety of forest ecosystems. Here, we show that long-term litter decomposition in forest ecosystems is tightly coupled to Mn redox cycling. Over 7 years of litter decomposition, microbial transformation of littermore » was paralleled by variations in Mn oxidation state and concentration. A detailed chemical imaging analysis of the litter revealed that fungi recruit and redistribute unreactive Mn 2+ provided by fresh plant litter to produce oxidative Mn 3+ species at sites of active decay, with Mn eventually accumulating as insoluble Mn 3+/4+ oxides. Formation of reactive Mn 3+ species coincided with the generation of aromatic oxidation products, providing direct proof of the previously posited role of Mn 3+-based oxidizers in the breakdown of litter. Our results suggest that the litter-decomposing machinery at our coniferous forest site depends on the ability of plants and microbes to supply, accumulate, and regenerate short-lived Mn 3+ species in the litter layer. As a result, this observation indicates that biogeochemical constraints on bioavailability, mobility, and reactivity of Mn in the plant–soil system may have a profound impact on litter decomposition rates.« less
Defect inspection using a time-domain mode decomposition technique

NASA Astrophysics Data System (ADS)

Zhu, Jinlong; Goddard, Lynford L.

2018-03-01

In this paper, we propose a technique called time-varying frequency scanning (TVFS) to meet the challenges in killer defect inspection. The proposed technique enables the dynamic monitoring of defects by checking the hopping in the instantaneous frequency data and the classification of defect types by comparing the difference in frequencies. The TVFS technique utilizes the bidimensional empirical mode decomposition (BEMD) method to separate the defect information from the sea of system errors. This significantly improve the signal-to-noise ratio (SNR) and moreover, it potentially enables reference-free defect inspection.
Parallel algorithms for mapping pipelined and parallel computations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1988-01-01

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Kinematics of reflections in subsurface offset and angle-domain image gathers

NASA Astrophysics Data System (ADS)

Dafni, Raanan; Symes, William W.

2018-05-01

Seismic migration in the angle-domain generates multiple images of the earth's interior in which reflection takes place at different scattering-angles. Mechanically, the angle-dependent reflection is restricted to happen instantaneously and at a fixed point in space: Incident wave hits a discontinuity in the subsurface media and instantly generates a scattered wave at the same common point of interaction. Alternatively, the angle-domain image may be associated with space-shift (regarded as subsurface offset) extended migration that artificially splits the reflection geometry. Meaning that, incident and scattered waves interact at some offset distance. The geometric differences between the two approaches amount to a contradictory angle-domain behaviour, and unlike kinematic description. We present a phase space depiction of migration methods extended by the peculiar subsurface offset split and stress its profound dissimilarity. In spite of being in radical contradiction with the general physics, the subsurface offset reveals a link to some valuable angle-domain quantities, via post-migration transformations. The angle quantities are indicated by the direction normal to the subsurface offset extended image. They specifically define the local dip and scattering angles if the velocity at the split reflection coordinates is the same for incident and scattered wave pairs. Otherwise, the reflector normal is not a bisector of the opening angle, but of the corresponding slowness vectors. This evidence, together with the distinguished geometry configuration, fundamentally differentiates the angle-domain decomposition based on the subsurface offset split from the conventional decomposition at a common reflection point. An asymptotic simulation of angle-domain moveout curves in layered media exposes the notion of split versus common reflection point geometry. Traveltime inversion methods that involve the subsurface offset extended migration must accommodate the split geometry in the inversion scheme for a robust and successful convergence at the optimal velocity model.
Type synthesis for 4-DOF parallel press mechanism using GF set theory

NASA Astrophysics Data System (ADS)

He, Jun; Gao, Feng; Meng, Xiangdun; Guo, Weizhong

2015-07-01

Parallel mechanisms is used in the large capacity servo press to avoid the over-constraint of the traditional redundant actuation. Currently, the researches mainly focus on the performance analysis for some specific parallel press mechanisms. However, the type synthesis and evaluation of parallel press mechanisms is seldom studied, especially for the four degrees of freedom(DOF) press mechanisms. The type synthesis of 4-DOF parallel press mechanisms is carried out based on the generalized function(GF) set theory. Five design criteria of 4-DOF parallel press mechanisms are firstly proposed. The general procedure of type synthesis of parallel press mechanisms is obtained, which includes number synthesis, symmetrical synthesis of constraint GF sets, decomposition of motion GF sets and design of limbs. Nine combinations of constraint GF sets of 4-DOF parallel press mechanisms, ten combinations of GF sets of active limbs, and eleven combinations of GF sets of passive limbs are synthesized. Thirty-eight kinds of press mechanisms are presented and then different structures of kinematic limbs are designed. Finally, the geometrical constraint complexity( GCC), kinematic pair complexity( KPC), and type complexity( TC) are proposed to evaluate the press types and the optimal press type is achieved. The general methodologies of type synthesis and evaluation for parallel press mechanism are suggested.
Wakefield Computations for the CLIC PETS using the Parallel Finite Element Time-Domain Code T3P

DOE Office of Scientific and Technical Information (OSTI.GOV)

Candel, A; Kabel, A.; Lee, L.

In recent years, SLAC's Advanced Computations Department (ACD) has developed the high-performance parallel 3D electromagnetic time-domain code, T3P, for simulations of wakefields and transients in complex accelerator structures. T3P is based on advanced higher-order Finite Element methods on unstructured grids with quadratic surface approximation. Optimized for large-scale parallel processing on leadership supercomputing facilities, T3P allows simulations of realistic 3D structures with unprecedented accuracy, aiding the design of the next generation of accelerator facilities. Applications to the Compact Linear Collider (CLIC) Power Extraction and Transfer Structure (PETS) are presented.
Iterative methods for elliptic finite element equations on general meshes

NASA Technical Reports Server (NTRS)

Nicolaides, R. A.; Choudhury, Shenaz

1986-01-01

Iterative methods for arbitrary mesh discretizations of elliptic partial differential equations are surveyed. The methods discussed are preconditioned conjugate gradients, algebraic multigrid, deflated conjugate gradients, an element-by-element techniques, and domain decomposition. Computational results are included.
RIO: a new computational framework for accurate initial data of binary black holes

NASA Astrophysics Data System (ADS)

Barreto, W.; Clemente, P. C. M.; de Oliveira, H. P.; Rodriguez-Mueller, B.

2018-06-01

We present a computational framework ( Rio) in the ADM 3+1 approach for numerical relativity. This work enables us to carry out high resolution calculations for initial data of two arbitrary black holes. We use the transverse conformal treatment, the Bowen-York and the puncture methods. For the numerical solution of the Hamiltonian constraint we use the domain decomposition and the spectral decomposition of Galerkin-Collocation. The nonlinear numerical code solves the set of equations for the spectral modes using the standard Newton-Raphson method, LU decomposition and Gaussian quadratures. We show the convergence of the Rio code. This code allows for easy deployment of large calculations. We show how the spin of one of the black holes is manifest in the conformal factor.
Parallel deterministic neutronics with AMR in 3D

DOE Office of Scientific and Technical Information (OSTI.GOV)

Clouse, C.; Ferguson, J.; Hendrickson, C.

1997-12-31

AMTRAN, a three dimensional Sn neutronics code with adaptive mesh refinement (AMR) has been parallelized over spatial domains and energy groups and runs on the Meiko CS-2 with MPI message passing. Block refined AMR is used with linear finite element representations for the fluxes, which allows for a straight forward interpretation of fluxes at block interfaces with zoning differences. The load balancing algorithm assumes 8 spatial domains, which minimizes idle time among processors.
Recursive inverse factorization.

PubMed

Rubensson, Emanuel H; Bock, Nicolas; Holmström, Erik; Niklasson, Anders M N

2008-03-14

A recursive algorithm for the inverse factorization S(-1)=ZZ(*) of Hermitian positive definite matrices S is proposed. The inverse factorization is based on iterative refinement [A.M.N. Niklasson, Phys. Rev. B 70, 193102 (2004)] combined with a recursive decomposition of S. As the computational kernel is matrix-matrix multiplication, the algorithm can be parallelized and the computational effort increases linearly with system size for systems with sufficiently sparse matrices. Recent advances in network theory are used to find appropriate recursive decompositions. We show that optimization of the so-called network modularity results in an improved partitioning compared to other approaches. In particular, when the recursive inverse factorization is applied to overlap matrices of irregularly structured three-dimensional molecules.
Method of assessing the state of a rolling bearing based on the relative compensation distance of multiple-domain features and locally linear embedding

NASA Astrophysics Data System (ADS)

Kang, Shouqiang; Ma, Danyang; Wang, Yujing; Lan, Chaofeng; Chen, Qingguo; Mikulovich, V. I.

2017-03-01

To effectively assess different fault locations and different degrees of performance degradation of a rolling bearing with a unified assessment index, a novel state assessment method based on the relative compensation distance of multiple-domain features and locally linear embedding is proposed. First, for a single-sample signal, time-domain and frequency-domain indexes can be calculated for the original vibration signal and each sensitive intrinsic mode function obtained by improved ensemble empirical mode decomposition, and the singular values of the sensitive intrinsic mode function matrix can be extracted by singular value decomposition to construct a high-dimensional hybrid-domain feature vector. Second, a feature matrix can be constructed by arranging each feature vector of multiple samples, the dimensions of each row vector of the feature matrix can be reduced by the locally linear embedding algorithm, and the compensation distance of each fault state of the rolling bearing can be calculated using the support vector machine. Finally, the relative distance between different fault locations and different degrees of performance degradation and the normal-state optimal classification surface can be compensated, and on the basis of the proposed relative compensation distance, the assessment model can be constructed and an assessment curve drawn. Experimental results show that the proposed method can effectively assess different fault locations and different degrees of performance degradation of the rolling bearing under certain conditions.
Decomposition method for fast computation of gigapixel-sized Fresnel holograms on a graphics processing unit cluster.

PubMed

Jackin, Boaz Jessie; Watanabe, Shinpei; Ootsu, Kanemitsu; Ohkawa, Takeshi; Yokota, Takashi; Hayasaki, Yoshio; Yatagai, Toyohiko; Baba, Takanobu

2018-04-20

A parallel computation method for large-size Fresnel computer-generated hologram (CGH) is reported. The method was introduced by us in an earlier report as a technique for calculating Fourier CGH from 2D object data. In this paper we extend the method to compute Fresnel CGH from 3D object data. The scale of the computation problem is also expanded to 2 gigapixels, making it closer to real application requirements. The significant feature of the reported method is its ability to avoid communication overhead and thereby fully utilize the computing power of parallel devices. The method exhibits three layers of parallelism that favor small to large scale parallel computing machines. Simulation and optical experiments were conducted to demonstrate the workability and to evaluate the efficiency of the proposed technique. A two-times improvement in computation speed has been achieved compared to the conventional method, on a 16-node cluster (one GPU per node) utilizing only one layer of parallelism. A 20-times improvement in computation speed has been estimated utilizing two layers of parallelism on a very large-scale parallel machine with 16 nodes, where each node has 16 GPUs.
Modal analysis of 2-D sedimentary basin from frequency domain decomposition of ambient vibration array recordings

NASA Astrophysics Data System (ADS)

Poggi, Valerio; Ermert, Laura; Burjanek, Jan; Michel, Clotaire; Fäh, Donat

2015-01-01

Frequency domain decomposition (FDD) is a well-established spectral technique used in civil engineering to analyse and monitor the modal response of buildings and structures. The method is based on singular value decomposition of the cross-power spectral density matrix from simultaneous array recordings of ambient vibrations. This method is advantageous to retrieve not only the resonance frequencies of the investigated structure, but also the corresponding modal shapes without the need for an absolute reference. This is an important piece of information, which can be used to validate the consistency of numerical models and analytical solutions. We apply this approach using advanced signal processing to evaluate the resonance characteristics of 2-D Alpine sedimentary valleys. In this study, we present the results obtained at Martigny, in the Rhône valley (Switzerland). For the analysis, we use 2 hr of ambient vibration recordings from a linear seismic array deployed perpendicularly to the valley axis. Only the horizontal-axial direction (SH) of the ground motion is considered. Using the FDD method, six separate resonant frequencies are retrieved together with their corresponding modal shapes. We compare the mode shapes with results from classical standard spectral ratios and numerical simulations of ambient vibration recordings.
Wavelet domain textual coding of Ottoman script images

NASA Astrophysics Data System (ADS)

Gerek, Oemer N.; Cetin, Enis A.; Tewfik, Ahmed H.

1996-02-01

Image coding using wavelet transform, DCT, and similar transform techniques is well established. On the other hand, these coding methods neither take into account the special characteristics of the images in a database nor are they suitable for fast database search. In this paper, the digital archiving of Ottoman printings is considered. Ottoman documents are printed in Arabic letters. Witten et al. describes a scheme based on finding the characters in binary document images and encoding the positions of the repeated characters This method efficiently compresses document images and is suitable for database research, but it cannot be applied to Ottoman or Arabic documents as the concept of character is different in Ottoman or Arabic. Typically, one has to deal with compound structures consisting of a group of letters. Therefore, the matching criterion will be according to those compound structures. Furthermore, the text images are gray tone or color images for Ottoman scripts for the reasons that are described in the paper. In our method the compound structure matching is carried out in wavelet domain which reduces the search space and increases the compression ratio. In addition to the wavelet transformation which corresponds to the linear subband decomposition, we also used nonlinear subband decomposition. The filters in the nonlinear subband decomposition have the property of preserving edges in the low resolution subband image.
Parallel processing approach to transform-based image coding

NASA Astrophysics Data System (ADS)

Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.

1991-06-01

This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links, each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, results ar presented with image statistics and data rates. Finally techniques for accelerating the system throughput are analyzed and results from the application of one such modification described.
Using the Binary Phase-Field Crystal Model to Describe Non-Classical Nucleation Pathways in Gold Nanoparticles

NASA Astrophysics Data System (ADS)

Smith, Nathan; Provatas, Nikolas

Recent experimental work has shown that gold nanoparticles can precipitate from an aqueous solution through a non-classical, multi-step nucleation process. This multi-step process begins with spinodal decomposition into solute-rich and solute-poor liquid domains followed by nucleation from within the solute-rich domains. We present a binary phase-field crystal theory that shows the same phenomology and examine various cross-over regimes in the growth and coarsening of liquid and solid domains. We'd like to the thank Canada Research Chairs (CRC) program for funding this work.

Insight into the novel inhibition mechanism of apigenin to Pneumolysin by molecular modeling

NASA Astrophysics Data System (ADS)

Niu, Xiaodi; Yang, Yanan; Song, Meng; Wang, Guizhen; Sun, Lin; Gao, Yawen; Wang, Hongsu

2017-11-01

In this study, the mechanism of apigenin inhibition was explored using molecular modelling, binding energy calculation, and mutagenesis assays. Energy decomposition analysis indicated that apigenin binds in the gap between domains 3 and 4 of PLY. Using principal component analysis, we found that binding of apigenin to PLY weakens the motion of domains 3 and 4. Consequently, these domains cannot complete the transition from monomer to oligomer, thereby blocking oligomerisation of PLY and counteracting its haemolytic activity. This inhibitory mechanism was confirmed by haemolysis assays, and these findings will promote the future development of an antimicrobial agent.
Wake acoustic analysis and image decomposition via beamforming of microphone signal projections on wavelet subspaces

DOT National Transportation Integrated Search

2006-05-08

This paper describes the integration of wavelet analysis and time-domain beamforming : of microphone array output signals for analyzing the acoustic emissions from airplane : generated wake vortices. This integrated process provides visual and quanti...
An accurate, fast, and scalable solver for high-frequency wave propagation

NASA Astrophysics Data System (ADS)

Zepeda-Núñez, L.; Taus, M.; Hewett, R.; Demanet, L.

2017-12-01

In many science and engineering applications, solving time-harmonic high-frequency wave propagation problems quickly and accurately is of paramount importance. For example, in geophysics, particularly in oil exploration, such problems can be the forward problem in an iterative process for solving the inverse problem of subsurface inversion. It is important to solve these wave propagation problems accurately in order to efficiently obtain meaningful solutions of the inverse problems: low order forward modeling can hinder convergence. Additionally, due to the volume of data and the iterative nature of most optimization algorithms, the forward problem must be solved many times. Therefore, a fast solver is necessary to make solving the inverse problem feasible. For time-harmonic high-frequency wave propagation, obtaining both speed and accuracy is historically challenging. Recently, there have been many advances in the development of fast solvers for such problems, including methods which have linear complexity with respect to the number of degrees of freedom. While most methods scale optimally only in the context of low-order discretizations and smooth wave speed distributions, the method of polarized traces has been shown to retain optimal scaling for high-order discretizations, such as hybridizable discontinuous Galerkin methods and for highly heterogeneous (and even discontinuous) wave speeds. The resulting fast and accurate solver is consequently highly attractive for geophysical applications. To date, this method relies on a layered domain decomposition together with a preconditioner applied in a sweeping fashion, which has limited straight-forward parallelization. In this work, we introduce a new version of the method of polarized traces which reveals more parallel structure than previous versions while preserving all of its other advantages. We achieve this by further decomposing each layer and applying the preconditioner to these new components separately and in parallel. We demonstrate that this produces an even more effective and parallelizable preconditioner for a single right-hand side. As before, additional speed can be gained by pipelining several right-hand-sides.
3D CSEM inversion based on goal-oriented adaptive finite element method

NASA Astrophysics Data System (ADS)

Zhang, Y.; Key, K.

2016-12-01

We present a parallel 3D frequency domain controlled-source electromagnetic inversion code name MARE3DEM. Non-linear inversion of observed data is performed with the Occam variant of regularized Gauss-Newton optimization. The forward operator is based on the goal-oriented finite element method that efficiently calculates the responses and sensitivity kernels in parallel using a data decomposition scheme where independent modeling tasks contain different frequencies and subsets of the transmitters and receivers. To accommodate complex 3D conductivity variation with high flexibility and precision, we adopt the dual-grid approach where the forward mesh conforms to the inversion parameter grid and is adaptively refined until the forward solution converges to the desired accuracy. This dual-grid approach is memory efficient, since the inverse parameter grid remains independent from fine meshing generated around the transmitter and receivers by the adaptive finite element method. Besides, the unstructured inverse mesh efficiently handles multiple scale structures and allows for fine-scale model parameters within the region of interest. Our mesh generation engine keeps track of the refinement hierarchy so that the map of conductivity and sensitivity kernel between the forward and inverse mesh is retained. We employ the adjoint-reciprocity method to calculate the sensitivity kernels which establish a linear relationship between changes in the conductivity model and changes in the modeled responses. Our code uses a direcy solver for the linear systems, so the adjoint problem is efficiently computed by re-using the factorization from the primary problem. Further computational efficiency and scalability is obtained in the regularized Gauss-Newton portion of the inversion using parallel dense matrix-matrix multiplication and matrix factorization routines implemented with the ScaLAPACK library. We show the scalability, reliability and the potential of the algorithm to deal with complex geological scenarios by applying it to the inversion of synthetic marine controlled source EM data generated for a complex 3D offshore model with significant seafloor topography.
Final Report - High-Order Spectral Volume Method for the Navier-Stokes Equations On Unstructured Tetrahedral Grids

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Z J

2012-12-06

The overriding objective for this project is to develop an efficient and accurate method for capturing strong discontinuities and fine smooth flow structures of disparate length scales with unstructured grids, and demonstrate its potentials for problems relevant to DOE. More specifically, we plan to achieve the following objectives: 1. Extend the SV method to three dimensions, and develop a fourth-order accurate SV scheme for tetrahedral grids. Optimize the SV partition by minimizing a form of the Lebesgue constant. Verify the order of accuracy using the scalar conservation laws with an analytical solution; 2. Extend the SV method to Navier-Stokes equationsmore » for the simulation of viscous flow problems. Two promising approaches to compute the viscous fluxes will be tested and analyzed; 3. Parallelize the 3D viscous SV flow solver using domain decomposition and message passing. Optimize the cache performance of the flow solver by designing data structures minimizing data access times; 4. Demonstrate the SV method with a wide range of flow problems including both discontinuities and complex smooth structures. The objectives remain the same as those outlines in the original proposal. We anticipate no technical obstacles in meeting these objectives.« less
Advancing parabolic operators in thermodynamic MHD models: Explicit super time-stepping versus implicit schemes with Krylov solvers

NASA Astrophysics Data System (ADS)

Caplan, R. M.; Mikić, Z.; Linker, J. A.; Lionello, R.

2017-05-01

We explore the performance and advantages/disadvantages of using unconditionally stable explicit super time-stepping (STS) algorithms versus implicit schemes with Krylov solvers for integrating parabolic operators in thermodynamic MHD models of the solar corona. Specifically, we compare the second-order Runge-Kutta Legendre (RKL2) STS method with the implicit backward Euler scheme computed using the preconditioned conjugate gradient (PCG) solver with both a point-Jacobi and a non-overlapping domain decomposition ILU0 preconditioner. The algorithms are used to integrate anisotropic Spitzer thermal conduction and artificial kinematic viscosity at time-steps much larger than classic explicit stability criteria allow. A key component of the comparison is the use of an established MHD model (MAS) to compute a real-world simulation on a large HPC cluster. Special attention is placed on the parallel scaling of the algorithms. It is shown that, for a specific problem and model, the RKL2 method is comparable or surpasses the implicit method with PCG solvers in performance and scaling, but suffers from some accuracy limitations. These limitations, and the applicability of RKL methods are briefly discussed.
The high performance parallel algorithm for Unified Gas-Kinetic Scheme

NASA Astrophysics Data System (ADS)

Li, Shiyi; Li, Qibing; Fu, Song; Xu, Jinxiu

2016-11-01

A high performance parallel algorithm for UGKS is developed to simulate three-dimensional flows internal and external on arbitrary grid system. The physical domain and velocity domain are divided into different blocks and distributed according to the two-dimensional Cartesian topology with intra-communicators in physical domain for data exchange and other intra-communicators in velocity domain for sum reduction to moment integrals. Numerical results of three-dimensional cavity flow and flow past a sphere agree well with the results from the existing studies and validate the applicability of the algorithm. The scalability of the algorithm is tested both on small (1-16) and large (729-5832) scale processors. The tested speed-up ratio is near linear ashind thus the efficiency is around 1, which reveals the good scalability of the present algorithm.
Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.« less
Work stealing for GPU-accelerated parallel programs in a global address space framework

DOE Office of Scientific and Technical Information (OSTI.GOV)

Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a functionmore » of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain« less
New bounding and decomposition approaches for MILP investment problems: Multi-area transmission and generation planning under policy constraints

DOE PAGES

Munoz, F. D.; Hobbs, B. F.; Watson, J. -P.

2016-02-01

A novel two-phase bounding and decomposition approach to compute optimal and near-optimal solutions to large-scale mixed-integer investment planning problems is proposed and it considers a large number of operating subproblems, each of which is a convex optimization. Our motivating application is the planning of power transmission and generation in which policy constraints are designed to incentivize high amounts of intermittent generation in electric power systems. The bounding phase exploits Jensen’s inequality to define a lower bound, which we extend to stochastic programs that use expected-value constraints to enforce policy objectives. The decomposition phase, in which the bounds are tightened, improvesmore » upon the standard Benders’ algorithm by accelerating the convergence of the bounds. The lower bound is tightened by using a Jensen’s inequality-based approach to introduce an auxiliary lower bound into the Benders master problem. Upper bounds for both phases are computed using a sub-sampling approach executed on a parallel computer system. Numerical results show that only the bounding phase is necessary if loose optimality gaps are acceptable. But, the decomposition phase is required to attain optimality gaps. Moreover, use of both phases performs better, in terms of convergence speed, than attempting to solve the problem using just the bounding phase or regular Benders decomposition separately.« less
New bounding and decomposition approaches for MILP investment problems: Multi-area transmission and generation planning under policy constraints

DOE Office of Scientific and Technical Information (OSTI.GOV)

Munoz, F. D.; Hobbs, B. F.; Watson, J. -P.

A novel two-phase bounding and decomposition approach to compute optimal and near-optimal solutions to large-scale mixed-integer investment planning problems is proposed and it considers a large number of operating subproblems, each of which is a convex optimization. Our motivating application is the planning of power transmission and generation in which policy constraints are designed to incentivize high amounts of intermittent generation in electric power systems. The bounding phase exploits Jensen’s inequality to define a lower bound, which we extend to stochastic programs that use expected-value constraints to enforce policy objectives. The decomposition phase, in which the bounds are tightened, improvesmore » upon the standard Benders’ algorithm by accelerating the convergence of the bounds. The lower bound is tightened by using a Jensen’s inequality-based approach to introduce an auxiliary lower bound into the Benders master problem. Upper bounds for both phases are computed using a sub-sampling approach executed on a parallel computer system. Numerical results show that only the bounding phase is necessary if loose optimality gaps are acceptable. But, the decomposition phase is required to attain optimality gaps. Moreover, use of both phases performs better, in terms of convergence speed, than attempting to solve the problem using just the bounding phase or regular Benders decomposition separately.« less
A linear decomposition method for large optimization problems. Blueprint for development

NASA Technical Reports Server (NTRS)

Sobieszczanski-Sobieski, J.

1982-01-01

A method is proposed for decomposing large optimization problems encountered in the design of engineering systems such as an aircraft into a number of smaller subproblems. The decomposition is achieved by organizing the problem and the subordinated subproblems in a tree hierarchy and optimizing each subsystem separately. Coupling of the subproblems is accounted for by subsequent optimization of the entire system based on sensitivities of the suboptimization problem solutions at each level of the tree to variables of the next higher level. A formalization of the procedure suitable for computer implementation is developed and the state of readiness of the implementation building blocks is reviewed showing that the ingredients for the development are on the shelf. The decomposition method is also shown to be compatible with the natural human organization of the design process of engineering systems. The method is also examined with respect to the trends in computer hardware and software progress to point out that its efficiency can be amplified by network computing using parallel processors.
Bistatic scattering from a three-dimensional object above a two-dimensional randomly rough surface modeled with the parallel FDTD approach.

PubMed

Guo, L-X; Li, J; Zeng, H

2009-11-01

We present an investigation of the electromagnetic scattering from a three-dimensional (3-D) object above a two-dimensional (2-D) randomly rough surface. A Message Passing Interface-based parallel finite-difference time-domain (FDTD) approach is used, and the uniaxial perfectly matched layer (UPML) medium is adopted for truncation of the FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different number of processors is illustrated for one rough surface realization and shows that the computation time of our parallel FDTD algorithm is dramatically reduced relative to a single-processor implementation. Finally, the composite scattering coefficients versus scattered and azimuthal angle are presented and analyzed for different conditions, including the surface roughness, the dielectric constants, the polarization, and the size of the 3-D object.
Partial information decomposition as a spatiotemporal filter.

PubMed

Flecker, Benjamin; Alford, Wesley; Beggs, John M; Williams, Paul L; Beer, Randall D

2011-09-01

Understanding the mechanisms of distributed computation in cellular automata requires techniques for characterizing the emergent structures that underlie information processing in such systems. Recently, techniques from information theory have been brought to bear on this problem. Building on this work, we utilize the new technique of partial information decomposition to show that previous information-theoretic measures can confound distinct sources of information. We then propose a new set of filters and demonstrate that they more cleanly separate out the background domains, particles, and collisions that are typically associated with information storage, transfer, and modification in cellular automata.
Concurrent Collections (CnC): A new approach to parallel programming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Knobe, Kathleen

2010-05-07

A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicitymore » or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker. Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipment’s Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and Services Group / Technology Pathfinding and Innovation.« less
Concurrent Collections (CnC): A new approach to parallel programming

ScienceCinema

Knobe, Kathleen

2018-04-16

A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer or program) who demands maximal flexibility for optimizing the mapping to a wide range of target platforms (parallel / serial, shared / distributed, homogeneous / heterogeneous, etc.) but needs no knowledge of the domain. Concurrent Collections (CnC) is based on this separation of concerns. The talk will present CnC and its benefits. About the speaker. Kathleen Knobe has focused throughout her career on parallelism especially compiler technology, runtime system design and language design. She worked at Compass (aka Massachusetts Computer Associates) from 1980 to 1991 designing compilers for a wide range of parallel platforms for Thinking Machines, MasPar, Alliant, Numerix, and several government projects. In 1991 she decided to finish her education. After graduating from MIT in 1997, she joined Digital Equipmentâs Cambridge Research Lab (CRL). She stayed through the DEC/Compaq/HP mergers and when CRL was acquired and absorbed by Intel. She currently works in the Software and Services Group / Technology Pathfinding and Innovation.
Multiscale Methods, Parallel Computation, and Neural Networks for Real-Time Computer Vision.

NASA Astrophysics Data System (ADS)

Battiti, Roberto

1990-01-01

This thesis presents new algorithms for low and intermediate level computer vision. The guiding ideas in the presented approach are those of hierarchical and adaptive processing, concurrent computation, and supervised learning. Processing of the visual data at different resolutions is used not only to reduce the amount of computation necessary to reach the fixed point, but also to produce a more accurate estimation of the desired parameters. The presented adaptive multiple scale technique is applied to the problem of motion field estimation. Different parts of the image are analyzed at a resolution that is chosen in order to minimize the error in the coefficients of the differential equations to be solved. Tests with video-acquired images show that velocity estimation is more accurate over a wide range of motion with respect to the homogeneous scheme. In some cases introduction of explicit discontinuities coupled to the continuous variables can be used to avoid propagation of visual information from areas corresponding to objects with different physical and/or kinematic properties. The human visual system uses concurrent computation in order to process the vast amount of visual data in "real -time." Although with different technological constraints, parallel computation can be used efficiently for computer vision. All the presented algorithms have been implemented on medium grain distributed memory multicomputers with a speed-up approximately proportional to the number of processors used. A simple two-dimensional domain decomposition assigns regions of the multiresolution pyramid to the different processors. The inter-processor communication needed during the solution process is proportional to the linear dimension of the assigned domain, so that efficiency is close to 100% if a large region is assigned to each processor. Finally, learning algorithms are shown to be a viable technique to engineer computer vision systems for different applications starting from multiple-purpose modules. In the last part of the thesis a well known optimization method (the Broyden-Fletcher-Goldfarb-Shanno memoryless quasi -Newton method) is applied to simple classification problems and shown to be superior to the "error back-propagation" algorithm for numerical stability, automatic selection of parameters, and convergence properties.
Optimal message log reclamation for independent checkpointing

NASA Technical Reports Server (NTRS)

Wang, Yi-Min; Fuchs, W. Kent

1993-01-01

Independent (uncoordinated) check pointing for parallel and distributed systems allows maximum process autonomy but suffers from possible domino effects and the associated storage space overhead for maintaining multiple checkpoints and message logs. In most research on check pointing and recovery, it was assumed that only the checkpoints and message logs older than the global recovery line can be discarded. It is shown how recovery line transformation and decomposition can be applied to the problem of efficiently identifying all discardable message logs, thereby achieving optimal garbage collection. Communication trace-driven simulation for several parallel programs is used to show the benefits of the proposed algorithm for message log reclamation.
Domain Hierarchy and closed Loops (DHcL): a server for exploring hierarchy of protein domain structure

PubMed Central

Koczyk, Grzegorz; Berezovsky, Igor N.

2008-01-01

Domain hierarchy and closed loops (DHcL) (http://sitron.bccs.uib.no/dhcl/) is a web server that delineates energy hierarchy of protein domain structure and detects domains at different levels of this hierarchy. The server also identifies closed loops and van der Waals locks, which constitute a structural basis for the protein domain hierarchy. The DHcL can be a useful tool for an express analysis of protein structures and their alternative domain decompositions. The user submits a PDB identifier(s) or uploads a 3D protein structure in a PDB format. The results of the analysis are the location of domains at different levels of hierarchy, closed loops, van der Waals locks and their interactive visualization. The server maintains a regularly updated database of domains, closed loop and van der Waals locks for all X-ray structures in PDB. DHcL server is available at: http://sitron.bccs.uib.no/dhcl. PMID:18502776
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations

PubMed Central

Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida

2014-01-01

Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel e ciency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911

Some links on this page may take you to non-federal websites. Their policies may differ from this site.