Sample records for quality parallel preconditioners

  1. Element-topology-independent preconditioners for parallel finite element computations

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alexander, Scott

    1992-01-01

    A family of preconditioners for the solution of finite element equations are presented, which are element-topology independent and thus can be applicable to element order-free parallel computations. A key feature of the present preconditioners is the repeated use of element connectivity matrices and their left and right inverses. The properties and performance of the present preconditioners are demonstrated via beam and two-dimensional finite element matrices for implicit time integration computations.

  2. FaCSI: A block parallel preconditioner for fluid-structure interaction in hemodynamics

    NASA Astrophysics Data System (ADS)

    Deparis, Simone; Forti, Davide; Grandperrin, Gwenol; Quarteroni, Alfio

    2016-12-01

    Modeling Fluid-Structure Interaction (FSI) in the vascular system is mandatory to reliably compute mechanical indicators in vessels undergoing large deformations. In order to cope with the computational complexity of the coupled 3D FSI problem after discretizations in space and time, a parallel solution is often mandatory. In this paper we propose a new block parallel preconditioner for the coupled linearized FSI system obtained after space and time discretization. We name it FaCSI to indicate that it exploits the Factorized form of the linearized FSI matrix, the use of static Condensation to formally eliminate the interface degrees of freedom of the fluid equations, and the use of a SIMPLE preconditioner for saddle-point problems. FaCSI is built upon a block Gauss-Seidel factorization of the FSI Jacobian matrix and it uses ad-hoc preconditioners for each physical component of the coupled problem, namely the fluid, the structure and the geometry. In the fluid subproblem, after operating static condensation of the interface fluid variables, we use a SIMPLE preconditioner on the reduced fluid matrix. Moreover, to efficiently deal with a large number of processes, FaCSI exploits efficient single field preconditioners, e.g., based on domain decomposition or the multigrid method. We measure the parallel performances of FaCSI on a benchmark cylindrical geometry and on a problem of physiological interest, namely the blood flow through a patient-specific femoropopliteal bypass. We analyze the dependence of the number of linear solver iterations on the cores count (scalability of the preconditioner) and on the mesh size (optimality).

  3. Performance of a parallel algebraic multilevel preconditioner for stabilized finite element semiconductor device modeling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, Paul T.; Shadid, John N.; Sala, Marzio

    In this study results are presented for the large-scale parallel performance of an algebraic multilevel preconditioner for solution of the drift-diffusion model for semiconductor devices. The preconditioner is the key numerical procedure determining the robustness, efficiency and scalability of the fully-coupled Newton-Krylov based, nonlinear solution method that is employed for this system of equations. The coupled system is comprised of a source term dominated Poisson equation for the electric potential, and two convection-diffusion-reaction type equations for the electron and hole concentration. The governing PDEs are discretized in space by a stabilized finite element method. Solution of the discrete system ismore » obtained through a fully-implicit time integrator, a fully-coupled Newton-based nonlinear solver, and a restarted GMRES Krylov linear system solver. The algebraic multilevel preconditioner is based on an aggressive coarsening graph partitioning of the nonzero block structure of the Jacobian matrix. Representative performance results are presented for various choices of multigrid V-cycles and W-cycles and parameter variations for smoothers based on incomplete factorizations. Parallel scalability results are presented for solution of up to 10{sup 8} unknowns on 4096 processors of a Cray XT3/4 and an IBM POWER eServer system.« less

  4. A taxonomy and comparison of parallel block multi-level preconditioners for the incompressible Navier-Stokes equations.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shadid, John Nicolas; Elman, Howard; Shuttleworth, Robert R.

    2007-04-01

    In recent years, considerable effort has been placed on developing efficient and robust solution algorithms for the incompressible Navier-Stokes equations based on preconditioned Krylov methods. These include physics-based methods, such as SIMPLE, and purely algebraic preconditioners based on the approximation of the Schur complement. All these techniques can be represented as approximate block factorization (ABF) type preconditioners. The goal is to decompose the application of the preconditioner into simplified sub-systems in which scalable multi-level type solvers can be applied. In this paper we develop a taxonomy of these ideas based on an adaptation of a generalized approximate factorization of themore » Navier-Stokes system first presented in [25]. This taxonomy illuminates the similarities and differences among these preconditioners and the central role played by efficient approximation of certain Schur complement operators. We then present a parallel computational study that examines the performance of these methods and compares them to an additive Schwarz domain decomposition (DD) algorithm. Results are presented for two and three-dimensional steady state problems for enclosed domains and inflow/outflow systems on both structured and unstructured meshes. The numerical experiments are performed using MPSalsa, a stabilized finite element code.« less

  5. A Performance Comparison of the Parallel Preconditioners for Iterative Methods for Large Sparse Linear Systems Arising from Partial Differential Equations on Structured Grids

    NASA Astrophysics Data System (ADS)

    Ma, Sangback

    In this paper we compare various parallel preconditioners such as Point-SSOR (Symmetric Successive OverRelaxation), ILU(0) (Incomplete LU) in the Wavefront ordering, ILU(0) in the Multi-color ordering, Multi-Color Block SOR (Successive OverRelaxation), SPAI (SParse Approximate Inverse) and pARMS (Parallel Algebraic Recursive Multilevel Solver) for solving large sparse linear systems arising from two-dimensional PDE (Partial Differential Equation)s on structured grids. Point-SSOR is well-known, and ILU(0) is one of the most popular preconditioner, but it is inherently serial. ILU(0) in the Wavefront ordering maximizes the parallelism in the natural order, but the lengths of the wave-fronts are often nonuniform. ILU(0) in the Multi-color ordering is a simple way of achieving a parallelism of the order N, where N is the order of the matrix, but its convergence rate often deteriorates as compared to that of natural ordering. We have chosen the Multi-Color Block SOR preconditioner combined with direct sparse matrix solver, since for the Laplacian matrix the SOR method is known to have a nondeteriorating rate of convergence when used with the Multi-Color ordering. By using block version we expect to minimize the interprocessor communications. SPAI computes the sparse approximate inverse directly by least squares method. Finally, ARMS is a preconditioner recursively exploiting the concept of independent sets and pARMS is the parallel version of ARMS. Experiments were conducted for the Finite Difference and Finite Element discretizations of five two-dimensional PDEs with large meshsizes up to a million on an IBM p595 machine with distributed memory. Our matrices are real positive, i. e., their real parts of the eigenvalues are positive. We have used GMRES(m) as our outer iterative method, so that the convergence of GMRES(m) for our test matrices are mathematically guaranteed. Interprocessor communications were done using MPI (Message Passing Interface) primitives. The

  6. Meros Preconditioner Package

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2004-04-01

    Meros uses the compositional, aggregation, and overload operator capabilities of TSF to provide an object-oriented package providing segregated/block preconditioners for linear systems related to fully-coupled Navier-Stokes problems. This class of preconditioners exploits the special properties of these problems to segregate the equations and use multi-level preconditioners (through ML) on the matrix sub-blocks. Several preconditioners are provided, including the Fp and BFB preconditioners of Kay & Loghin and Silvester, Elman, Kay & Wathen. The overall performance and scalability of these preconditioners approaches that of multigrid for certain types of problems. Meros also provides more traditional pressure projection methods including SIMPLE andmore » SIMPLEC.« less

  7. Multilevel filtering elliptic preconditioners

    NASA Technical Reports Server (NTRS)

    Kuo, C. C. Jay; Chan, Tony F.; Tong, Charles

    1989-01-01

    A class of preconditioners is presented for elliptic problems built on ideas borrowed from the digital filtering theory and implemented on a multilevel grid structure. They are designed to be both rapidly convergent and highly parallelizable. The digital filtering viewpoint allows the use of filter design techniques for constructing elliptic preconditioners and also provides an alternative framework for understanding several other recently proposed multilevel preconditioners. Numerical results are presented to assess the convergence behavior of the new methods and to compare them with other preconditioners of multilevel type, including the usual multigrid method as preconditioner, the hierarchical basis method and a recent method proposed by Bramble-Pasciak-Xu.

  8. Domain decomposition preconditioners for the spectral collocation method

    NASA Technical Reports Server (NTRS)

    Quarteroni, Alfio; Sacchilandriani, Giovanni

    1988-01-01

    Several block iteration preconditioners are proposed and analyzed for the solution of elliptic problems by spectral collocation methods in a region partitioned into several rectangles. It is shown that convergence is achieved with a rate which does not depend on the polynomial degree of the spectral solution. The iterative methods here presented can be effectively implemented on multiprocessor systems due to their high degree of parallelism.

  9. Research in parallel computing

    NASA Technical Reports Server (NTRS)

    Ortega, James M.; Henderson, Charles

    1994-01-01

    This report summarizes work on parallel computations for NASA Grant NAG-1-1529 for the period 1 Jan. - 30 June 1994. Short summaries on highly parallel preconditioners, target-specific parallel reductions, and simulation of delta-cache protocols are provided.

  10. Algorithms for parallel and vector computations

    NASA Technical Reports Server (NTRS)

    Ortega, James M.

    1995-01-01

    This is a final report on work performed under NASA grant NAG-1-1112-FOP during the period March, 1990 through February 1995. Four major topics are covered: (1) solution of nonlinear poisson-type equations; (2) parallel reduced system conjugate gradient method; (3) orderings for conjugate gradient preconditioners, and (4) SOR as a preconditioner.

  11. Scalable Preconditioners for Structure Preserving Discretizations of Maxwell Equations in First Order Form

    DOE PAGES

    Phillips, Edward Geoffrey; Shadid, John N.; Cyr, Eric C.

    2018-05-01

    Here, we report multiple physical time-scales can arise in electromagnetic simulations when dissipative effects are introduced through boundary conditions, when currents follow external time-scales, and when material parameters vary spatially. In such scenarios, the time-scales of interest may be much slower than the fastest time-scales supported by the Maxwell equations, therefore making implicit time integration an efficient approach. The use of implicit temporal discretizations results in linear systems in which fast time-scales, which severely constrain the stability of an explicit method, can manifest as so-called stiff modes. This study proposes a new block preconditioner for structure preserving (also termed physicsmore » compatible) discretizations of the Maxwell equations in first order form. The intent of the preconditioner is to enable the efficient solution of multiple-time-scale Maxwell type systems. An additional benefit of the developed preconditioner is that it requires only a traditional multigrid method for its subsolves and compares well against alternative approaches that rely on specialized edge-based multigrid routines that may not be readily available. Lastly, results demonstrate parallel scalability at large electromagnetic wave CFL numbers on a variety of test problems.« less

  12. Scalable Preconditioners for Structure Preserving Discretizations of Maxwell Equations in First Order Form

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Phillips, Edward Geoffrey; Shadid, John N.; Cyr, Eric C.

    Here, we report multiple physical time-scales can arise in electromagnetic simulations when dissipative effects are introduced through boundary conditions, when currents follow external time-scales, and when material parameters vary spatially. In such scenarios, the time-scales of interest may be much slower than the fastest time-scales supported by the Maxwell equations, therefore making implicit time integration an efficient approach. The use of implicit temporal discretizations results in linear systems in which fast time-scales, which severely constrain the stability of an explicit method, can manifest as so-called stiff modes. This study proposes a new block preconditioner for structure preserving (also termed physicsmore » compatible) discretizations of the Maxwell equations in first order form. The intent of the preconditioner is to enable the efficient solution of multiple-time-scale Maxwell type systems. An additional benefit of the developed preconditioner is that it requires only a traditional multigrid method for its subsolves and compares well against alternative approaches that rely on specialized edge-based multigrid routines that may not be readily available. Lastly, results demonstrate parallel scalability at large electromagnetic wave CFL numbers on a variety of test problems.« less

  13. Low-Rank Correction Methods for Algebraic Domain Decomposition Preconditioners

    DOE PAGES

    Li, Ruipeng; Saad, Yousef

    2017-08-01

    This study presents a parallel preconditioning method for distributed sparse linear systems, based on an approximate inverse of the original matrix, that adopts a general framework of distributed sparse matrices and exploits domain decomposition (DD) and low-rank corrections. The DD approach decouples the matrix and, once inverted, a low-rank approximation is applied by exploiting the Sherman--Morrison--Woodbury formula, which yields two variants of the preconditioning methods. The low-rank expansion is computed by the Lanczos procedure with reorthogonalizations. Numerical experiments indicate that, when combined with Krylov subspace accelerators, this preconditioner can be efficient and robust for solving symmetric sparse linear systems. Comparisonsmore » with pARMS, a DD-based parallel incomplete LU (ILU) preconditioning method, are presented for solving Poisson's equation and linear elasticity problems.« less

  14. Low-Rank Correction Methods for Algebraic Domain Decomposition Preconditioners

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Ruipeng; Saad, Yousef

    This study presents a parallel preconditioning method for distributed sparse linear systems, based on an approximate inverse of the original matrix, that adopts a general framework of distributed sparse matrices and exploits domain decomposition (DD) and low-rank corrections. The DD approach decouples the matrix and, once inverted, a low-rank approximation is applied by exploiting the Sherman--Morrison--Woodbury formula, which yields two variants of the preconditioning methods. The low-rank expansion is computed by the Lanczos procedure with reorthogonalizations. Numerical experiments indicate that, when combined with Krylov subspace accelerators, this preconditioner can be efficient and robust for solving symmetric sparse linear systems. Comparisonsmore » with pARMS, a DD-based parallel incomplete LU (ILU) preconditioning method, are presented for solving Poisson's equation and linear elasticity problems.« less

  15. Approximate tensor-product preconditioners for very high order discontinuous Galerkin methods

    NASA Astrophysics Data System (ADS)

    Pazner, Will; Persson, Per-Olof

    2018-02-01

    In this paper, we develop a new tensor-product based preconditioner for discontinuous Galerkin methods with polynomial degrees higher than those typically employed. This preconditioner uses an automatic, purely algebraic method to approximate the exact block Jacobi preconditioner by Kronecker products of several small, one-dimensional matrices. Traditional matrix-based preconditioners require O (p2d) storage and O (p3d) computational work, where p is the degree of basis polynomials used, and d is the spatial dimension. Our SVD-based tensor-product preconditioner requires O (p d + 1) storage, O (p d + 1) work in two spatial dimensions, and O (p d + 2) work in three spatial dimensions. Combined with a matrix-free Newton-Krylov solver, these preconditioners allow for the solution of DG systems in linear time in p per degree of freedom in 2D, and reduce the computational complexity from O (p9) to O (p5) in 3D. Numerical results are shown in 2D and 3D for the advection, Euler, and Navier-Stokes equations, using polynomials of degree up to p = 30. For many test cases, the preconditioner results in similar iteration counts when compared with the exact block Jacobi preconditioner, and performance is significantly improved for high polynomial degrees p.

  16. Parallel conjugate gradient algorithms for manipulator dynamic simulation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Scheld, Robert E.

    1989-01-01

    Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).

  17. Scalable parallel elastic-plastic finite element analysis using a quasi-Newton method with a balancing domain decomposition preconditioner

    NASA Astrophysics Data System (ADS)

    Yusa, Yasunori; Okada, Hiroshi; Yamada, Tomonori; Yoshimura, Shinobu

    2018-04-01

    A domain decomposition method for large-scale elastic-plastic problems is proposed. The proposed method is based on a quasi-Newton method in conjunction with a balancing domain decomposition preconditioner. The use of a quasi-Newton method overcomes two problems associated with the conventional domain decomposition method based on the Newton-Raphson method: (1) avoidance of a double-loop iteration algorithm, which generally has large computational complexity, and (2) consideration of the local concentration of nonlinear deformation, which is observed in elastic-plastic problems with stress concentration. Moreover, the application of a balancing domain decomposition preconditioner ensures scalability. Using the conventional and proposed domain decomposition methods, several numerical tests, including weak scaling tests, were performed. The convergence performance of the proposed method is comparable to that of the conventional method. In particular, in elastic-plastic analysis, the proposed method exhibits better convergence performance than the conventional method.

  18. Incomplete Sparse Approximate Inverses for Parallel Preconditioning

    DOE PAGES

    Anzt, Hartwig; Huckle, Thomas K.; Bräckle, Jürgen; ...

    2017-10-28

    In this study, we propose a new preconditioning method that can be seen as a generalization of block-Jacobi methods, or as a simplification of the sparse approximate inverse (SAI) preconditioners. The “Incomplete Sparse Approximate Inverses” (ISAI) is in particular efficient in the solution of sparse triangular linear systems of equations. Those arise, for example, in the context of incomplete factorization preconditioning. ISAI preconditioners can be generated via an algorithm providing fine-grained parallelism, which makes them attractive for hardware with a high concurrency level. Finally, in a study covering a large number of matrices, we identify the ISAI preconditioner as anmore » attractive alternative to exact triangular solves in the context of incomplete factorization preconditioning.« less

  19. A universal preconditioner for simulating condensed phase materials.

    PubMed

    Packwood, David; Kermode, James; Mones, Letif; Bernstein, Noam; Woolley, John; Gould, Nicholas; Ortner, Christoph; Csányi, Gábor

    2016-04-28

    We introduce a universal sparse preconditioner that accelerates geometry optimisation and saddle point search tasks that are common in the atomic scale simulation of materials. Our preconditioner is based on the neighbourhood structure and we demonstrate the gain in computational efficiency in a wide range of materials that include metals, insulators, and molecular solids. The simple structure of the preconditioner means that the gains can be realised in practice not only when using expensive electronic structure models but also for fast empirical potentials. Even for relatively small systems of a few hundred atoms, we observe speedups of a factor of two or more, and the gain grows with system size. An open source Python implementation within the Atomic Simulation Environment is available, offering interfaces to a wide range of atomistic codes.

  20. A universal preconditioner for simulating condensed phase materials

    NASA Astrophysics Data System (ADS)

    Packwood, David; Kermode, James; Mones, Letif; Bernstein, Noam; Woolley, John; Gould, Nicholas; Ortner, Christoph; Csányi, Gábor

    2016-04-01

    We introduce a universal sparse preconditioner that accelerates geometry optimisation and saddle point search tasks that are common in the atomic scale simulation of materials. Our preconditioner is based on the neighbourhood structure and we demonstrate the gain in computational efficiency in a wide range of materials that include metals, insulators, and molecular solids. The simple structure of the preconditioner means that the gains can be realised in practice not only when using expensive electronic structure models but also for fast empirical potentials. Even for relatively small systems of a few hundred atoms, we observe speedups of a factor of two or more, and the gain grows with system size. An open source Python implementation within the Atomic Simulation Environment is available, offering interfaces to a wide range of atomistic codes.

  1. A universal preconditioner for simulating condensed phase materials

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Packwood, David; Ortner, Christoph, E-mail: c.ortner@warwick.ac.uk; Kermode, James, E-mail: j.r.kermode@warwick.ac.uk

    2016-04-28

    We introduce a universal sparse preconditioner that accelerates geometry optimisation and saddle point search tasks that are common in the atomic scale simulation of materials. Our preconditioner is based on the neighbourhood structure and we demonstrate the gain in computational efficiency in a wide range of materials that include metals, insulators, and molecular solids. The simple structure of the preconditioner means that the gains can be realised in practice not only when using expensive electronic structure models but also for fast empirical potentials. Even for relatively small systems of a few hundred atoms, we observe speedups of a factor ofmore » two or more, and the gain grows with system size. An open source Python implementation within the Atomic Simulation Environment is available, offering interfaces to a wide range of atomistic codes.« less

  2. A note on the preconditioner Pm=(I+Sm)

    NASA Astrophysics Data System (ADS)

    Kohno, Toshiyuki; Niki, Hiroshi

    2009-03-01

    Kotakemori et al. [H. Kotakemori, K. Harada, M. Morimoto, H. Niki, A comparison theorem for the iterative method with the preconditioner (I+Smax), Journal of Computational and Applied Mathematics 145 (2002) 373-378] have reported that the convergence rate of the iterative method with a preconditioner Pm=(I+Sm) was superior to one of the modified Gauss-Seidel method under the condition. These authors derived a theorem comparing the Gauss-Seidel method with the proposed method. However, through application of a counter example, Wen Li [Wen Li, A note on the preconditioned GaussSeidel (GS) method for linear systems, Journal of Computational and Applied Mathematics 182 (2005) 81-91] pointed out that there exists a special matrix that does not satisfy this comparison theorem. In this note, we analyze the reason why such a to counter example may be produced, and propose a preconditioner to overcome this problem.

  3. Algebraic multigrid preconditioning within parallel finite-element solvers for 3-D electromagnetic modelling problems in geophysics

    NASA Astrophysics Data System (ADS)

    Koldan, Jelena; Puzyrev, Vladimir; de la Puente, Josep; Houzeaux, Guillaume; Cela, José María

    2014-06-01

    We present an elaborate preconditioning scheme for Krylov subspace methods which has been developed to improve the performance and reduce the execution time of parallel node-based finite-element (FE) solvers for 3-D electromagnetic (EM) numerical modelling in exploration geophysics. This new preconditioner is based on algebraic multigrid (AMG) that uses different basic relaxation methods, such as Jacobi, symmetric successive over-relaxation (SSOR) and Gauss-Seidel, as smoothers and the wave front algorithm to create groups, which are used for a coarse-level generation. We have implemented and tested this new preconditioner within our parallel nodal FE solver for 3-D forward problems in EM induction geophysics. We have performed series of experiments for several models with different conductivity structures and characteristics to test the performance of our AMG preconditioning technique when combined with biconjugate gradient stabilized method. The results have shown that, the more challenging the problem is in terms of conductivity contrasts, ratio between the sizes of grid elements and/or frequency, the more benefit is obtained by using this preconditioner. Compared to other preconditioning schemes, such as diagonal, SSOR and truncated approximate inverse, the AMG preconditioner greatly improves the convergence of the iterative solver for all tested models. Also, when it comes to cases in which other preconditioners succeed to converge to a desired precision, AMG is able to considerably reduce the total execution time of the forward-problem code-up to an order of magnitude. Furthermore, the tests have confirmed that our AMG scheme ensures grid-independent rate of convergence, as well as improvement in convergence regardless of how big local mesh refinements are. In addition, AMG is designed to be a black-box preconditioner, which makes it easy to use and combine with different iterative methods. Finally, it has proved to be very practical and efficient in the

  4. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem

    NASA Astrophysics Data System (ADS)

    Badia, Santiago; Martín, Alberto F.; Planas, Ramon

    2014-10-01

    The thermally coupled incompressible inductionless magnetohydrodynamics (MHD) problem models the flow of an electrically charged fluid under the influence of an external electromagnetic field with thermal coupling. This system of partial differential equations is strongly coupled and highly nonlinear for real cases of interest. Therefore, fully implicit time integration schemes are very desirable in order to capture the different physical scales of the problem at hand. However, solving the multiphysics linear systems of equations resulting from such algorithms is a very challenging task which requires efficient and scalable preconditioners. In this work, a new family of recursive block LU preconditioners is designed and tested for solving the thermally coupled inductionless MHD equations. These preconditioners are obtained after splitting the fully coupled matrix into one-physics problems for every variable (velocity, pressure, current density, electric potential and temperature) that can be optimally solved, e.g., using preconditioned domain decomposition algorithms. The main idea is to arrange the original matrix into an (arbitrary) 2 × 2 block matrix, and consider an LU preconditioner obtained by approximating the corresponding Schur complement. For every one of the diagonal blocks in the LU preconditioner, if it involves more than one type of unknowns, we proceed the same way in a recursive fashion. This approach is stated in an abstract way, and can be straightforwardly applied to other multiphysics problems. Further, we precisely explain a flexible and general software design for the code implementation of this type of preconditioners.

  5. Multilevel Preconditioners for Reaction-Diffusion Problems with Discontinuous Coefficients

    DOE PAGES

    Kolev, Tzanio V.; Xu, Jinchao; Zhu, Yunrong

    2015-08-23

    In this study, we extend some of the multilevel convergence results obtained by Xu and Zhu, to the case of second order linear reaction-diffusion equations. Specifically, we consider the multilevel preconditioners for solving the linear systems arising from the linear finite element approximation of the problem, where both diffusion and reaction coefficients are piecewise-constant functions. We discuss in detail the influence of both the discontinuous reaction and diffusion coefficients to the performance of the classical BPX and multigrid V-cycle preconditioner.

  6. A scalable parallel black oil simulator on distributed memory parallel computers

    NASA Astrophysics Data System (ADS)

    Wang, Kun; Liu, Hui; Chen, Zhangxin

    2015-11-01

    This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.

  7. Incomplete augmented Lagrangian preconditioner for steady incompressible Navier-Stokes equations.

    PubMed

    Tan, Ning-Bo; Huang, Ting-Zhu; Hu, Ze-Jun

    2013-01-01

    An incomplete augmented Lagrangian preconditioner, for the steady incompressible Navier-Stokes equations discretized by stable finite elements, is proposed. The eigenvalues of the preconditioned matrix are analyzed. Numerical experiments show that the incomplete augmented Lagrangian-based preconditioner proposed is very robust and performs quite well by the Picard linearization or the Newton linearization over a wide range of values of the viscosity on both uniform and stretched grids.

  8. Incomplete Augmented Lagrangian Preconditioner for Steady Incompressible Navier-Stokes Equations

    PubMed Central

    Tan, Ning-Bo; Huang, Ting-Zhu; Hu, Ze-Jun

    2013-01-01

    An incomplete augmented Lagrangian preconditioner, for the steady incompressible Navier-Stokes equations discretized by stable finite elements, is proposed. The eigenvalues of the preconditioned matrix are analyzed. Numerical experiments show that the incomplete augmented Lagrangian-based preconditioner proposed is very robust and performs quite well by the Picard linearization or the Newton linearization over a wide range of values of the viscosity on both uniform and stretched grids. PMID:24235888

  9. Parallel computing techniques for rotorcraft aerodynamics

    NASA Astrophysics Data System (ADS)

    Ekici, Kivanc

    The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).

  10. Incomplete Gröbner basis as a preconditioner for polynomial systems

    NASA Astrophysics Data System (ADS)

    Sun, Yang; Tao, Yu-Hui; Bai, Feng-Shan

    2009-04-01

    Precondition plays a critical role in the numerical methods for large and sparse linear systems. It is also true for nonlinear algebraic systems. In this paper incomplete Gröbner basis (IGB) is proposed as a preconditioner of homotopy methods for polynomial systems of equations, which transforms a deficient system into a system with the same finite solutions, but smaller degree. The reduced system can thus be solved faster. Numerical results show the efficiency of the preconditioner.

  11. AZTEC: A parallel iterative package for the solving linear systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.

    1996-12-31

    We describe a parallel linear system package, AZTEC. The package incorporates a number of parallel iterative methods (e.g. GMRES, biCGSTAB, CGS, TFQMR) and preconditioners (e.g. Jacobi, Gauss-Seidel, polynomial, domain decomposition with LU or ILU within subdomains). Additionally, AZTEC allows for the reuse of previous preconditioning factorizations within Newton schemes for nonlinear methods. Currently, a number of different users are using this package to solve a variety of PDE applications.

  12. Semi-automatic sparse preconditioners for high-order finite element methods on non-uniform meshes

    NASA Astrophysics Data System (ADS)

    Austin, Travis M.; Brezina, Marian; Jamroz, Ben; Jhurani, Chetan; Manteuffel, Thomas A.; Ruge, John

    2012-05-01

    High-order finite elements often have a higher accuracy per degree of freedom than the classical low-order finite elements. However, in the context of implicit time-stepping methods, high-order finite elements present challenges to the construction of efficient simulations due to the high cost of inverting the denser finite element matrix. There are many cases where simulations are limited by the memory required to store the matrix and/or the algorithmic components of the linear solver. We are particularly interested in preconditioned Krylov methods for linear systems generated by discretization of elliptic partial differential equations with high-order finite elements. Using a preconditioner like Algebraic Multigrid can be costly in terms of memory due to the need to store matrix information at the various levels. We present a novel method for defining a preconditioner for systems generated by high-order finite elements that is based on a much sparser system than the original high-order finite element system. We investigate the performance for non-uniform meshes on a cube and a cubed sphere mesh, showing that the sparser preconditioner is more efficient and uses significantly less memory. Finally, we explore new methods to construct the sparse preconditioner and examine their effectiveness for non-uniform meshes. We compare results to a direct use of Algebraic Multigrid as a preconditioner and to a two-level additive Schwarz method.

  13. Multigrid solvers and multigrid preconditioners for the solution of variational data assimilation problems

    NASA Astrophysics Data System (ADS)

    Debreu, Laurent; Neveu, Emilie; Simon, Ehouarn; Le Dimet, Francois Xavier; Vidard, Arthur

    2014-05-01

    In order to lower the computational cost of the variational data assimilation process, we investigate the use of multigrid methods to solve the associated optimal control system. On a linear advection equation, we study the impact of the regularization term on the optimal control and the impact of discretization errors on the efficiency of the coarse grid correction step. We show that even if the optimal control problem leads to the solution of an elliptic system, numerical errors introduced by the discretization can alter the success of the multigrid methods. The view of the multigrid iteration as a preconditioner for a Krylov optimization method leads to a more robust algorithm. A scale dependent weighting of the multigrid preconditioner and the usual background error covariance matrix based preconditioner is proposed and brings significant improvements. [1] Laurent Debreu, Emilie Neveu, Ehouarn Simon, François-Xavier Le Dimet and Arthur Vidard, 2014: Multigrid solvers and multigrid preconditioners for the solution of variational data assimilation problems, submitted to QJRMS, http://hal.inria.fr/hal-00874643 [2] Emilie Neveu, Laurent Debreu and François-Xavier Le Dimet, 2011: Multigrid methods and data assimilation - Convergence study and first experiments on non-linear equations, ARIMA, 14, 63-80, http://intranet.inria.fr/international/arima/014/014005.html

  14. Algorithmically scalable block preconditioner for fully implicit shallow-water equations in CAM-SE

    DOE PAGES

    Lott, P. Aaron; Woodward, Carol S.; Evans, Katherine J.

    2014-10-19

    Performing accurate and efficient numerical simulation of global atmospheric climate models is challenging due to the disparate length and time scales over which physical processes interact. Implicit solvers enable the physical system to be integrated with a time step commensurate with processes being studied. The dominant cost of an implicit time step is the ancillary linear system solves, so we have developed a preconditioner aimed at improving the efficiency of these linear system solves. Our preconditioner is based on an approximate block factorization of the linearized shallow-water equations and has been implemented within the spectral element dynamical core within themore » Community Atmospheric Model (CAM-SE). Furthermore, in this paper we discuss the development and scalability of the preconditioner for a suite of test cases with the implicit shallow-water solver within CAM-SE.« less

  15. Tensor-product preconditioners for higher-order space-time discontinuous Galerkin methods

    NASA Astrophysics Data System (ADS)

    Diosady, Laslo T.; Murman, Scott M.

    2017-02-01

    A space-time discontinuous-Galerkin spectral-element discretization is presented for direct numerical simulation of the compressible Navier-Stokes equations. An efficient solution technique based on a matrix-free Newton-Krylov method is developed in order to overcome the stiffness associated with high solution order. The use of tensor-product basis functions is key to maintaining efficiency at high-order. Efficient preconditioning methods are presented which can take advantage of the tensor-product formulation. A diagonalized Alternating-Direction-Implicit (ADI) scheme is extended to the space-time discontinuous Galerkin discretization. A new preconditioner for the compressible Euler/Navier-Stokes equations based on the fast-diagonalization method is also presented. Numerical results demonstrate the effectiveness of these preconditioners for the direct numerical simulation of subsonic turbulent flows.

  16. Tensor-Product Preconditioners for Higher-Order Space-Time Discontinuous Galerkin Methods

    NASA Technical Reports Server (NTRS)

    Diosady, Laslo T.; Murman, Scott M.

    2016-01-01

    space-time discontinuous-Galerkin spectral-element discretization is presented for direct numerical simulation of the compressible Navier-Stokes equat ions. An efficient solution technique based on a matrix-free Newton-Krylov method is developed in order to overcome the stiffness associated with high solution order. The use of tensor-product basis functions is key to maintaining efficiency at high order. Efficient preconditioning methods are presented which can take advantage of the tensor-product formulation. A diagonalized Alternating-Direction-Implicit (ADI) scheme is extended to the space-time discontinuous Galerkin discretization. A new preconditioner for the compressible Euler/Navier-Stokes equations based on the fast-diagonalization method is also presented. Numerical results demonstrate the effectiveness of these preconditioners for the direct numerical simulation of subsonic turbulent flows.

  17. Assessment of Preconditioner for a USM3D Hierarchical Adaptive Nonlinear Method (HANIM) (Invited)

    NASA Technical Reports Server (NTRS)

    Pandya, Mohagna J.; Diskin, Boris; Thomas, James L.; Frink, Neal T.

    2016-01-01

    Enhancements to the previously reported mixed-element USM3D Hierarchical Adaptive Nonlinear Iteration Method (HANIM) framework have been made to further improve robustness, efficiency, and accuracy of computational fluid dynamic simulations. The key enhancements include a multi-color line-implicit preconditioner, a discretely consistent symmetry boundary condition, and a line-mapping method for the turbulence source term discretization. The USM3D iterative convergence for the turbulent flows is assessed on four configurations. The configurations include a two-dimensional (2D) bump-in-channel, the 2D NACA 0012 airfoil, a three-dimensional (3D) bump-in-channel, and a 3D hemisphere cylinder. The Reynolds Averaged Navier Stokes (RANS) solutions have been obtained using a Spalart-Allmaras turbulence model and families of uniformly refined nested grids. Two types of HANIM solutions using line- and point-implicit preconditioners have been computed. Additional solutions using the point-implicit preconditioner alone (PA) method that broadly represents the baseline solver technology have also been computed. The line-implicit HANIM shows superior iterative convergence in most cases with progressively increasing benefits on finer grids.

  18. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures

    DTIC Science & Technology

    2008-10-01

    42 3.4 Residual history of WSO banded preconditioner for problem 2D 54019 HIGHK . . . . . . . . . . . . . . . . . . . . . . . . . . 43...3.5 Residual history of WSO banded preconditioner for problem Appu 43 3.6 Residual history of WSO banded preconditioner for problem ASIC 680k...44 3.7 Residual history of WSO banded preconditioner for problem BUN- DLE1

  19. Final Report, DE-FG01-06ER25718 Domain Decomposition and Parallel Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Widlund, Olof B.

    2015-06-09

    The goal of this project is to develop and improve domain decomposition algorithms for a variety of partial differential equations such as those of linear elasticity and electro-magnetics.These iterative methods are designed for massively parallel computing systems and allow the fast solution of the very large systems of algebraic equations that arise in large scale and complicated simulations. A special emphasis is placed on problems arising from Maxwell's equation. The approximate solvers, the preconditioners, are combined with the conjugate gradient method and must always include a solver of a coarse model in order to have a performance which is independentmore » of the number of processors used in the computer simulation. A recent development allows for an adaptive construction of this coarse component of the preconditioner.« less

  20. Array-based, parallel hierarchical mesh refinement algorithms for unstructured meshes

    DOE PAGES

    Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...

    2016-08-18

    In this paper, we describe an array-based hierarchical mesh refinement capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial coarse mesh that can be used for a variety of purposes such as in multigrid solvers/preconditioners, to do solution convergence and verification studies and to improve overall parallel efficiency by decreasing I/O bandwidth requirements (by loading smaller meshes and in memory refinement). We also describe a high-order boundary reconstruction capability that can be used tomore » project the new points after refinement using high-order approximations instead of linear projection in order to minimize and provide more control on geometrical errors introduced by curved boundaries.The capability is developed under the parallel unstructured mesh framework "Mesh Oriented dAtaBase" (MOAB Tautges et al. (2004)). We describe the underlying data structures and algorithms to generate such hierarchies in parallel and present numerical results for computational efficiency and effect on mesh quality. Furthermore, we also present results to demonstrate the applicability of the developed capability to study convergence properties of different point projection schemes for various mesh hierarchies and to a multigrid finite-element solver for elliptic problems.« less

  1. A preconditioner for the finite element computation of incompressible, nonlinear elastic deformations

    NASA Astrophysics Data System (ADS)

    Whiteley, J. P.

    2017-10-01

    Large, incompressible elastic deformations are governed by a system of nonlinear partial differential equations. The finite element discretisation of these partial differential equations yields a system of nonlinear algebraic equations that are usually solved using Newton's method. On each iteration of Newton's method, a linear system must be solved. We exploit the structure of the Jacobian matrix to propose a preconditioner, comprising two steps. The first step is the solution of a relatively small, symmetric, positive definite linear system using the preconditioned conjugate gradient method. This is followed by a small number of multigrid V-cycles for a larger linear system. Through the use of exemplar elastic deformations, the preconditioner is demonstrated to facilitate the iterative solution of the linear systems arising. The number of GMRES iterations required has only a very weak dependence on the number of degrees of freedom of the linear systems.

  2. Tensor-product preconditioners for a space-time discontinuous Galerkin method

    NASA Astrophysics Data System (ADS)

    Diosady, Laslo T.; Murman, Scott M.

    2014-10-01

    A space-time discontinuous Galerkin spectral element discretization is presented for direct numerical simulation of the compressible Navier-Stokes equations. An efficient solution technique based on a matrix-free Newton-Krylov method is presented. A diagonalized alternating direction implicit preconditioner is extended to a space-time formulation using entropy variables. The effectiveness of this technique is demonstrated for the direct numerical simulation of turbulent flow in a channel.

  3. Optimal parallel solution of sparse triangular systems

    NASA Technical Reports Server (NTRS)

    Alvarado, Fernando L.; Schreiber, Robert

    1990-01-01

    A method for the parallel solution of triangular sets of equations is described that is appropriate when there are many right-handed sides. By preprocessing, the method can reduce the number of parallel steps required to solve Lx = b compared to parallel forward or backsolve. Applications are to iterative solvers with triangular preconditioners, to structural analysis, or to power systems applications, where there may be many right-handed sides (not all available a priori). The inverse of L is represented as a product of sparse triangular factors. The problem is to find a factored representation of this inverse of L with the smallest number of factors (or partitions), subject to the requirement that no new nonzero elements be created in the formation of these inverse factors. A method from an earlier reference is shown to solve this problem. This method is improved upon by constructing a permutation of the rows and columns of L that preserves triangularity and allow for the best possible such partition. A number of practical examples and algorithmic details are presented. The parallelism attainable is illustrated by means of elimination trees and clique trees.

  4. Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, P. T.; Shadid, J. N.; Hu, J. J.

    Here, we explore the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. Our study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of themore » original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.« less

  5. Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD

    DOE PAGES

    Lin, P. T.; Shadid, J. N.; Hu, J. J.; ...

    2017-11-06

    Here, we explore the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. Our study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of themore » original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.« less

  6. Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation

    NASA Technical Reports Server (NTRS)

    Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.

    1996-01-01

    We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and, economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.

  7. The multigrid preconditioned conjugate gradient method

    NASA Technical Reports Server (NTRS)

    Tatebe, Osamu

    1993-01-01

    A multigrid preconditioned conjugate gradient method (MGCG method), which uses the multigrid method as a preconditioner of the PCG method, is proposed. The multigrid method has inherent high parallelism and improves convergence of long wavelength components, which is important in iterative methods. By using this method as a preconditioner of the PCG method, an efficient method with high parallelism and fast convergence is obtained. First, it is considered a necessary condition of the multigrid preconditioner in order to satisfy requirements of a preconditioner of the PCG method. Next numerical experiments show a behavior of the MGCG method and that the MGCG method is superior to both the ICCG method and the multigrid method in point of fast convergence and high parallelism. This fast convergence is understood in terms of the eigenvalue analysis of the preconditioned matrix. From this observation of the multigrid preconditioner, it is realized that the MGCG method converges in very few iterations and the multigrid preconditioner is a desirable preconditioner of the conjugate gradient method.

  8. Parallelization of the preconditioned IDR solver for modern multicore computer systems

    NASA Astrophysics Data System (ADS)

    Bessonov, O. A.; Fedoseyev, A. I.

    2012-10-01

    This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).

  9. Parallel Preconditioning for CFD Problems on the CM-5

    NASA Technical Reports Server (NTRS)

    Simon, Horst D.; Kremenetsky, Mark D.; Richardson, John; Lasinski, T. A. (Technical Monitor)

    1994-01-01

    Up to today, preconditioning methods on massively parallel systems have faced a major difficulty. The most successful preconditioning methods in terms of accelerating the convergence of the iterative solver such as incomplete LU factorizations are notoriously difficult to implement on parallel machines for two reasons: (1) the actual computation of the preconditioner is not very floating-point intensive, but requires a large amount of unstructured communication, and (2) the application of the preconditioning matrix in the iteration phase (i.e. triangular solves) are difficult to parallelize because of the recursive nature of the computation. Here we present a new approach to preconditioning for very large, sparse, unsymmetric, linear systems, which avoids both difficulties. We explicitly compute an approximate inverse to our original matrix. This new preconditioning matrix can be applied most efficiently for iterative methods on massively parallel machines, since the preconditioning phase involves only a matrix-vector multiplication, with possibly a dense matrix. Furthermore the actual computation of the preconditioning matrix has natural parallelism. For a problem of size n, the preconditioning matrix can be computed by solving n independent small least squares problems. The algorithm and its implementation on the Connection Machine CM-5 are discussed in detail and supported by extensive timings obtained from real problem data.

  10. Udzawa-type iterative method with parareal preconditioner for a parabolic optimal control problem

    NASA Astrophysics Data System (ADS)

    Lapin, A.; Romanenko, A.

    2016-11-01

    The article deals with the optimal control problem with the parabolic equation as state problem. There are point-wise constraints on the state and control functions. The objective functional involves the observation given in the domain at each moment. The conditions for convergence Udzawa's type iterative method are given. The parareal method to inverse preconditioner is given. The results of calculations are presented.

  11. A hierarchical preconditioner for the electric field integral equation on unstructured meshes based on primal and dual Haar bases

    NASA Astrophysics Data System (ADS)

    Adrian, S. B.; Andriulli, F. P.; Eibert, T. F.

    2017-02-01

    A new hierarchical basis preconditioner for the electric field integral equation (EFIE) operator is introduced. In contrast to existing hierarchical basis preconditioners, it works on arbitrary meshes and preconditions both the vector and the scalar potential within the EFIE operator. This is obtained by taking into account that the vector and the scalar potential discretized with loop-star basis functions are related to the hypersingular and the single layer operator (i.e., the well known integral operators from acoustics). For the single layer operator discretized with piecewise constant functions, a hierarchical preconditioner can easily be constructed. Thus the strategy we propose in this work for preconditioning the EFIE is the transformation of the scalar and the vector potential into operators equivalent to the single layer operator and to its inverse. More specifically, when the scalar potential is discretized with star functions as source and testing functions, the resulting matrix is a single layer operator discretized with piecewise constant functions and multiplied left and right with two additional graph Laplacian matrices. By inverting these graph Laplacian matrices, the discretized single layer operator is obtained, which can be preconditioned with the hierarchical basis. Dually, when the vector potential is discretized with loop functions, the resulting matrix can be interpreted as a hypersingular operator discretized with piecewise linear functions. By leveraging on a scalar Calderón identity, we can interpret this operator as spectrally equivalent to the inverse single layer operator. Then we use a linear-in-complexity, closed-form inverse of the dual hierarchical basis to precondition the hypersingular operator. The numerical results show the effectiveness of the proposed preconditioner and the practical impact of theoretical developments in real case scenarios.

  12. Performance Models for the Spike Banded Linear System Solver

    DOE PAGES

    Manguoglu, Murat; Saied, Faisal; Sameh, Ahmed; ...

    2011-01-01

    With availability of large-scale parallel platforms comprised of tens-of-thousands of processors and beyond, there is significant impetus for the development of scalable parallel sparse linear system solvers and preconditioners. An integral part of this design process is the development of performance models capable of predicting performance and providing accurate cost models for the solvers and preconditioners. There has been some work in the past on characterizing performance of the iterative solvers themselves. In this paper, we investigate the problem of characterizing performance and scalability of banded preconditioners. Recent work has demonstrated the superior convergence properties and robustness of banded preconditioners,more » compared to state-of-the-art ILU family of preconditioners as well as algebraic multigrid preconditioners. Furthermore, when used in conjunction with efficient banded solvers, banded preconditioners are capable of significantly faster time-to-solution. Our banded solver, the Truncated Spike algorithm is specifically designed for parallel performance and tolerance to deep memory hierarchies. Its regular structure is also highly amenable to accurate performance characterization. Using these characteristics, we derive the following results in this paper: (i) we develop parallel formulations of the Truncated Spike solver, (ii) we develop a highly accurate pseudo-analytical parallel performance model for our solver, (iii) we show excellent predication capabilities of our model – based on which we argue the high scalability of our solver. Our pseudo-analytical performance model is based on analytical performance characterization of each phase of our solver. These analytical models are then parameterized using actual runtime information on target platforms. An important consequence of our performance models is that they reveal underlying performance bottlenecks in both serial and parallel formulations. All of our results are validated

  13. FoSSI: the family of simplified solver interfaces for the rapid development of parallel numerical atmosphere and ocean models

    NASA Astrophysics Data System (ADS)

    Frickenhaus, Stephan; Hiller, Wolfgang; Best, Meike

    The portable software FoSSI is introduced that—in combination with additional free solver software packages—allows for an efficient and scalable parallel solution of large sparse linear equations systems arising in finite element model codes. FoSSI is intended to support rapid model code development, completely hiding the complexity of the underlying solver packages. In particular, the model developer need not be an expert in parallelization and is yet free to switch between different solver packages by simple modifications of the interface call. FoSSI offers an efficient and easy, yet flexible interface to several parallel solvers, most of them available on the web, such as PETSC, AZTEC, MUMPS, PILUT and HYPRE. FoSSI makes use of the concept of handles for vectors, matrices, preconditioners and solvers, that is frequently used in solver libraries. Hence, FoSSI allows for a flexible treatment of several linear equations systems and associated preconditioners at the same time, even in parallel on separate MPI-communicators. The second special feature in FoSSI is the task specifier, being a combination of keywords, each configuring a certain phase in the solver setup. This enables the user to control a solver over one unique subroutine. Furthermore, FoSSI has rather similar features for all solvers, making a fast solver intercomparison or exchange an easy task. FoSSI is a community software, proven in an adaptive 2D-atmosphere model and a 3D-primitive equation ocean model, both formulated in finite elements. The present paper discusses perspectives of an OpenMP-implementation of parallel iterative solvers based on domain decomposition methods. This approach to OpenMP solvers is rather attractive, as the code for domain-local operations of factorization, preconditioning and matrix-vector product can be readily taken from a sequential implementation that is also suitable to be used in an MPI-variant. Code development in this direction is in an advanced state under

  14. Matrix-Free Polynomial-Based Nonlinear Least Squares Optimized Preconditioning and its Application to Discontinuous Galerkin Discretizations of the Euler Equations

    DTIC Science & Technology

    2015-06-01

    cient parallel code for applying the operator. Our method constructs a polynomial preconditioner using a nonlinear least squares (NLLS) algorithm. We show...apply the underlying operator. Such a preconditioner can be very attractive in scenarios where one has a highly efficient parallel code for applying...repeatedly solve a large system of linear equations where one has an extremely fast parallel code for applying an underlying fixed linear operator

  15. Inversion of potential field data using the finite element method on parallel computers

    NASA Astrophysics Data System (ADS)

    Gross, L.; Altinay, C.; Shaw, S.

    2015-11-01

    In this paper we present a formulation of the joint inversion of potential field anomaly data as an optimization problem with partial differential equation (PDE) constraints. The problem is solved using the iterative Broyden-Fletcher-Goldfarb-Shanno (BFGS) method with the Hessian operator of the regularization and cross-gradient component of the cost function as preconditioner. We will show that each iterative step requires the solution of several PDEs namely for the potential fields, for the adjoint defects and for the application of the preconditioner. In extension to the traditional discrete formulation the BFGS method is applied to continuous descriptions of the unknown physical properties in combination with an appropriate integral form of the dot product. The PDEs can easily be solved using standard conforming finite element methods (FEMs) with potentially different resolutions. For two examples we demonstrate that the number of PDE solutions required to reach a given tolerance in the BFGS iteration is controlled by weighting regularization and cross-gradient but is independent of the resolution of PDE discretization and that as a consequence the method is weakly scalable with the number of cells on parallel computers. We also show a comparison with the UBC-GIF GRAV3D code.

  16. MapReduce Based Parallel Bayesian Network for Manufacturing Quality Control

    NASA Astrophysics Data System (ADS)

    Zheng, Mao-Kuan; Ming, Xin-Guo; Zhang, Xian-Yu; Li, Guo-Ming

    2017-09-01

    Increasing complexity of industrial products and manufacturing processes have challenged conventional statistics based quality management approaches in the circumstances of dynamic production. A Bayesian network and big data analytics integrated approach for manufacturing process quality analysis and control is proposed. Based on Hadoop distributed architecture and MapReduce parallel computing model, big volume and variety quality related data generated during the manufacturing process could be dealt with. Artificial intelligent algorithms, including Bayesian network learning, classification and reasoning, are embedded into the Reduce process. Relying on the ability of the Bayesian network in dealing with dynamic and uncertain problem and the parallel computing power of MapReduce, Bayesian network of impact factors on quality are built based on prior probability distribution and modified with posterior probability distribution. A case study on hull segment manufacturing precision management for ship and offshore platform building shows that computing speed accelerates almost directly proportionally to the increase of computing nodes. It is also proved that the proposed model is feasible for locating and reasoning of root causes, forecasting of manufacturing outcome, and intelligent decision for precision problem solving. The integration of bigdata analytics and BN method offers a whole new perspective in manufacturing quality control.

  17. Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme

    DOE PAGES

    Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.; ...

    2016-11-07

    Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less

  18. Parallel Dynamics Simulation Using a Krylov-Schwarz Linear Solution Scheme

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Abhyankar, Shrirang; Constantinescu, Emil M.; Smith, Barry F.

    Fast dynamics simulation of large-scale power systems is a computational challenge because of the need to solve a large set of stiff, nonlinear differential-algebraic equations at every time step. The main bottleneck in dynamic simulations is the solution of a linear system during each nonlinear iteration of Newton’s method. In this paper, we present a parallel Krylov- Schwarz linear solution scheme that uses the Krylov subspacebased iterative linear solver GMRES with an overlapping restricted additive Schwarz preconditioner. As a result, performance tests of the proposed Krylov-Schwarz scheme for several large test cases ranging from 2,000 to 20,000 buses, including amore » real utility network, show good scalability on different computing architectures.« less

  19. Preconditioner and convergence study for the Quantum Computer Aided Design (QCAD) nonlinear poisson problem posed on the Ottawa Flat 270 design geometry.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kalashnikova, Irina

    2012-05-01

    A numerical study aimed to evaluate different preconditioners within the Trilinos Ifpack and ML packages for the Quantum Computer Aided Design (QCAD) non-linear Poisson problem implemented within the Albany code base and posed on the Ottawa Flat 270 design geometry is performed. This study led to some new development of Albany that allows the user to select an ML preconditioner with Zoltan repartitioning based on nodal coordinates, which is summarized. Convergence of the numerical solutions computed within the QCAD computational suite with successive mesh refinement is examined in two metrics, the mean value of the solution (an L{sup 1} norm)more » and the field integral of the solution (L{sup 2} norm).« less

  20. PIXIE3D: A Parallel, Implicit, eXtended MHD 3D Code.

    NASA Astrophysics Data System (ADS)

    Chacon, L.; Knoll, D. A.

    2004-11-01

    We report on the development of PIXIE3D, a 3D parallel, fully implicit Newton-Krylov extended primitive-variable MHD code in general curvilinear geometry. PIXIE3D employs a second-order, finite-volume-based spatial discretization that satisfies remarkable properties such as being conservative, solenoidal in the magnetic field, non-dissipative, and stable in the absence of physical dissipation.(L. Chacón , phComput. Phys. Comm.) submitted (2004) PIXIE3D employs fully-implicit Newton-Krylov methods for the time advance. Currently, first and second-order implicit schemes are available, although higher-order temporal implicit schemes can be effortlessly implemented within the Newton-Krylov framework. A successful, scalable, MG physics-based preconditioning strategy, similar in concept to previous 2D MHD efforts,(L. Chacón et al., phJ. Comput. Phys). 178 (1), 15- 36 (2002); phJ. Comput. Phys., 188 (2), 573-592 (2003) has been developed. We are currently in the process of parallelizing the code using the PETSc library, and a Newton-Krylov-Schwarz approach for the parallel treatment of the preconditioner. In this poster, we will report on both the serial and parallel performance of PIXIE3D, focusing primarily on scalability and CPU speedup vs. an explicit approach.

  1. Newton-Krylov-Schwarz: An implicit solver for CFD

    NASA Technical Reports Server (NTRS)

    Cai, Xiao-Chuan; Keyes, David E.; Venkatakrishnan, V.

    1995-01-01

    Newton-Krylov methods and Krylov-Schwarz (domain decomposition) methods have begun to become established in computational fluid dynamics (CFD) over the past decade. The former employ a Krylov method inside of Newton's method in a Jacobian-free manner, through directional differencing. The latter employ an overlapping Schwarz domain decomposition to derive a preconditioner for the Krylov accelerator that relies primarily on local information, for data-parallel concurrency. They may be composed as Newton-Krylov-Schwarz (NKS) methods, which seem particularly well suited for solving nonlinear elliptic systems in high-latency, distributed-memory environments. We give a brief description of this family of algorithms, with an emphasis on domain decomposition iterative aspects. We then describe numerical simulations with Newton-Krylov-Schwarz methods on aerodynamics applications emphasizing comparisons with a standard defect-correction approach, subdomain preconditioner consistency, subdomain preconditioner quality, and the effect of a coarse grid.

  2. Hybrid parallelization of the XTOR-2F code for the simulation of two-fluid MHD instabilities in tokamaks

    NASA Astrophysics Data System (ADS)

    Marx, Alain; Lütjens, Hinrich

    2017-03-01

    A hybrid MPI/OpenMP parallel version of the XTOR-2F code [Lütjens and Luciani, J. Comput. Phys. 229 (2010) 8130] solving the two-fluid MHD equations in full tokamak geometry by means of an iterative Newton-Krylov matrix-free method has been developed. The present work shows that the code has been parallelized significantly despite the numerical profile of the problem solved by XTOR-2F, i.e. a discretization with pseudo-spectral representations in all angular directions, the stiffness of the two-fluid stability problem in tokamaks, and the use of a direct LU decomposition to invert the physical pre-conditioner at every Krylov iteration of the solver. The execution time of the parallelized version is an order of magnitude smaller than the sequential one for low resolution cases, with an increasing speedup when the discretization mesh is refined. Moreover, it allows to perform simulations with higher resolutions, previously forbidden because of memory limitations.

  3. Beam quality corrections for parallel-plate ion chambers in electron reference dosimetry

    NASA Astrophysics Data System (ADS)

    Zink, K.; Wulff, J.

    2012-04-01

    Current dosimetry protocols (AAPM, IAEA, IPEM, DIN) recommend parallel-plate ionization chambers for dose measurements in clinical electron beams. This study presents detailed Monte Carlo simulations of beam quality correction factors for four different types of parallel-plate chambers: NACP-02, Markus, Advanced Markus and Roos. These chambers differ in constructive details which should have notable impact on the resulting perturbation corrections, hence on the beam quality corrections. The results reveal deviations to the recommended beam quality corrections given in the IAEA TRS-398 protocol in the range of 0%-2% depending on energy and chamber type. For well-guarded chambers, these deviations could be traced back to a non-unity and energy-dependent wall perturbation correction. In the case of the guardless Markus chamber, a nearly energy-independent beam quality correction is resulting as the effects of wall and cavity perturbation compensate each other. For this chamber, the deviations to the recommended values are the largest and may exceed 2%. From calculations of type-B uncertainties including effects due to uncertainties of the underlying cross-sectional data as well as uncertainties due to the chamber material composition and chamber geometry, the overall uncertainty of calculated beam quality correction factors was estimated to be <0.7%. Due to different chamber positioning recommendations given in the national and international dosimetry protocols, an additional uncertainty in the range of 0.2%-0.6% is present. According to the IAEA TRS-398 protocol, the uncertainty in clinical electron dosimetry using parallel-plate ion chambers is 1.7%. This study may help to reduce this uncertainty significantly.

  4. A scalable nonlinear fluid-structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D

    NASA Astrophysics Data System (ADS)

    Kong, Fande; Cai, Xiao-Chuan

    2017-07-01

    Nonlinear fluid-structure interaction (FSI) problems on unstructured meshes in 3D appear in many applications in science and engineering, such as vibration analysis of aircrafts and patient-specific diagnosis of cardiovascular diseases. In this work, we develop a highly scalable, parallel algorithmic and software framework for FSI problems consisting of a nonlinear fluid system and a nonlinear solid system, that are coupled monolithically. The FSI system is discretized by a stabilized finite element method in space and a fully implicit backward difference scheme in time. To solve the large, sparse system of nonlinear algebraic equations at each time step, we propose an inexact Newton-Krylov method together with a multilevel, smoothed Schwarz preconditioner with isogeometric coarse meshes generated by a geometry preserving coarsening algorithm. Here "geometry" includes the boundary of the computational domain and the wet interface between the fluid and the solid. We show numerically that the proposed algorithm and implementation are highly scalable in terms of the number of linear and nonlinear iterations and the total compute time on a supercomputer with more than 10,000 processor cores for several problems with hundreds of millions of unknowns.

  5. A scalable nonlinear fluid–structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D

    DOE PAGES

    Kong, Fande; Cai, Xiao-Chuan

    2017-03-24

    Nonlinear fluid-structure interaction (FSI) problems on unstructured meshes in 3D appear many applications in science and engineering, such as vibration analysis of aircrafts and patient-specific diagnosis of cardiovascular diseases. In this work, we develop a highly scalable, parallel algorithmic and software framework for FSI problems consisting of a nonlinear fluid system and a nonlinear solid system, that are coupled monolithically. The FSI system is discretized by a stabilized finite element method in space and a fully implicit backward difference scheme in time. To solve the large, sparse system of nonlinear algebraic equations at each time step, we propose an inexactmore » Newton-Krylov method together with a multilevel, smoothed Schwarz preconditioner with isogeometric coarse meshes generated by a geometry preserving coarsening algorithm. Here ''geometry'' includes the boundary of the computational domain and the wet interface between the fluid and the solid. We show numerically that the proposed algorithm and implementation are highly scalable in terms of the number of linear and nonlinear iterations and the total compute time on a supercomputer with more than 10,000 processor cores for several problems with hundreds of millions of unknowns.« less

  6. Preconditioner-free Wiener filtering with a dense noise matrix

    NASA Astrophysics Data System (ADS)

    Huffenberger, Kevin M.

    2018-05-01

    This work extends the Elsner & Wandelt (2013) iterative method for efficient, preconditioner-free Wiener filtering to cases in which the noise covariance matrix is dense, but can be decomposed into a sum whose parts are sparse in convenient bases. The new method, which uses multiple messenger fields, reproduces Wiener-filter solutions for test problems, and we apply it to a case beyond the reach of the Elsner & Wandelt (2013) method. We compute the Wiener-filter solution for a simulated Cosmic Microwave Background (CMB) map that contains spatially varying, uncorrelated noise, isotropic 1/f noise, and large-scale horizontal stripes (like those caused by atmospheric noise). We discuss simple extensions that can filter contaminated modes or inverse-noise-filter the data. These techniques help to address complications in the noise properties of maps from current and future generations of ground-based Microwave Background experiments, like Advanced ACTPol, Simons Observatory, and CMB-S4.

  7. Parallel Computation of Flow in Heterogeneous Media Modelled by Mixed Finite Elements

    NASA Astrophysics Data System (ADS)

    Cliffe, K. A.; Graham, I. G.; Scheichl, R.; Stals, L.

    2000-11-01

    In this paper we describe a fast parallel method for solving highly ill-conditioned saddle-point systems arising from mixed finite element simulations of stochastic partial differential equations (PDEs) modelling flow in heterogeneous media. Each realisation of these stochastic PDEs requires the solution of the linear first-order velocity-pressure system comprising Darcy's law coupled with an incompressibility constraint. The chief difficulty is that the permeability may be highly variable, especially when the statistical model has a large variance and a small correlation length. For reasonable accuracy, the discretisation has to be extremely fine. We solve these problems by first reducing the saddle-point formulation to a symmetric positive definite (SPD) problem using a suitable basis for the space of divergence-free velocities. The reduced problem is solved using parallel conjugate gradients preconditioned with an algebraically determined additive Schwarz domain decomposition preconditioner. The result is a solver which exhibits a good degree of robustness with respect to the mesh size as well as to the variance and to physically relevant values of the correlation length of the underlying permeability field. Numerical experiments exhibit almost optimal levels of parallel efficiency. The domain decomposition solver (DOUG, http://www.maths.bath.ac.uk/~parsoft) used here not only is applicable to this problem but can be used to solve general unstructured finite element systems on a wide range of parallel architectures.

  8. Robust parallel iterative solvers for linear and least-squares problems, Final Technical Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Saad, Yousef

    2014-01-16

    The primary goal of this project is to study and develop robust iterative methods for solving linear systems of equations and least squares systems. The focus of the Minnesota team is on algorithms development, robustness issues, and on tests and validation of the methods on realistic problems. 1. The project begun with an investigation on how to practically update a preconditioner obtained from an ILU-type factorization, when the coefficient matrix changes. 2. We investigated strategies to improve robustness in parallel preconditioners in a specific case of a PDE with discontinuous coefficients. 3. We explored ways to adapt standard preconditioners formore » solving linear systems arising from the Helmholtz equation. These are often difficult linear systems to solve by iterative methods. 4. We have also worked on purely theoretical issues related to the analysis of Krylov subspace methods for linear systems. 5. We developed an effective strategy for performing ILU factorizations for the case when the matrix is highly indefinite. The strategy uses shifting in some optimal way. The method was extended to the solution of Helmholtz equations by using complex shifts, yielding very good results in many cases. 6. We addressed the difficult problem of preconditioning sparse systems of equations on GPUs. 7. A by-product of the above work is a software package consisting of an iterative solver library for GPUs based on CUDA. This was made publicly available. It was the first such library that offers complete iterative solvers for GPUs. 8. We considered another form of ILU which blends coarsening techniques from Multigrid with algebraic multilevel methods. 9. We have released a new version on our parallel solver - called pARMS [new version is version 3]. As part of this we have tested the code in complex settings - including the solution of Maxwell and Helmholtz equations and for a problem of crystal growth.10. As an application of polynomial preconditioning we considered

  9. The DANTE Boltzmann transport solver: An unstructured mesh, 3-D, spherical harmonics algorithm compatible with parallel computer architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McGhee, J.M.; Roberts, R.M.; Morel, J.E.

    1997-06-01

    A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner formore » scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.« less

  10. Using Perturbed QR Factorizations To Solve Linear Least-Squares Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Avron, Haim; Ng, Esmond G.; Toledo, Sivan

    2008-03-21

    We propose and analyze a new tool to help solve sparse linear least-squares problems min{sub x} {parallel}Ax-b{parallel}{sub 2}. Our method is based on a sparse QR factorization of a low-rank perturbation {cflx A} of A. More precisely, we show that the R factor of {cflx A} is an effective preconditioner for the least-squares problem min{sub x} {parallel}Ax-b{parallel}{sub 2}, when solved using LSQR. We propose applications for the new technique. When A is rank deficient we can add rows to ensure that the preconditioner is well-conditioned without column pivoting. When A is sparse except for a few dense rows we canmore » drop these dense rows from A to obtain {cflx A}. Another application is solving an updated or downdated problem. If R is a good preconditioner for the original problem A, it is a good preconditioner for the updated/downdated problem {cflx A}. We can also solve what-if scenarios, where we want to find the solution if a column of the original matrix is changed/removed. We present a spectral theory that analyzes the generalized spectrum of the pencil (A*A,R*R) and analyze the applications.« less

  11. Parallel Simulation of Three-Dimensional Free Surface Fluid Flow Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    BAER,THOMAS A.; SACKINGER,PHILIP A.; SUBIA,SAMUEL R.

    1999-10-14

    Simulation of viscous three-dimensional fluid flow typically involves a large number of unknowns. When free surfaces are included, the number of unknowns increases dramatically. Consequently, this class of problem is an obvious application of parallel high performance computing. We describe parallel computation of viscous, incompressible, free surface, Newtonian fluid flow problems that include dynamic contact fines. The Galerkin finite element method was used to discretize the fully-coupled governing conservation equations and a ''pseudo-solid'' mesh mapping approach was used to determine the shape of the free surface. In this approach, the finite element mesh is allowed to deform to satisfy quasi-staticmore » solid mechanics equations subject to geometric or kinematic constraints on the boundaries. As a result, nodal displacements must be included in the set of unknowns. Other issues discussed are the proper constraints appearing along the dynamic contact line in three dimensions. Issues affecting efficient parallel simulations include problem decomposition to equally distribute computational work among a SPMD computer and determination of robust, scalable preconditioners for the distributed matrix systems that must be solved. Solution continuation strategies important for serial simulations have an enhanced relevance in a parallel coquting environment due to the difficulty of solving large scale systems. Parallel computations will be demonstrated on an example taken from the coating flow industry: flow in the vicinity of a slot coater edge. This is a three dimensional free surface problem possessing a contact line that advances at the web speed in one region but transitions to static behavior in another region. As such, a significant fraction of the computational time is devoted to processing boundary data. Discussion focuses on parallel speed ups for fixed problem size, a class of problems of immediate practical importance.« less

  12. Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel Implicit CFD

    NASA Technical Reports Server (NTRS)

    Gropp, W. D.; Keyes, D. E.; McInnes, L. C.; Tidriri, M. D.

    1998-01-01

    Implicit solution methods are important in applications modeled by PDEs with disparate temporal and spatial scales. Because such applications require high resolution with reasonable turnaround, "routine" parallelization is essential. The pseudo-transient matrix-free Newton-Krylov-Schwarz (Psi-NKS) algorithmic framework is presented as an answer. We show that, for the classical problem of three-dimensional transonic Euler flow about an M6 wing, Psi-NKS can simultaneously deliver: globalized, asymptotically rapid convergence through adaptive pseudo- transient continuation and Newton's method-, reasonable parallelizability for an implicit method through deferred synchronization and favorable communication-to-computation scaling in the Krylov linear solver; and high per- processor performance through attention to distributed memory and cache locality, especially through the Schwarz preconditioner. Two discouraging features of Psi-NKS methods are their sensitivity to the coding of the underlying PDE discretization and the large number of parameters that must be selected to govern convergence. We therefore distill several recommendations from our experience and from our reading of the literature on various algorithmic components of Psi-NKS, and we describe a freely available, MPI-based portable parallel software implementation of the solver employed here.

  13. A communication library for the parallelization of air quality models on structured grids

    NASA Astrophysics Data System (ADS)

    Miehe, Philipp; Sandu, Adrian; Carmichael, Gregory R.; Tang, Youhua; Dăescu, Dacian

    PAQMSG is an MPI-based, Fortran 90 communication library for the parallelization of air quality models (AQMs) on structured grids. It consists of distribution, gathering and repartitioning routines for different domain decompositions implementing a master-worker strategy. The library is architecture and application independent and includes optimization strategies for different architectures. This paper presents the library from a user perspective. Results are shown from the parallelization of STEM-III on Beowulf clusters. The PAQMSG library is available on the web. The communication routines are easy to use, and should allow for an immediate parallelization of existing AQMs. PAQMSG can also be used for constructing new models.

  14. Efficient parallel implicit methods for rotary-wing aerodynamics calculations

    NASA Astrophysics Data System (ADS)

    Wissink, Andrew M.

    Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good

  15. Parallel Performance of Linear Solvers and Preconditioners

    DTIC Science & Technology

    2014-01-01

    are produced by a discrete dislocation dynamics ( DDD ) simulation and change with each timestep of the DDD simulation as the dislocation structure...evolves. However, the coefficient—or stiffness matrix— remains constant during the DDD simulation and some expensive matrix factorizations only occur once...discrete dislocation dynamics ( DDD ) simulations. This can be achieved by coupling a DDD simulator for bulk material (Arsenlis et al., 2007) to a

  16. Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

    NASA Astrophysics Data System (ADS)

    Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

    2013-08-01

    Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time ti (trajectory positions and velocities xi = (ri, vi)) to time ti + 1 (xi + 1) by xi + 1 = fi(xi), the dynamics problem spanning an interval from t0…tM can be transformed into a root finding problem, F(X) = [xi - f(x(i - 1)]i = 1, M = 0, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution time/parallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing

  17. Parallel numerical modeling of hybrid-dimensional compositional non-isothermal Darcy flows in fractured porous media

    NASA Astrophysics Data System (ADS)

    Xing, F.; Masson, R.; Lopez, S.

    2017-09-01

    This paper introduces a new discrete fracture model accounting for non-isothermal compositional multiphase Darcy flows and complex networks of fractures with intersecting, immersed and non-immersed fractures. The so called hybrid-dimensional model using a 2D model in the fractures coupled with a 3D model in the matrix is first derived rigorously starting from the equi-dimensional matrix fracture model. Then, it is discretized using a fully implicit time integration combined with the Vertex Approximate Gradient (VAG) finite volume scheme which is adapted to polyhedral meshes and anisotropic heterogeneous media. The fully coupled systems are assembled and solved in parallel using the Single Program Multiple Data (SPMD) paradigm with one layer of ghost cells. This strategy allows for a local assembly of the discrete systems. An efficient preconditioner is implemented to solve the linear systems at each time step and each Newton type iteration of the simulation. The numerical efficiency of our approach is assessed on different meshes, fracture networks, and physical settings in terms of parallel scalability, nonlinear convergence and linear convergence.

  18. Wide-angle full-vector beam propagation method based on an alternating direction implicit preconditioner

    NASA Astrophysics Data System (ADS)

    Chui, Siu Lit; Lu, Ya Yan

    2004-03-01

    Wide-angle full-vector beam propagation methods (BPMs) for three-dimensional wave-guiding structures can be derived on the basis of rational approximants of a square root operator or its exponential (i.e., the one-way propagator). While the less accurate BPM based on the slowly varying envelope approximation can be efficiently solved by the alternating direction implicit (ADI) method, the wide-angle variants involve linear systems that are more difficult to handle. We present an efficient solver for these linear systems that is based on a Krylov subspace method with an ADI preconditioner. The resulting wide-angle full-vector BPM is used to simulate the propagation of wave fields in a Y branch and a taper.

  19. Wide-angle full-vector beam propagation method based on an alternating direction implicit preconditioner.

    PubMed

    Chui, Siu Lit; Lu, Ya Yan

    2004-03-01

    Wide-angle full-vector beam propagation methods (BPMs) for three-dimensional wave-guiding structures can be derived on the basis of rational approximants of a square root operator or its exponential (i.e., the one-way propagator). While the less accurate BPM based on the slowly varying envelope approximation can be efficiently solved by the alternating direction implicit (ADI) method, the wide-angle variants involve linear systems that are more difficult to handle. We present an efficient solver for these linear systems that is based on a Krylov subspace method with an ADI preconditioner. The resulting wide-angle full-vector BPM is used to simulate the propagation of wave fields in a Y branch and a taper.

  20. Parallel iterative solution for h and p approximations of the shallow water equations

    USGS Publications Warehouse

    Barragy, E.J.; Walters, R.A.

    1998-01-01

    A p finite element scheme and parallel iterative solver are introduced for a modified form of the shallow water equations. The governing equations are the three-dimensional shallow water equations. After a harmonic decomposition in time and rearrangement, the resulting equations are a complex Helmholz problem for surface elevation, and a complex momentum equation for the horizontal velocity. Both equations are nonlinear and the resulting system is solved using the Picard iteration combined with a preconditioned biconjugate gradient (PBCG) method for the linearized subproblems. A subdomain-based parallel preconditioner is developed which uses incomplete LU factorization with thresholding (ILUT) methods within subdomains, overlapping ILUT factorizations for subdomain boundaries and under-relaxed iteration for the resulting block system. The method builds on techniques successfully applied to linear elements by introducing ordering and condensation techniques to handle uniform p refinement. The combined methods show good performance for a range of p (element order), h (element size), and N (number of processors). Performance and scalability results are presented for a field scale problem where up to 512 processors are used. ?? 1998 Elsevier Science Ltd. All rights reserved.

  1. A parallel multi-domain solution methodology applied to nonlinear thermal transport problems in nuclear fuel pins

    DOE PAGES

    Philip, Bobby; Berrill, Mark A.; Allu, Srikanth; ...

    2015-01-26

    We describe an efficient and nonlinearly consistent parallel solution methodology for solving coupled nonlinear thermal transport problems that occur in nuclear reactor applications over hundreds of individual 3D physical subdomains. Efficiency is obtained by leveraging knowledge of the physical domains, the physics on individual domains, and the couplings between them for preconditioning within a Jacobian Free Newton Krylov method. Details of the computational infrastructure that enabled this work, namely the open source Advanced Multi-Physics (AMP) package developed by the authors are described. The details of verification and validation experiments, and parallel performance analysis in weak and strong scaling studies demonstratingmore » the achieved efficiency of the algorithm are presented. Moreover, numerical experiments demonstrate that the preconditioner developed is independent of the number of fuel subdomains in a fuel rod, which is particularly important when simulating different types of fuel rods. Finally, we demonstrate the power of the coupling methodology by considering problems with couplings between surface and volume physics and coupling of nonlinear thermal transport in fuel rods to an external radiation transport code.« less

  2. Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations.

    PubMed

    Bylaska, Eric J; Weare, Jonathan Q; Weare, John H

    2013-08-21

    Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time ti (trajectory positions and velocities xi = (ri, vi)) to time ti + 1 (xi + 1) by xi + 1 = fi(xi), the dynamics problem spanning an interval from t0[ellipsis (horizontal)]tM can be transformed into a root finding problem, F(X) = [xi - f(x(i - 1)]i = 1, M = 0, for the trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H2O AIMD simulation at the MP2 level. The maximum speedup (serial execution/timeparallel execution time) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a

  3. Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J.; Weare, Jonathan Q.; Weare, John H.

    2013-08-21

    Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f , (e.g. Verlet algorithm) is available to propagate the system from time ti (trajectory positions and velocities xi = (ri; vi)) to time ti+1 (xi+1) by xi+1 = fi(xi), the dynamics problem spanning an interval from t0 : : : tM can be transformed into a root finding problem, F(X) = [xi - f (x(i-1)]i=1;M = 0, for the trajectory variables. The root finding problem is solved using amore » variety of optimization techniques, including quasi-Newton and preconditioned quasi-Newton optimization schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed and the effectiveness of various approaches to solving the root finding problem are tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl+4H2O AIMD simulation at the MP2 level. The maximum speedup obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations the algorithms achieved speedups of up to 14.3. The parallel in time algorithms can be implemented in a distributed computing environment using very slow TCP/IP networks

  4. Extending molecular simulation time scales: Parallel in time integrations for high-level quantum chemistry and complex force representations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bylaska, Eric J., E-mail: Eric.Bylaska@pnnl.gov; Weare, Jonathan Q., E-mail: weare@uchicago.edu; Weare, John H., E-mail: jweare@ucsd.edu

    2013-08-21

    Parallel in time simulation algorithms are presented and applied to conventional molecular dynamics (MD) and ab initio molecular dynamics (AIMD) models of realistic complexity. Assuming that a forward time integrator, f (e.g., Verlet algorithm), is available to propagate the system from time t{sub i} (trajectory positions and velocities x{sub i} = (r{sub i}, v{sub i})) to time t{sub i+1} (x{sub i+1}) by x{sub i+1} = f{sub i}(x{sub i}), the dynamics problem spanning an interval from t{sub 0}…t{sub M} can be transformed into a root finding problem, F(X) = [x{sub i} − f(x{sub (i−1})]{sub i} {sub =1,M} = 0, for themore » trajectory variables. The root finding problem is solved using a variety of root finding techniques, including quasi-Newton and preconditioned quasi-Newton schemes that are all unconditionally convergent. The algorithms are parallelized by assigning a processor to each time-step entry in the columns of F(X). The relation of this approach to other recently proposed parallel in time methods is discussed, and the effectiveness of various approaches to solving the root finding problem is tested. We demonstrate that more efficient dynamical models based on simplified interactions or coarsening time-steps provide preconditioners for the root finding problem. However, for MD and AIMD simulations, such preconditioners are not required to obtain reasonable convergence and their cost must be considered in the performance of the algorithm. The parallel in time algorithms developed are tested by applying them to MD and AIMD simulations of size and complexity similar to those encountered in present day applications. These include a 1000 Si atom MD simulation using Stillinger-Weber potentials, and a HCl + 4H{sub 2}O AIMD simulation at the MP2 level. The maximum speedup ((serial execution time)/(parallel execution time) ) obtained by parallelizing the Stillinger-Weber MD simulation was nearly 3.0. For the AIMD MP2 simulations, the algorithms achieved speedups

  5. Nonlinear and parallel algorithms for finite element discretizations of the incompressible Navier-Stokes equations

    NASA Astrophysics Data System (ADS)

    Arteaga, Santiago Egido

    1998-12-01

    linearization strategies considered and whose computational cost is negligible. The algebraic properties of these systems depend on both the discretization and nonlinear method used. We study in detail the positive definiteness and skewsymmetry of the advection submatrices (essentially, convection-diffusion problems). We propose a discretization based on a new trilinear form for Newton's method. We solve the linear systems using three Krylov subspace methods, GMRES, QMR and TFQMR, and compare the advantages of each. Our emphasis is on parallel algorithms, and so we consider preconditioners suitable for parallel computers such as line variants of the Jacobi and Gauss- Seidel methods, alternating direction implicit methods, and Chebyshev and least squares polynomial preconditioners. These work well for moderate viscosities (moderate Reynolds number). For small viscosities we show that effective parallel solution of the advection subproblem is a critical factor to improve performance. Implementation details on a CM-5 are presented.

  6. Nonlinear study of the parallel velocity/tearing instability using an implicit, nonlinear resistive MHD solver

    NASA Astrophysics Data System (ADS)

    Chacon, L.; Finn, J. M.; Knoll, D. A.

    2000-10-01

    Recently, a new parallel velocity instability has been found.(J. M. Finn, Phys. Plasmas), 2, 12 (1995) This mode is a tearing mode driven unstable by curvature effects and sound wave coupling in the presence of parallel velocity shear. Under such conditions, linear theory predicts that tearing instabilities will grow even in situations in which the classical tearing mode is stable. This could then be a viable seed mechanism for the neoclassical tearing mode, and hence a non-linear study is of interest. Here, the linear and non-linear stages of this instability are explored using a fully implicit, fully nonlinear 2D reduced resistive MHD code,(L. Chacon et al), ``Implicit, Jacobian-free Newton-Krylov 2D reduced resistive MHD nonlinear solver,'' submitted to J. Comput. Phys. (2000) including viscosity and particle transport effects. The nonlinear implicit time integration is performed using the Newton-Raphson iterative algorithm. Krylov iterative techniques are employed for the required algebraic matrix inversions, implemented Jacobian-free (i.e., without ever forming and storing the Jacobian matrix), and preconditioned with a ``physics-based'' preconditioner. Nonlinear results indicate that, for large total plasma beta and large parallel velocity shear, the instability results in the generation of large poloidal shear flows and large magnetic islands even in regimes when the classical tearing mode is absolutely stable. For small viscosity, the time asymptotic state can be turbulent.

  7. PRIM: An Efficient Preconditioning Iterative Reweighted Least Squares Method for Parallel Brain MRI Reconstruction.

    PubMed

    Xu, Zheng; Wang, Sheng; Li, Yeqing; Zhu, Feiyun; Huang, Junzhou

    2018-02-08

    The most recent history of parallel Magnetic Resonance Imaging (pMRI) has in large part been devoted to finding ways to reduce acquisition time. While joint total variation (JTV) regularized model has been demonstrated as a powerful tool in increasing sampling speed for pMRI, however, the major bottleneck is the inefficiency of the optimization method. While all present state-of-the-art optimizations for the JTV model could only reach a sublinear convergence rate, in this paper, we squeeze the performance by proposing a linear-convergent optimization method for the JTV model. The proposed method is based on the Iterative Reweighted Least Squares algorithm. Due to the complexity of the tangled JTV objective, we design a novel preconditioner to further accelerate the proposed method. Extensive experiments demonstrate the superior performance of the proposed algorithm for pMRI regarding both accuracy and efficiency compared with state-of-the-art methods.

  8. A new preconditioner update strategy for the solution of sequences of linear systems in structural mechanics: application to saddle point problems in elasticity

    NASA Astrophysics Data System (ADS)

    Mercier, Sylvain; Gratton, Serge; Tardieu, Nicolas; Vasseur, Xavier

    2017-12-01

    Many applications in structural mechanics require the numerical solution of sequences of linear systems typically issued from a finite element discretization of the governing equations on fine meshes. The method of Lagrange multipliers is often used to take into account mechanical constraints. The resulting matrices then exhibit a saddle point structure and the iterative solution of such preconditioned linear systems is considered as challenging. A popular strategy is then to combine preconditioning and deflation to yield an efficient method. We propose an alternative that is applicable to the general case and not only to matrices with a saddle point structure. In this approach, we consider to update an existing algebraic or application-based preconditioner, using specific available information exploiting the knowledge of an approximate invariant subspace or of matrix-vector products. The resulting preconditioner has the form of a limited memory quasi-Newton matrix and requires a small number of linearly independent vectors. Numerical experiments performed on three large-scale applications in elasticity highlight the relevance of the new approach. We show that the proposed method outperforms the deflation method when considering sequences of linear systems with varying matrices.

  9. Distributed Memory Parallel Computing with SEAWAT

    NASA Astrophysics Data System (ADS)

    Verkaik, J.; Huizer, S.; van Engelen, J.; Oude Essink, G.; Ram, R.; Vuik, K.

    2017-12-01

    Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallel computing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model ( 10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources

  10. A Polynomial-Based Nonlinear Least Squares Optimized Preconditioner for Continuous and Discontinuous Element-Based Discretizations of the Euler Equations

    DTIC Science & Technology

    2014-01-01

    system (here using left- preconditioning ) (KÃ)x = Kb̃, (3.1) where K is a low-order polynomial in à given by K = s(Ã) = m∑ i=0 kià i, (3.2) and has a... system with a complex spectrum, region E in the complex plane must be some convex form (e.g., an ellipse or polygon) that approximately encloses the...preconditioners with p = 2 and p = 20 on the spectrum of the preconditioned system matrices Kà and KH̃ for both CG Schur-complement form and DG form cases

  11. A Kronecker product splitting preconditioner for two-dimensional space-fractional diffusion equations

    NASA Astrophysics Data System (ADS)

    Chen, Hao; Lv, Wen; Zhang, Tongtong

    2018-05-01

    We study preconditioned iterative methods for the linear system arising in the numerical discretization of a two-dimensional space-fractional diffusion equation. Our approach is based on a formulation of the discrete problem that is shown to be the sum of two Kronecker products. By making use of an alternating Kronecker product splitting iteration technique we establish a class of fixed-point iteration methods. Theoretical analysis shows that the new method converges to the unique solution of the linear system. Moreover, the optimal choice of the involved iteration parameters and the corresponding asymptotic convergence rate are computed exactly when the eigenvalues of the system matrix are all real. The basic iteration is accelerated by a Krylov subspace method like GMRES. The corresponding preconditioner is in a form of a Kronecker product structure and requires at each iteration the solution of a set of discrete one-dimensional fractional diffusion equations. We use structure preserving approximations to the discrete one-dimensional fractional diffusion operators in the action of the preconditioning matrix. Numerical examples are presented to illustrate the effectiveness of this approach.

  12. An Implicit Solver on A Parallel Block-Structured Adaptive Mesh Grid for FLASH

    NASA Astrophysics Data System (ADS)

    Lee, D.; Gopal, S.; Mohapatra, P.

    2012-07-01

    We introduce a fully implicit solver for FLASH based on a Jacobian-Free Newton-Krylov (JFNK) approach with an appropriate preconditioner. The main goal of developing this JFNK-type implicit solver is to provide efficient high-order numerical algorithms and methodology for simulating stiff systems of differential equations on large-scale parallel computer architectures. A large number of natural problems in nonlinear physics involve a wide range of spatial and time scales of interest. A system that encompasses such a wide magnitude of scales is described as "stiff." A stiff system can arise in many different fields of physics, including fluid dynamics/aerodynamics, laboratory/space plasma physics, low Mach number flows, reactive flows, radiation hydrodynamics, and geophysical flows. One of the big challenges in solving such a stiff system using current-day computational resources lies in resolving time and length scales varying by several orders of magnitude. We introduce FLASH's preliminary implementation of a time-accurate JFNK-based implicit solver in the framework of FLASH's unsplit hydro solver.

  13. On conjugate gradient type methods and polynomial preconditioners for a class of complex non-Hermitian matrices

    NASA Technical Reports Server (NTRS)

    Freund, Roland

    1988-01-01

    Conjugate gradient type methods are considered for the solution of large linear systems Ax = b with complex coefficient matrices of the type A = T + i(sigma)I where T is Hermitian and sigma, a real scalar. Three different conjugate gradient type approaches with iterates defined by a minimal residual property, a Galerkin type condition, and an Euclidian error minimization, respectively, are investigated. In particular, numerically stable implementations based on the ideas behind Paige and Saunder's SYMMLQ and MINRES for real symmetric matrices are proposed. Error bounds for all three methods are derived. It is shown how the special shift structure of A can be preserved by using polynomial preconditioning. Results on the optimal choice of the polynomial preconditioner are given. Also, some numerical experiments for matrices arising from finite difference approximations to the complex Helmholtz equation are reported.

  14. Mesh quality oriented 3D geometric vascular modeling based on parallel transport frame.

    PubMed

    Guo, Jixiang; Li, Shun; Chui, Yim Pan; Qin, Jing; Heng, Pheng Ann

    2013-08-01

    While a number of methods have been proposed to reconstruct geometrically and topologically accurate 3D vascular models from medical images, little attention has been paid to constantly maintain high mesh quality of these models during the reconstruction procedure, which is essential for many subsequent applications such as simulation-based surgical training and planning. We propose a set of methods to bridge this gap based on parallel transport frame. An improved bifurcation modeling method and two novel trifurcation modeling methods are developed based on 3D Bézier curve segments in order to ensure the continuous surface transition at furcations. In addition, a frame blending scheme is implemented to solve the twisting problem caused by frame mismatch of two successive furcations. A curvature based adaptive sampling scheme combined with a mesh quality guided frame tilting algorithm is developed to construct an evenly distributed, non-concave and self-intersection free surface mesh for vessels with distinct radius and high curvature. Extensive experiments demonstrate that our methodology can generate vascular models with better mesh quality than previous methods in terms of surface mesh quality criteria. Copyright © 2013 Elsevier Ltd. All rights reserved.

  15. Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime.

    PubMed

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

    2012-06-01

    We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.

  16. Improved L-BFGS diagonal preconditioners for a large-scale 4D-Var inversion system: application to CO2 flux constraints and analysis error calculation

    NASA Astrophysics Data System (ADS)

    Bousserez, Nicolas; Henze, Daven; Bowman, Kevin; Liu, Junjie; Jones, Dylan; Keller, Martin; Deng, Feng

    2013-04-01

    This work presents improved analysis error estimates for 4D-Var systems. From operational NWP models to top-down constraints on trace gas emissions, many of today's data assimilation and inversion systems in atmospheric science rely on variational approaches. This success is due to both the mathematical clarity of these formulations and the availability of computationally efficient minimization algorithms. However, unlike Kalman Filter-based algorithms, these methods do not provide an estimate of the analysis or forecast error covariance matrices, these error statistics being propagated only implicitly by the system. From both a practical (cycling assimilation) and scientific perspective, assessing uncertainties in the solution of the variational problem is critical. For large-scale linear systems, deterministic or randomization approaches can be considered based on the equivalence between the inverse Hessian of the cost function and the covariance matrix of analysis error. For perfectly quadratic systems, like incremental 4D-Var, Lanczos/Conjugate-Gradient algorithms have proven to be most efficient in generating low-rank approximations of the Hessian matrix during the minimization. For weakly non-linear systems though, the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), a quasi-Newton descent algorithm, is usually considered the best method for the minimization. Suitable for large-scale optimization, this method allows one to generate an approximation to the inverse Hessian using the latest m vector/gradient pairs generated during the minimization, m depending upon the available core memory. At each iteration, an initial low-rank approximation to the inverse Hessian has to be provided, which is called preconditioning. The ability of the preconditioner to retain useful information from previous iterations largely determines the efficiency of the algorithm. Here we assess the performance of different preconditioners to estimate the inverse Hessian of a

  17. Parallel Reconstruction Using Null Operations (PRUNO)

    PubMed Central

    Zhang, Jian; Liu, Chunlei; Moseley, Michael E.

    2011-01-01

    A novel iterative k-space data-driven technique, namely Parallel Reconstruction Using Null Operations (PRUNO), is presented for parallel imaging reconstruction. In PRUNO, both data calibration and image reconstruction are formulated into linear algebra problems based on a generalized system model. An optimal data calibration strategy is demonstrated by using Singular Value Decomposition (SVD). And an iterative conjugate- gradient approach is proposed to efficiently solve missing k-space samples during reconstruction. With its generalized formulation and precise mathematical model, PRUNO reconstruction yields good accuracy, flexibility, stability. Both computer simulation and in vivo studies have shown that PRUNO produces much better reconstruction quality than autocalibrating partially parallel acquisition (GRAPPA), especially under high accelerating rates. With the aid of PRUO reconstruction, ultra high accelerating parallel imaging can be performed with decent image quality. For example, we have done successful PRUNO reconstruction at a reduction factor of 6 (effective factor of 4.44) with 8 coils and only a few autocalibration signal (ACS) lines. PMID:21604290

  18. Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

    PubMed Central

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

    2012-01-01

    We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529

  19. Domain decomposition algorithms and computation fluid dynamics

    NASA Technical Reports Server (NTRS)

    Chan, Tony F.

    1988-01-01

    In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.

  20. Spectral analysis and multigrid preconditioners for two-dimensional space-fractional diffusion equations

    NASA Astrophysics Data System (ADS)

    Moghaderi, Hamid; Dehghan, Mehdi; Donatelli, Marco; Mazza, Mariarosa

    2017-12-01

    Fractional diffusion equations (FDEs) are a mathematical tool used for describing some special diffusion phenomena arising in many different applications like porous media and computational finance. In this paper, we focus on a two-dimensional space-FDE problem discretized by means of a second order finite difference scheme obtained as combination of the Crank-Nicolson scheme and the so-called weighted and shifted Grünwald formula. By fully exploiting the Toeplitz-like structure of the resulting linear system, we provide a detailed spectral analysis of the coefficient matrix at each time step, both in the case of constant and variable diffusion coefficients. Such a spectral analysis has a very crucial role, since it can be used for designing fast and robust iterative solvers. In particular, we employ the obtained spectral information to define a Galerkin multigrid method based on the classical linear interpolation as grid transfer operator and damped-Jacobi as smoother, and to prove the linear convergence rate of the corresponding two-grid method. The theoretical analysis suggests that the proposed grid transfer operator is strong enough for working also with the V-cycle method and the geometric multigrid. On this basis, we introduce two computationally favourable variants of the proposed multigrid method and we use them as preconditioners for Krylov methods. Several numerical results confirm that the resulting preconditioning strategies still keep a linear convergence rate.

  1. First Applications of the New Parallel Krylov Solver for MODFLOW on a National and Global Scale

    NASA Astrophysics Data System (ADS)

    Verkaik, J.; Hughes, J. D.; Sutanudjaja, E.; van Walsum, P.

    2016-12-01

    Integrated high-resolution hydrologic models are increasingly being used for evaluating water management measures at field scale. Their drawbacks are large memory requirements and long run times. Examples of such models are The Netherlands Hydrological Instrument (NHI) model and the PCRaster Global Water Balance (PCR-GLOBWB) model. Typical simulation periods are 30-100 years with daily timesteps. The NHI model predicts water demands in periods of drought, supporting operational and long-term water-supply decisions. The NHI is a state-of-the-art coupling of several models: a 7-layer MODFLOW groundwater model ( 6.5M 250m cells), a MetaSWAP model for the unsaturated zone (Richards emulator of 0.5M cells), and a surface water model (MOZART-DM). The PCR-GLOBWB model provides a grid-based representation of global terrestrial hydrology and this work uses the version that includes a 2-layer MODFLOW groundwater model ( 4.5M 10km cells). The Parallel Krylov Solver (PKS) speeds up computation by both distributed memory parallelization (Message Passing Interface) and shared memory parallelization (Open Multi-Processing). PKS includes conjugate gradient, bi-conjugate gradient stabilized, and generalized minimal residual linear accelerators that use an overlapping additive Schwarz domain decomposition preconditioner. PKS can be used for both structured and unstructured grids and has been fully integrated in MODFLOW-USG using METIS partitioning and in iMODFLOW using RCB partitioning. iMODFLOW is an accelerated version of MODFLOW-2005 that is implicitly and online coupled to MetaSWAP. Results for benchmarks carried out on the Cartesius Dutch supercomputer (https://userinfo.surfsara.nl/systems/cartesius) for the PCRGLOB-WB model and on a 2x16 core Windows machine for the NHI model show speedups up to 10-20 and 5-10, respectively.

  2. Parallel pivoting combined with parallel reduction

    NASA Technical Reports Server (NTRS)

    Alaghband, Gita

    1987-01-01

    Parallel algorithms for triangularization of large, sparse, and unsymmetric matrices are presented. The method combines the parallel reduction with a new parallel pivoting technique, control over generations of fill-ins and a check for numerical stability, all done in parallel with the work being distributed over the active processes. The parallel technique uses the compatibility relation between pivots to identify parallel pivot candidates and uses the Markowitz number of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds.

  3. Parallel MR imaging: a user's guide.

    PubMed

    Glockner, James F; Hu, Houchun H; Stanley, David W; Angelos, Lisa; King, Kevin

    2005-01-01

    Parallel imaging is a recently developed family of techniques that take advantage of the spatial information inherent in phased-array radiofrequency coils to reduce acquisition times in magnetic resonance imaging. In parallel imaging, the number of sampled k-space lines is reduced, often by a factor of two or greater, thereby significantly shortening the acquisition time. Parallel imaging techniques have only recently become commercially available, and the wide range of clinical applications is just beginning to be explored. The potential clinical applications primarily involve reduction in acquisition time, improved spatial resolution, or a combination of the two. Improvements in image quality can be achieved by reducing the echo train lengths of fast spin-echo and single-shot fast spin-echo sequences. Parallel imaging is particularly attractive for cardiac and vascular applications and will likely prove valuable as 3-T body and cardiovascular imaging becomes part of standard clinical practice. Limitations of parallel imaging include reduced signal-to-noise ratio and reconstruction artifacts. It is important to consider these limitations when deciding when to use these techniques. (c) RSNA, 2005.

  4. 3-D modeling of ductile tearing using finite elements: Computational aspects and techniques

    NASA Astrophysics Data System (ADS)

    Gullerud, Arne Stewart

    elements (same type, constitutive model, etc.) provides fine-grain parallel computation on each processor. A major focus of the LPCG solver is a new implementation of the Hughes-Winget element-by-element (HW) preconditioner. The implementation employs a weighted dependency graph combined with a new coloring algorithm to provide load-balanced scheduling for the preconditioner and overlapped communication/computation. This approach enables efficient parallel application of the HW preconditioner for arbitrary unstructured meshes.

  5. Inspection criteria ensure quality control of parallel gap soldering

    NASA Technical Reports Server (NTRS)

    Burka, J. A.

    1968-01-01

    Investigation of parallel gap soldering of electrical leads resulted in recommendation on material preparation, equipment, process control, and visual inspection criteria to ensure reliable solder joints. The recommendations will minimize problems in heat-dwell time, amount of solder, bridging conductors, and damage of circuitry.

  6. Fully anisotropic 3-D EM modelling on a Lebedev grid with a multigrid pre-conditioner

    NASA Astrophysics Data System (ADS)

    Jaysaval, Piyoosh; Shantsev, Daniil V.; de la Kethulle de Ryhove, Sébastien; Bratteland, Tarjei

    2016-12-01

    We present a numerical algorithm for 3-D electromagnetic (EM) simulations in conducting media with general electric anisotropy. The algorithm is based on the finite-difference discretization of frequency-domain Maxwell's equations on a Lebedev grid, in which all components of the electric field are collocated but half a spatial step staggered with respect to the magnetic field components, which also are collocated. This leads to a system of linear equations that is solved using a stabilized biconjugate gradient method with a multigrid preconditioner. We validate the accuracy of the numerical results for layered and 3-D tilted transverse isotropic (TTI) earth models representing typical scenarios used in the marine controlled-source EM method. It is then demonstrated that not taking into account the full anisotropy of the conductivity tensor can lead to misleading inversion results. For synthetic data corresponding to a 3-D model with a TTI anticlinal structure, a standard vertical transverse isotropic (VTI) inversion is not able to image a resistor, while for a 3-D model with a TTI synclinal structure it produces a false resistive anomaly. However, if the VTI forward solver used in the inversion is replaced by the proposed TTI solver with perfect knowledge of the strike and dip of the dipping structures, the resulting resistivity images become consistent with the true models.

  7. Taste quality decoding parallels taste sensations.

    PubMed

    Crouzet, Sébastien M; Busch, Niko A; Ohla, Kathrin

    2015-03-30

    In most species, the sense of taste is key in the distinction of potentially nutritious and harmful food constituents and thereby in the acceptance (or rejection) of food. Taste quality is encoded by specialized receptors on the tongue, which detect chemicals corresponding to each of the basic tastes (sweet, salty, sour, bitter, and savory [1]), before taste quality information is transmitted via segregated neuronal fibers [2], distributed coding across neuronal fibers [3], or dynamic firing patterns [4] to the gustatory cortex in the insula. In rodents, both hardwired coding by labeled lines [2] and flexible, learning-dependent representations [5] and broadly tuned neurons [6] seem to coexist. It is currently unknown how, when, and where taste quality representations are established in the cortex and whether these representations are used for perceptual decisions. Here, we show that neuronal response patterns allow to decode which of four tastants (salty, sweet, sour, and bitter) participants tasted in a given trial by using time-resolved multivariate pattern analyses of large-scale electrophysiological brain responses. The onset of this prediction coincided with the earliest taste-evoked responses originating from the insula and opercular cortices, indicating that quality is among the first attributes of a taste represented in the central gustatory system. These response patterns correlated with perceptual decisions of taste quality: tastes that participants discriminated less accurately also evoked less discriminated brain response patterns. The results therefore provide the first evidence for a link between taste-related decision-making and the predictive value of these brain response patterns. Copyright © 2015 Elsevier Ltd. All rights reserved.

  8. Large-scale 3D geoelectromagnetic modeling using parallel adaptive high-order finite element method

    DOE PAGES

    Grayver, Alexander V.; Kolev, Tzanio V.

    2015-11-01

    Here, we have investigated the use of the adaptive high-order finite-element method (FEM) for geoelectromagnetic modeling. Because high-order FEM is challenging from the numerical and computational points of view, most published finite-element studies in geoelectromagnetics use the lowest order formulation. Solution of the resulting large system of linear equations poses the main practical challenge. We have developed a fully parallel and distributed robust and scalable linear solver based on the optimal block-diagonal and auxiliary space preconditioners. The solver was found to be efficient for high finite element orders, unstructured and nonconforming locally refined meshes, a wide range of frequencies, largemore » conductivity contrasts, and number of degrees of freedom (DoFs). Furthermore, the presented linear solver is in essence algebraic; i.e., it acts on the matrix-vector level and thus requires no information about the discretization, boundary conditions, or physical source used, making it readily efficient for a wide range of electromagnetic modeling problems. To get accurate solutions at reduced computational cost, we have also implemented goal-oriented adaptive mesh refinement. The numerical tests indicated that if highly accurate modeling results were required, the high-order FEM in combination with the goal-oriented local mesh refinement required less computational time and DoFs than the lowest order adaptive FEM.« less

  9. Large-scale 3D geoelectromagnetic modeling using parallel adaptive high-order finite element method

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grayver, Alexander V.; Kolev, Tzanio V.

    Here, we have investigated the use of the adaptive high-order finite-element method (FEM) for geoelectromagnetic modeling. Because high-order FEM is challenging from the numerical and computational points of view, most published finite-element studies in geoelectromagnetics use the lowest order formulation. Solution of the resulting large system of linear equations poses the main practical challenge. We have developed a fully parallel and distributed robust and scalable linear solver based on the optimal block-diagonal and auxiliary space preconditioners. The solver was found to be efficient for high finite element orders, unstructured and nonconforming locally refined meshes, a wide range of frequencies, largemore » conductivity contrasts, and number of degrees of freedom (DoFs). Furthermore, the presented linear solver is in essence algebraic; i.e., it acts on the matrix-vector level and thus requires no information about the discretization, boundary conditions, or physical source used, making it readily efficient for a wide range of electromagnetic modeling problems. To get accurate solutions at reduced computational cost, we have also implemented goal-oriented adaptive mesh refinement. The numerical tests indicated that if highly accurate modeling results were required, the high-order FEM in combination with the goal-oriented local mesh refinement required less computational time and DoFs than the lowest order adaptive FEM.« less

  10. INVITED TOPICAL REVIEW: Parallel magnetic resonance imaging

    NASA Astrophysics Data System (ADS)

    Larkman, David J.; Nunes, Rita G.

    2007-04-01

    Parallel imaging has been the single biggest innovation in magnetic resonance imaging in the last decade. The use of multiple receiver coils to augment the time consuming Fourier encoding has reduced acquisition times significantly. This increase in speed comes at a time when other approaches to acquisition time reduction were reaching engineering and human limits. A brief summary of spatial encoding in MRI is followed by an introduction to the problem parallel imaging is designed to solve. There are a large number of parallel reconstruction algorithms; this article reviews a cross-section, SENSE, SMASH, g-SMASH and GRAPPA, selected to demonstrate the different approaches. Theoretical (the g-factor) and practical (coil design) limits to acquisition speed are reviewed. The practical implementation of parallel imaging is also discussed, in particular coil calibration. How to recognize potential failure modes and their associated artefacts are shown. Well-established applications including angiography, cardiac imaging and applications using echo planar imaging are reviewed and we discuss what makes a good application for parallel imaging. Finally, active research areas where parallel imaging is being used to improve data quality by repairing artefacted images are also reviewed.

  11. Scalable Parallel Density-based Clustering and Applications

    NASA Astrophysics Data System (ADS)

    Patwary, Mostofa Ali

    2014-04-01

    Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.

  12. A Numerical Study of Scalable Cardiac Electro-Mechanical Solvers on HPC Architectures

    PubMed Central

    Colli Franzone, Piero; Pavarino, Luca F.; Scacchi, Simone

    2018-01-01

    We introduce and study some scalable domain decomposition preconditioners for cardiac electro-mechanical 3D simulations on parallel HPC (High Performance Computing) architectures. The electro-mechanical model of the cardiac tissue is composed of four coupled sub-models: (1) the static finite elasticity equations for the transversely isotropic deformation of the cardiac tissue; (2) the active tension model describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; (3) the anisotropic Bidomain model describing the evolution of the intra- and extra-cellular potentials in the deforming cardiac tissue; and (4) the ionic membrane model describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels. This strongly coupled electro-mechanical model is discretized in time with a splitting semi-implicit technique and in space with isoparametric finite elements. The resulting scalable parallel solver is based on Multilevel Additive Schwarz preconditioners for the solution of the Bidomain system and on BDDC preconditioned Newton-Krylov solvers for the non-linear finite elasticity system. The results of several 3D parallel simulations show the scalability of both linear and non-linear solvers and their application to the study of both physiological excitation-contraction cardiac dynamics and re-entrant waves in the presence of different mechano-electrical feedbacks. PMID:29674971

  13. Solution of the within-group multidimensional discrete ordinates transport equations on massively parallel architectures

    NASA Astrophysics Data System (ADS)

    Zerr, Robert Joseph

    2011-12-01

    thousands of processors. The PGS method does outperform SI DSA for the periodic heterogeneous layers (PHL) configuration problems. Although this demonstrates a relative strength/weakness between the two methods, the practicality of these problems is much less, further limiting instances where it would be beneficial to select ITMM over SI DSA. The results strongly indicate a need for a robust, stable, and efficient acceleration method (or preconditioner for PGMRES). The spatial multigrid (SMG) method is currently incomplete in that it does not work for all cases considered and does not effectively improve the convergence rate for all values of scattering ratio c or cell dimension h. Nevertheless, it does display the desired trend for highly scattering, optically thin problems. That is, it tends to lower the rate of growth of number of iterations with increasing number of processes, P, while not increasing the number of additional operations per iteration to the extent that the total execution time of the rapidly converging accelerated iterations exceeds that of the slower unaccelerated iterations. A predictive parallel performance model has been developed for the PBJ method. Timing tests were performed such that trend lines could be fitted to the data for the different components and used to estimate the execution times. Applied to the weak scaling results, the model notably underestimates construction time, but combined with a slight overestimation in iterative solution time, the model predicts total execution time very well for large P. It also does a decent job with the strong scaling results, closely predicting the construction time and time per iteration, especially as P increases. Although not shown to be competitive up to 1,024 processing elements with the current state of the art, the parallelized ITMM exhibits promising scaling trends. Ultimately, compared to the KBA method, the parallelized ITMM may be found to be a very attractive option for transport calculations

  14. Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Brouwer, Randall Jay

    1991-01-01

    The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.

  15. S-HARP: A parallel dynamic spectral partitioner

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sohn, A.; Simon, H.

    1998-01-01

    Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. The authors present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP, fine grain loop level parallelism and coarse grain recursive parallelism. The parallel partitioner has been implemented in Messagemore » Passing Interface on Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64 processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15 fold speedup on 64 processors while ParaMeTiS1.0 gives a few fold speedup. Experimental results demonstrate that S-HARP is three to 10 times faster than the dynamic partitioners ParaMeTiS and Jostle on six computational meshes of size over 100,000 vertices.« less

  16. Xyce parallel electronic simulator design.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thornquist, Heidi K.; Rankin, Eric Lamont; Mei, Ting

    2010-09-01

    This document is the Xyce Circuit Simulator developer guide. Xyce has been designed from the 'ground up' to be a SPICE-compatible, distributed memory parallel circuit simulator. While it is in many respects a research code, Xyce is intended to be a production simulator. As such, having software quality engineering (SQE) procedures in place to insure a high level of code quality and robustness are essential. Version control, issue tracking customer support, C++ style guildlines and the Xyce release process are all described. The Xyce Parallel Electronic Simulator has been under development at Sandia since 1999. Historically, Xyce has mostly beenmore » funded by ASC, the original focus of Xyce development has primarily been related to circuits for nuclear weapons. However, this has not been the only focus and it is expected that the project will diversify. Like many ASC projects, Xyce is a group development effort, which involves a number of researchers, engineers, scientists, mathmaticians and computer scientists. In addition to diversity of background, it is to be expected on long term projects for there to be a certain amount of staff turnover, as people move on to different projects. As a result, it is very important that the project maintain high software quality standards. The point of this document is to formally document a number of the software quality practices followed by the Xyce team in one place. Also, it is hoped that this document will be a good source of information for new developers.« less

  17. P-HARP: A parallel dynamic spectral partitioner

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sohn, A.; Biswas, R.; Simon, H.D.

    1997-05-01

    Partitioning unstructured graphs is central to the parallel solution of problems in computational science and engineering. The authors have introduced earlier the sequential version of an inertial spectral partitioner called HARP which maintains the quality of recursive spectral bisection (RSB) while forming the partitions an order of magnitude faster than RSB. The serial HARP is known to be the fastest spectral partitioner to date, three to four times faster than similar partitioners on a variety of meshes. This paper presents a parallel version of HARP, called P-HARP. Two types of parallelism have been exploited: loop level parallelism and recursive parallelism.more » P-HARP has been implemented in MPI on the SGI/Cray T3E and the IBM SP2. Experimental results demonstrate that P-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.25 seconds on a 64-processor T3E. Experimental results further show that P-HARP can give nearly a 20-fold speedup on 64 processors. These results indicate that graph partitioning is no longer a major bottleneck that hinders the advancement of computational science and engineering for dynamically-changing real-world applications.« less

  18. Limited angle tomographic breast imaging: A comparison of parallel beam and pinhole collimation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wessell, D.E.; Kadrmas, D.J.; Frey, E.C.

    1996-12-31

    Results from clinical trials have suggested no improvement in lesion detection with parallel hole SPECT scintimammography (SM) with Tc-99m over parallel hole planar SM. In this initial investigation, we have elucidated some of the unique requirements of SPECT SM. With these requirements in mind, we have begun to develop practical data acquisition and reconstruction strategies that can reduce image artifacts and improve image quality. In this paper we investigate limited angle orbits for both parallel hole and pinhole SPECT SM. Singular Value Decomposition (SVD) is used to analyze the artifacts associated with the limited angle orbits. Maximum likelihood expectation maximizationmore » (MLEM) reconstructions are then used to examine the effects of attenuation compensation on the quality of the reconstructed image. All simulations are performed using the 3D-MCAT breast phantom. The results of these simulation studies demonstrate that limited angle SPECT SM is feasible, that attenuation correction is needed for accurate reconstructions, and that pinhole SPECT SM may have an advantage over parallel hole SPECT SM in terms of improved image quality and reduced image artifacts.« less

  19. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  20. A nonrecursive order N preconditioned conjugate gradient: Range space formulation of MDOF dynamics

    NASA Technical Reports Server (NTRS)

    Kurdila, Andrew J.

    1990-01-01

    While excellent progress has been made in deriving algorithms that are efficient for certain combinations of system topologies and concurrent multiprocessing hardware, several issues must be resolved to incorporate transient simulation in the control design process for large space structures. Specifically, strategies must be developed that are applicable to systems with numerous degrees of freedom. In addition, the algorithms must have a growth potential in that they must also be amenable to implementation on forthcoming parallel system architectures. For mechanical system simulation, this fact implies that algorithms are required that induce parallelism on a fine scale, suitable for the emerging class of highly parallel processors; and transient simulation methods must be automatically load balancing for a wider collection of system topologies and hardware configurations. These problems are addressed by employing a combination range space/preconditioned conjugate gradient formulation of multi-degree-of-freedom dynamics. The method described has several advantages. In a sequential computing environment, the method has the features that: by employing regular ordering of the system connectivity graph, an extremely efficient preconditioner can be derived from the 'range space metric', as opposed to the system coefficient matrix; because of the effectiveness of the preconditioner, preliminary studies indicate that the method can achieve performance rates that depend linearly upon the number of substructures, hence the title 'Order N'; and the method is non-assembling. Furthermore, the approach is promising as a potential parallel processing algorithm in that the method exhibits a fine parallel granularity suitable for a wide collection of combinations of physical system topologies/computer architectures; and the method is easily load balanced among processors, and does not rely upon system topology to induce parallelism.

  1. Angular Multigrid Preconditioner for Krylov-Based Solution Techniques Applied to the Sn Equations with Highly Forward-Peaked Scattering

    NASA Astrophysics Data System (ADS)

    Turcksin, Bruno; Ragusa, Jean C.; Morel, Jim E.

    2012-01-01

    It is well known that the diffusion synthetic acceleration (DSA) methods for the Sn equations become ineffective in the Fokker-Planck forward-peaked scattering limit. In response to this deficiency, Morel and Manteuffel (1991) developed an angular multigrid method for the 1-D Sn equations. This method is very effective, costing roughly twice as much as DSA per source iteration, and yielding a maximum spectral radius of approximately 0.6 in the Fokker-Planck limit. Pautz, Adams, and Morel (PAM) (1999) later generalized the angular multigrid to 2-D, but it was found that the method was unstable with sufficiently forward-peaked mappings between the angular grids. The method was stabilized via a filtering technique based on diffusion operators, but this filtering also degraded the effectiveness of the overall scheme. The spectral radius was not bounded away from unity in the Fokker-Planck limit, although the method remained more effective than DSA. The purpose of this article is to recast the multidimensional PAM angular multigrid method without the filtering as an Sn preconditioner and use it in conjunction with the Generalized Minimal RESidual (GMRES) Krylov method. The approach ensures stability and our computational results demonstrate that it is also significantly more efficient than an analogous DSA-preconditioned Krylov method.

  2. Self-balanced modulation and magnetic rebalancing method for parallel multilevel inverters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Hui; Shi, Yanjun

    A self-balanced modulation method and a closed-loop magnetic flux rebalancing control method for parallel multilevel inverters. The combination of the two methods provides for balancing of the magnetic flux of the inter-cell transformers (ICTs) of the parallel multilevel inverters without deteriorating the quality of the output voltage. In various embodiments a parallel multi-level inverter modulator is provide including a multi-channel comparator to generate a multiplexed digitized ideal waveform for a parallel multi-level inverter and a finite state machine (FSM) module coupled to the parallel multi-channel comparator, the FSM module to receive the multiplexed digitized ideal waveform and to generate amore » pulse width modulated gate-drive signal for each switching device of the parallel multi-level inverter. The system and method provides for optimization of the output voltage spectrum without influence the magnetic balancing.« less

  3. Drug innovation, price controls, and parallel trade.

    PubMed

    Matteucci, Giorgio; Reverberi, Pierfrancesco

    2016-12-21

    We study the long-run welfare effects of parallel trade (PT) in pharmaceuticals. We develop a two-country model of PT with endogenous quality, where the pharmaceutical firm negotiates the price of the drug with the government in the foreign country. We show that, even though the foreign government does not consider global R&D costs, (the threat of) PT improves the quality of the drug as long as the foreign consumers' valuation of quality is high enough. We find that the firm's short-run profit may be higher when PT is allowed. Nonetheless, this is neither necessary nor sufficient for improving drug quality in the long run. We also show that improving drug quality is a sufficient condition for PT to increase global welfare. Finally, we show that, when PT is allowed, drug quality may be higher with than without price controls.

  4. A Robust and Scalable Software Library for Parallel Adaptive Refinement on Unstructured Meshes

    NASA Technical Reports Server (NTRS)

    Lou, John Z.; Norton, Charles D.; Cwik, Thomas A.

    1999-01-01

    The design and implementation of Pyramid, a software library for performing parallel adaptive mesh refinement (PAMR) on unstructured meshes, is described. This software library can be easily used in a variety of unstructured parallel computational applications, including parallel finite element, parallel finite volume, and parallel visualization applications using triangular or tetrahedral meshes. The library contains a suite of well-designed and efficiently implemented modules that perform operations in a typical PAMR process. Among these are mesh quality control during successive parallel adaptive refinement (typically guided by a local-error estimator), parallel load-balancing, and parallel mesh partitioning using the ParMeTiS partitioner. The Pyramid library is implemented in Fortran 90 with an interface to the Message-Passing Interface (MPI) library, supporting code efficiency, modularity, and portability. An EM waveguide filter application, adaptively refined using the Pyramid library, is illustrated.

  5. Condition Number Estimation of Preconditioned Matrices

    PubMed Central

    Kushida, Noriyuki

    2015-01-01

    The present paper introduces a condition number estimation method for preconditioned matrices. The newly developed method provides reasonable results, while the conventional method which is based on the Lanczos connection gives meaningless results. The Lanczos connection based method provides the condition numbers of coefficient matrices of systems of linear equations with information obtained through the preconditioned conjugate gradient method. Estimating the condition number of preconditioned matrices is sometimes important when describing the effectiveness of new preconditionerers or selecting adequate preconditioners. Operating a preconditioner on a coefficient matrix is the simplest method of estimation. However, this is not possible for large-scale computing, especially if computation is performed on distributed memory parallel computers. This is because, the preconditioned matrices become dense, even if the original matrices are sparse. Although the Lanczos connection method can be used to calculate the condition number of preconditioned matrices, it is not considered to be applicable to large-scale problems because of its weakness with respect to numerical errors. Therefore, we have developed a robust and parallelizable method based on Hager’s method. The feasibility studies are curried out for the diagonal scaling preconditioner and the SSOR preconditioner with a diagonal matrix, a tri-daigonal matrix and Pei’s matrix. As a result, the Lanczos connection method contains around 10% error in the results even with a simple problem. On the other hand, the new method contains negligible errors. In addition, the newly developed method returns reasonable solutions when the Lanczos connection method fails with Pei’s matrix, and matrices generated with the finite element method. PMID:25816331

  6. Condition number estimation of preconditioned matrices.

    PubMed

    Kushida, Noriyuki

    2015-01-01

    The present paper introduces a condition number estimation method for preconditioned matrices. The newly developed method provides reasonable results, while the conventional method which is based on the Lanczos connection gives meaningless results. The Lanczos connection based method provides the condition numbers of coefficient matrices of systems of linear equations with information obtained through the preconditioned conjugate gradient method. Estimating the condition number of preconditioned matrices is sometimes important when describing the effectiveness of new preconditionerers or selecting adequate preconditioners. Operating a preconditioner on a coefficient matrix is the simplest method of estimation. However, this is not possible for large-scale computing, especially if computation is performed on distributed memory parallel computers. This is because, the preconditioned matrices become dense, even if the original matrices are sparse. Although the Lanczos connection method can be used to calculate the condition number of preconditioned matrices, it is not considered to be applicable to large-scale problems because of its weakness with respect to numerical errors. Therefore, we have developed a robust and parallelizable method based on Hager's method. The feasibility studies are curried out for the diagonal scaling preconditioner and the SSOR preconditioner with a diagonal matrix, a tri-daigonal matrix and Pei's matrix. As a result, the Lanczos connection method contains around 10% error in the results even with a simple problem. On the other hand, the new method contains negligible errors. In addition, the newly developed method returns reasonable solutions when the Lanczos connection method fails with Pei's matrix, and matrices generated with the finite element method.

  7. An accurate, fast, and scalable solver for high-frequency wave propagation

    NASA Astrophysics Data System (ADS)

    Zepeda-Núñez, L.; Taus, M.; Hewett, R.; Demanet, L.

    2017-12-01

    In many science and engineering applications, solving time-harmonic high-frequency wave propagation problems quickly and accurately is of paramount importance. For example, in geophysics, particularly in oil exploration, such problems can be the forward problem in an iterative process for solving the inverse problem of subsurface inversion. It is important to solve these wave propagation problems accurately in order to efficiently obtain meaningful solutions of the inverse problems: low order forward modeling can hinder convergence. Additionally, due to the volume of data and the iterative nature of most optimization algorithms, the forward problem must be solved many times. Therefore, a fast solver is necessary to make solving the inverse problem feasible. For time-harmonic high-frequency wave propagation, obtaining both speed and accuracy is historically challenging. Recently, there have been many advances in the development of fast solvers for such problems, including methods which have linear complexity with respect to the number of degrees of freedom. While most methods scale optimally only in the context of low-order discretizations and smooth wave speed distributions, the method of polarized traces has been shown to retain optimal scaling for high-order discretizations, such as hybridizable discontinuous Galerkin methods and for highly heterogeneous (and even discontinuous) wave speeds. The resulting fast and accurate solver is consequently highly attractive for geophysical applications. To date, this method relies on a layered domain decomposition together with a preconditioner applied in a sweeping fashion, which has limited straight-forward parallelization. In this work, we introduce a new version of the method of polarized traces which reveals more parallel structure than previous versions while preserving all of its other advantages. We achieve this by further decomposing each layer and applying the preconditioner to these new components separately and

  8. Some thoughts about parallel process and psychotherapy supervision: when is a parallel just a parallel?

    PubMed

    Watkins, C Edward

    2012-09-01

    In a way not done before, Tracey, Bludworth, and Glidden-Tracey ("Are there parallel processes in psychotherapy supervision: An empirical examination," Psychotherapy, 2011, advance online publication, doi.10.1037/a0026246) have shown us that parallel process in psychotherapy supervision can indeed be rigorously and meaningfully researched, and their groundbreaking investigation provides a nice prototype for future supervision studies to emulate. In what follows, I offer a brief complementary comment to Tracey et al., addressing one matter that seems to be a potentially important conceptual and empirical parallel process consideration: When is a parallel just a parallel? PsycINFO Database Record (c) 2012 APA, all rights reserved.

  9. Use of general purpose graphics processing units with MODFLOW

    USGS Publications Warehouse

    Hughes, Joseph D.; White, Jeremy T.

    2013-01-01

    To evaluate the use of general-purpose graphics processing units (GPGPUs) to improve the performance of MODFLOW, an unstructured preconditioned conjugate gradient (UPCG) solver has been developed. The UPCG solver uses a compressed sparse row storage scheme and includes Jacobi, zero fill-in incomplete, and modified-incomplete lower-upper (LU) factorization, and generalized least-squares polynomial preconditioners. The UPCG solver also includes options for sequential and parallel solution on the central processing unit (CPU) using OpenMP. For simulations utilizing the GPGPU, all basic linear algebra operations are performed on the GPGPU; memory copies between the central processing unit CPU and GPCPU occur prior to the first iteration of the UPCG solver and after satisfying head and flow criteria or exceeding a maximum number of iterations. The efficiency of the UPCG solver for GPGPU and CPU solutions is benchmarked using simulations of a synthetic, heterogeneous unconfined aquifer with tens of thousands to millions of active grid cells. Testing indicates GPGPU speedups on the order of 2 to 8, relative to the standard MODFLOW preconditioned conjugate gradient (PCG) solver, can be achieved when (1) memory copies between the CPU and GPGPU are optimized, (2) the percentage of time performing memory copies between the CPU and GPGPU is small relative to the calculation time, (3) high-performance GPGPU cards are utilized, and (4) CPU-GPGPU combinations are used to execute sequential operations that are difficult to parallelize. Furthermore, UPCG solver testing indicates GPGPU speedups exceed parallel CPU speedups achieved using OpenMP on multicore CPUs for preconditioners that can be easily parallelized.

  10. Parallel rendering

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1995-01-01

    This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.

  11. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran

    We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by everymore » processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.« less

  12. A nonrecursive 'Order N' preconditioned conjugate gradient/range space formulation of MDOF dynamics

    NASA Technical Reports Server (NTRS)

    Kurdila, A. J.; Menon, R.; Sunkel, John

    1991-01-01

    This paper addresses the requirements of present-day mechanical system simulations of algorithms that induce parallelism on a fine scale and of transient simulation methods which must be automatically load balancing for a wide collection of system topologies and hardware configurations. To this end, a combination range space/preconditioned conjugage gradient formulation of multidegree-of-freedon dynamics is developed, which, by employing regular ordering of the system connectivity graph, makes it possible to derive an extremely efficient preconditioner from the range space metric (as opposed to the system coefficient matrix). Because of the effectiveness of the preconditioner, the method can achieve performance rates that depend linearly on the number of substructures. The method, termed 'Order N' does not require the assembly of system mass or stiffness matrices, and is therefore amenable to implementation on work stations. Using this method, a 13-substructure model of the Space Station was constructed.

  13. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan

    1994-01-01

    A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also be used to run several benchmark parallel knowledge bases such as one to set up a cafeteria. Results show from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear in some cases, are possible.

  14. Parallelization of fine-scale computation in Agile Multiscale Modelling Methodology

    NASA Astrophysics Data System (ADS)

    Macioł, Piotr; Michalik, Kazimierz

    2016-10-01

    Nowadays, multiscale modelling of material behavior is an extensively developed area. An important obstacle against its wide application is high computational demands. Among others, the parallelization of multiscale computations is a promising solution. Heterogeneous multiscale models are good candidates for parallelization, since communication between sub-models is limited. In this paper, the possibility of parallelization of multiscale models based on Agile Multiscale Methodology framework is discussed. A sequential, FEM based macroscopic model has been combined with concurrently computed fine-scale models, employing a MatCalc thermodynamic simulator. The main issues, being investigated in this work are: (i) the speed-up of multiscale models with special focus on fine-scale computations and (ii) on decreasing the quality of computations enforced by parallel execution. Speed-up has been evaluated on the basis of Amdahl's law equations. The problem of `delay error', rising from the parallel execution of fine scale sub-models, controlled by the sequential macroscopic sub-model is discussed. Some technical aspects of combining third-party commercial modelling software with an in-house multiscale framework and a MPI library are also discussed.

  15. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    NASA Astrophysics Data System (ADS)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking

  16. IOPA: I/O-aware parallelism adaption for parallel programs

    PubMed Central

    Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei

    2017-01-01

    With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads. PMID:28278236

  17. IOPA: I/O-aware parallelism adaption for parallel programs.

    PubMed

    Liu, Tao; Liu, Yi; Qian, Chen; Qian, Depei

    2017-01-01

    With the development of multi-/many-core processors, applications need to be written as parallel programs to improve execution efficiency. For data-intensive applications that use multiple threads to read/write files simultaneously, an I/O sub-system can easily become a bottleneck when too many of these types of threads exist; on the contrary, too few threads will cause insufficient resource utilization and hurt performance. Therefore, programmers must pay much attention to parallelism control to find the appropriate number of I/O threads for an application. This paper proposes a parallelism control mechanism named IOPA that can adjust the parallelism of applications to adapt to the I/O capability of a system and balance computing resources and I/O bandwidth. The programming interface of IOPA is also provided to programmers to simplify parallel programming. IOPA is evaluated using multiple applications with both solid state and hard disk drives. The results show that the parallel applications using IOPA can achieve higher efficiency than those with a fixed number of threads.

  18. High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. Research Report. ETS RR-16-34

    ERIC Educational Resources Information Center

    von Davier, Matthias

    2016-01-01

    This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…

  19. Random number generators for large-scale parallel Monte Carlo simulations on FPGA

    NASA Astrophysics Data System (ADS)

    Lin, Y.; Wang, F.; Liu, B.

    2018-05-01

    Through parallelization, field programmable gate array (FPGA) can achieve unprecedented speeds in large-scale parallel Monte Carlo (LPMC) simulations. FPGA presents both new constraints and new opportunities for the implementations of random number generators (RNGs), which are key elements of any Monte Carlo (MC) simulation system. Using empirical and application based tests, this study evaluates all of the four RNGs used in previous FPGA based MC studies and newly proposed FPGA implementations for two well-known high-quality RNGs that are suitable for LPMC studies on FPGA. One of the newly proposed FPGA implementations: a parallel version of additive lagged Fibonacci generator (Parallel ALFG) is found to be the best among the evaluated RNGs in fulfilling the needs of LPMC simulations on FPGA.

  20. Synthesizing parallel imaging applications using the CAP (computer-aided parallelization) tool

    NASA Astrophysics Data System (ADS)

    Gennart, Benoit A.; Mazzariol, Marc; Messerli, Vincent; Hersch, Roger D.

    1997-12-01

    Imaging applications such as filtering, image transforms and compression/decompression require vast amounts of computing power when applied to large data sets. These applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Furthermore, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs may not be portable from one parallel architecture to the other, and performance often comes short of expectations. In order to facilitate the development of parallel applications, we propose the CAP computer-aided parallelization tool which enables application programmers to specify at a high-level of abstraction the flow of data between pipelined-parallel operations. In addition, the CAP tool supports the programmer in developing parallel imaging and storage operations. CAP enables combining efficiently parallel storage access routines and image processing sequential operations. This paper shows how processing and I/O intensive imaging applications must be implemented to take advantage of parallelism and pipelining between data access and processing. This paper's contribution is (1) to show how such implementations can be compactly specified in CAP, and (2) to demonstrate that CAP specified applications achieve the performance of custom parallel code. The paper analyzes theoretically the performance of CAP specified applications and demonstrates the accuracy of the theoretical analysis through experimental measurements.

  1. Parallel and Distributed Systems for Probabilistic Reasoning

    DTIC Science & Technology

    2012-12-01

    work at CMU I had the opportunity to work with Andreas Krause on Gaussian process models for signal quality estimation in wireless sensor networks ...we reviewed the natural parallelization of the belief propagation algorithm using the synchronous schedule and demonstrated both theoretically and...problem is that the power-law sparsity structure, commonly found in graphs derived from natural phenomena (e.g., social networks and the web

  2. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  3. Efficient Preconditioning for the p-Version Finite Element Method in Two Dimensions

    DTIC Science & Technology

    1989-10-01

    paper, we study fast parallel preconditioners for systems of equations arising from the p-version finite element method. The p-version finite element...computations and the solution of a relatively small global auxiliary problem. We study two different methods. In the first (Section 3), the global...20], will be studied in the next section. Problem (3.12) is obviously much more easily solved than the original problem ,nd the procedure is highly

  4. Parallel integer sorting with medium and fine-scale parallelism

    NASA Technical Reports Server (NTRS)

    Dagum, Leonardo

    1993-01-01

    Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.

  5. Improving quality of arterial spin labeling MR imaging at 3 Tesla with a 32-channel coil and parallel imaging.

    PubMed

    Ferré, Jean-Christophe; Petr, Jan; Bannier, Elise; Barillot, Christian; Gauvrit, Jean-Yves

    2012-05-01

    To compare 12-channel and 32-channel phased-array coils and to determine the optimal parallel imaging (PI) technique and factor for brain perfusion imaging using Pulsed Arterial Spin labeling (PASL) at 3 Tesla (T). Twenty-seven healthy volunteers underwent 10 different PASL perfusion PICORE Q2TIPS scans at 3T using 12-channel and 32-channel coils without PI and with GRAPPA or mSENSE using factor 2. PI with factor 3 and 4 were used only with the 32-channel coil. Visual quality was assessed using four parameters. Quantitative analyses were performed using temporal noise, contrast-to-noise and signal-to-noise ratios (CNR, SNR). Compared with 12-channel acquisition, the scores for 32-channel acquisition were significantly higher for overall visual quality, lower for noise and higher for SNR and CNR. With the 32-channel coil, artifact compromise achieved the best score with PI factor 2. Noise increased, SNR and CNR decreased with PI factor. However mSENSE 2 scores were not always significantly different from acquisition without PI. For PASL at 3T, the 32-channel coil at 3T provided better quality than the 12-channel coil. With the 32-channel coil, mSENSE 2 seemed to offer the best compromise for decreasing artifacts without significantly reducing SNR, CNR. Copyright © 2012 Wiley Periodicals, Inc.

  6. Bilingual parallel programming

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Foster, I.; Overbeek, R.

    1990-01-01

    Numerous experiments have demonstrated that computationally intensive algorithms support adequate parallelism to exploit the potential of large parallel machines. Yet successful parallel implementations of serious applications are rare. The limiting factor is clearly programming technology. None of the approaches to parallel programming that have been proposed to date -- whether parallelizing compilers, language extensions, or new concurrent languages -- seem to adequately address the central problems of portability, expressiveness, efficiency, and compatibility with existing software. In this paper, we advocate an alternative approach to parallel programming based on what we call bilingual programming. We present evidence that this approach providesmore » and effective solution to parallel programming problems. The key idea in bilingual programming is to construct the upper levels of applications in a high-level language while coding selected low-level components in low-level languages. This approach permits the advantages of a high-level notation (expressiveness, elegance, conciseness) to be obtained without the cost in performance normally associated with high-level approaches. In addition, it provides a natural framework for reusing existing code.« less

  7. A Comparison of Solver Performance for Complex Gastric Electrophysiology Models

    PubMed Central

    Sathar, Shameer; Cheng, Leo K.; Trew, Mark L.

    2016-01-01

    Computational techniques for solving systems of equations arising in gastric electrophysiology have not been studied for efficient solution process. We present a computationally challenging problem of simulating gastric electrophysiology in anatomically realistic stomach geometries with multiple intracellular and extracellular domains. The multiscale nature of the problem and mesh resolution required to capture geometric and functional features necessitates efficient solution methods if the problem is to be tractable. In this study, we investigated and compared several parallel preconditioners for the linear systems arising from tetrahedral discretisation of electrically isotropic and anisotropic problems, with and without stimuli. The results showed that the isotropic problem was computationally less challenging than the anisotropic problem and that the application of extracellular stimuli increased workload considerably. Preconditioning based on block Jacobi and algebraic multigrid solvers were found to have the best overall solution times and least iteration counts, respectively. The algebraic multigrid preconditioner would be expected to perform better on large problems. PMID:26736543

  8. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth

  9. Design of k-Space Channel Combination Kernels and Integration with Parallel Imaging

    PubMed Central

    Beatty, Philip J.; Chang, Shaorong; Holmes, James H.; Wang, Kang; Brau, Anja C. S.; Reeder, Scott B.; Brittain, Jean H.

    2014-01-01

    Purpose In this work, a new method is described for producing local k-space channel combination kernels using a small amount of low-resolution multichannel calibration data. Additionally, this work describes how these channel combination kernels can be combined with local k-space unaliasing kernels produced by the calibration phase of parallel imaging methods such as GRAPPA, PARS and ARC. Methods Experiments were conducted to evaluate both the image quality and computational efficiency of the proposed method compared to a channel-by-channel parallel imaging approach with image-space sum-of-squares channel combination. Results Results indicate comparable image quality overall, with some very minor differences seen in reduced field-of-view imaging. It was demonstrated that this method enables a speed up in computation time on the order of 3–16X for 32-channel data sets. Conclusion The proposed method enables high quality channel combination to occur earlier in the reconstruction pipeline, reducing computational and memory requirements for image reconstruction. PMID:23943602

  10. Parallel simulation today

    NASA Technical Reports Server (NTRS)

    Nicol, David; Fujimoto, Richard

    1992-01-01

    This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.

  11. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.

  12. Performance of a parallel thermal-hydraulics code TEMPEST

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fann, G.I.; Trent, D.S.

    The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scalingmore » performance of 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) are presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on an 128 processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.« less

  13. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael E; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Endpoint-based parallel data processing in a parallel active messaging interface ('PAMI') of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective opeartion through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  14. Endpoint-based parallel data processing in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E.

    2014-08-12

    Endpoint-based parallel data processing in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks.

  15. A parallel electrostatic Particle-in-Cell method on unstructured tetrahedral grids for large-scale bounded collisionless plasma simulations

    NASA Astrophysics Data System (ADS)

    Averkin, Sergey N.; Gatsonis, Nikolaos A.

    2018-06-01

    An unstructured electrostatic Particle-In-Cell (EUPIC) method is developed on arbitrary tetrahedral grids for simulation of plasmas bounded by arbitrary geometries. The electric potential in EUPIC is obtained on cell vertices from a finite volume Multi-Point Flux Approximation of Gauss' law using the indirect dual cell with Dirichlet, Neumann and external circuit boundary conditions. The resulting matrix equation for the nodal potential is solved with a restarted generalized minimal residual method (GMRES) and an ILU(0) preconditioner algorithm, parallelized using a combination of node coloring and level scheduling approaches. The electric field on vertices is obtained using the gradient theorem applied to the indirect dual cell. The algorithms for injection, particle loading, particle motion, and particle tracking are parallelized for unstructured tetrahedral grids. The algorithms for the potential solver, electric field evaluation, loading, scatter-gather algorithms are verified using analytic solutions for test cases subject to Laplace and Poisson equations. Grid sensitivity analysis examines the L2 and L∞ norms of the relative error in potential, field, and charge density as a function of edge-averaged and volume-averaged cell size. Analysis shows second order of convergence for the potential and first order of convergence for the electric field and charge density. Temporal sensitivity analysis is performed and the momentum and energy conservation properties of the particle integrators in EUPIC are examined. The effects of cell size and timestep on heating, slowing-down and the deflection times are quantified. The heating, slowing-down and the deflection times are found to be almost linearly dependent on number of particles per cell. EUPIC simulations of current collection by cylindrical Langmuir probes in collisionless plasmas show good comparison with previous experimentally validated numerical results. These simulations were also used in a parallelization

  16. Parallel discrete-event simulation of FCFS stochastic queueing networks

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Physical systems are inherently parallel. Intuition suggests that simulations of these systems may be amenable to parallel execution. The parallel execution of a discrete-event simulation requires careful synchronization of processes in order to ensure the execution's correctness; this synchronization can degrade performance. Largely negative results were recently reported in a study which used a well-known synchronization method on queueing network simulations. Discussed here is a synchronization method (appointments), which has proven itself to be effective on simulations of FCFS queueing networks. The key concept behind appointments is the provision of lookahead. Lookahead is a prediction on a processor's future behavior, based on an analysis of the processor's simulation state. It is shown how lookahead can be computed for FCFS queueing network simulations, give performance data that demonstrates the method's effectiveness under moderate to heavy loads, and discuss performance tradeoffs between the quality of lookahead, and the cost of computing lookahead.

  17. The effect of earthquake on architecture geometry with non-parallel system irregularity configuration

    NASA Astrophysics Data System (ADS)

    Teddy, Livian; Hardiman, Gagoek; Nuroji; Tudjono, Sri

    2017-12-01

    Indonesia is an area prone to earthquake that may cause casualties and damage to buildings. The fatalities or the injured are not largely caused by the earthquake, but by building collapse. The collapse of the building is resulted from the building behaviour against the earthquake, and it depends on many factors, such as architectural design, geometry configuration of structural elements in horizontal and vertical plans, earthquake zone, geographical location (distance to earthquake center), soil type, material quality, and construction quality. One of the geometry configurations that may lead to the collapse of the building is irregular configuration of non-parallel system. In accordance with FEMA-451B, irregular configuration in non-parallel system is defined to have existed if the vertical lateral force-retaining elements are neither parallel nor symmetric with main orthogonal axes of the earthquake-retaining axis system. Such configuration may lead to torque, diagonal translation and local damage to buildings. It does not mean that non-parallel irregular configuration should not be formed on architectural design; however the designer must know the consequence of earthquake behaviour against buildings with irregular configuration of non-parallel system. The present research has the objective to identify earthquake behaviour in architectural geometry with irregular configuration of non-parallel system. The present research was quantitative with simulation experimental method. It consisted of 5 models, where architectural data and model structure data were inputted and analyzed using the software SAP2000 in order to find out its performance, and ETAB2015 to determine the eccentricity occurred. The output of the software analysis was tabulated, graphed, compared and analyzed with relevant theories. For areas of strong earthquake zones, avoid designing buildings which wholly form irregular configuration of non-parallel system. If it is inevitable to design a

  18. A template-based approach for parallel hexahedral two-refinement

    DOE PAGES

    Owen, Steven J.; Shih, Ryan M.; Ernst, Corey D.

    2016-10-17

    Here, we provide a template-based approach for generating locally refined all-hex meshes. We focus specifically on refinement of initially structured grids utilizing a 2-refinement approach where uniformly refined hexes are subdivided into eight child elements. The refinement algorithm consists of identifying marked nodes that are used as the basis for a set of four simple refinement templates. The target application for 2-refinement is a parallel grid-based all-hex meshing tool for high performance computing in a distributed environment. The result is a parallel consistent locally refined mesh requiring minimal communication and where minimum mesh quality is greater than scaled Jacobian 0.3more » prior to smoothing.« less

  19. A template-based approach for parallel hexahedral two-refinement

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Owen, Steven J.; Shih, Ryan M.; Ernst, Corey D.

    Here, we provide a template-based approach for generating locally refined all-hex meshes. We focus specifically on refinement of initially structured grids utilizing a 2-refinement approach where uniformly refined hexes are subdivided into eight child elements. The refinement algorithm consists of identifying marked nodes that are used as the basis for a set of four simple refinement templates. The target application for 2-refinement is a parallel grid-based all-hex meshing tool for high performance computing in a distributed environment. The result is a parallel consistent locally refined mesh requiring minimal communication and where minimum mesh quality is greater than scaled Jacobian 0.3more » prior to smoothing.« less

  20. Albany/FELIX: A parallel, scalable and robust, finite element, first-order Stokes approximation ice sheet solver built for advanced analysis

    DOE PAGES

    Tezaur, I. K.; Perego, M.; Salinger, A. G.; ...

    2015-04-27

    This paper describes a new parallel, scalable and robust finite element based solver for the first-order Stokes momentum balance equations for ice flow. The solver, known as Albany/FELIX, is constructed using the component-based approach to building application codes, in which mature, modular libraries developed as a part of the Trilinos project are combined using abstract interfaces and template-based generic programming, resulting in a final code with access to dozens of algorithmic and advanced analysis capabilities. Following an overview of the relevant partial differential equations and boundary conditions, the numerical methods chosen to discretize the ice flow equations are described, alongmore » with their implementation. The results of several verification studies of the model accuracy are presented using (1) new test cases for simplified two-dimensional (2-D) versions of the governing equations derived using the method of manufactured solutions, and (2) canonical ice sheet modeling benchmarks. Model accuracy and convergence with respect to mesh resolution are then studied on problems involving a realistic Greenland ice sheet geometry discretized using hexahedral and tetrahedral meshes. Also explored as a part of this study is the effect of vertical mesh resolution on the solution accuracy and solver performance. The robustness and scalability of our solver on these problems is demonstrated. Lastly, we show that good scalability can be achieved by preconditioning the iterative linear solver using a new algebraic multilevel preconditioner, constructed based on the idea of semi-coarsening.« less

  1. Bit-parallel arithmetic in a massively-parallel associative processor

    NASA Technical Reports Server (NTRS)

    Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

    1992-01-01

    A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m exp 2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.

  2. Optimizing a realistic large-scale frequency assignment problem using a new parallel evolutionary approach

    NASA Astrophysics Data System (ADS)

    Chaves-González, José M.; Vega-Rodríguez, Miguel A.; Gómez-Pulido, Juan A.; Sánchez-Pérez, Juan M.

    2011-08-01

    This article analyses the use of a novel parallel evolutionary strategy to solve complex optimization problems. The work developed here has been focused on a relevant real-world problem from the telecommunication domain to verify the effectiveness of the approach. The problem, known as frequency assignment problem (FAP), basically consists of assigning a very small number of frequencies to a very large set of transceivers used in a cellular phone network. Real data FAP instances are very difficult to solve due to the NP-hard nature of the problem, therefore using an efficient parallel approach which makes the most of different evolutionary strategies can be considered as a good way to obtain high-quality solutions in short periods of time. Specifically, a parallel hyper-heuristic based on several meta-heuristics has been developed. After a complete experimental evaluation, results prove that the proposed approach obtains very high-quality solutions for the FAP and beats any other result published.

  3. Use of parallel computing in mass processing of laser data

    NASA Astrophysics Data System (ADS)

    Będkowski, J.; Bratuś, R.; Prochaska, M.; Rzonca, A.

    2015-12-01

    The first part of the paper includes a description of the rules used to generate the algorithm needed for the purpose of parallel computing and also discusses the origins of the idea of research on the use of graphics processors in large scale processing of laser scanning data. The next part of the paper includes the results of an efficiency assessment performed for an array of different processing options, all of which were substantially accelerated with parallel computing. The processing options were divided into the generation of orthophotos using point clouds, coloring of point clouds, transformations, and the generation of a regular grid, as well as advanced processes such as the detection of planes and edges, point cloud classification, and the analysis of data for the purpose of quality control. Most algorithms had to be formulated from scratch in the context of the requirements of parallel computing. A few of the algorithms were based on existing technology developed by the Dephos Software Company and then adapted to parallel computing in the course of this research study. Processing time was determined for each process employed for a typical quantity of data processed, which helped confirm the high efficiency of the solutions proposed and the applicability of parallel computing to the processing of laser scanning data. The high efficiency of parallel computing yields new opportunities in the creation and organization of processing methods for laser scanning data.

  4. New preconditioning strategy for Jacobian-free solvers for variably saturated flows with Richards’ equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lipnikov, Konstantin; Moulton, David; Svyatskiy, Daniil

    2016-04-29

    We develop a new approach for solving the nonlinear Richards’ equation arising in variably saturated flow modeling. The growing complexity of geometric models for simulation of subsurface flows leads to the necessity of using unstructured meshes and advanced discretization methods. Typically, a numerical solution is obtained by first discretizing PDEs and then solving the resulting system of nonlinear discrete equations with a Newton-Raphson-type method. Efficiency and robustness of the existing solvers rely on many factors, including an empiric quality control of intermediate iterates, complexity of the employed discretization method and a customized preconditioner. We propose and analyze a new preconditioningmore » strategy that is based on a stable discretization of the continuum Jacobian. We will show with numerical experiments for challenging problems in subsurface hydrology that this new preconditioner improves convergence of the existing Jacobian-free solvers 3-20 times. Furthermore, we show that the Picard method with this preconditioner becomes a more efficient nonlinear solver than a few widely used Jacobian-free solvers.« less

  5. Scalable smoothing strategies for a geometric multigrid method for the immersed boundary equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bhalla, Amneet Pal Singh; Knepley, Matthew G.; Adams, Mark F.

    2016-12-20

    The immersed boundary (IB) method is a widely used approach to simulating fluid-structure interaction (FSI). Although explicit versions of the IB method can suffer from severe time step size restrictions, these methods remain popular because of their simplicity and generality. In prior work (Guy et al., Adv Comput Math, 2015), some of us developed a geometric multigrid preconditioner for a stable semi-implicit IB method under Stokes flow conditions; however, this solver methodology used a Vanka-type smoother that presented limited opportunities for parallelization. This work extends this Stokes-IB solver methodology by developing smoothing techniques that are suitable for parallel implementation. Specifically,more » we demonstrate that an additive version of the Vanka smoother can yield an effective multigrid preconditioner for the Stokes-IB equations, and we introduce an efficient Schur complement-based smoother that is also shown to be effective for the Stokes-IB equations. We investigate the performance of these solvers for a broad range of material stiffnesses, both for Stokes flows and flows at nonzero Reynolds numbers, and for thick and thin structural models. We show here that linear solver performance degrades with increasing Reynolds number and material stiffness, especially for thin interface cases. Nonetheless, the proposed approaches promise to yield effective solution algorithms, especially at lower Reynolds numbers and at modest-to-high elastic stiffnesses.« less

  6. Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10.

    PubMed

    Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P

    2014-10-30

    Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.

  7. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

  8. Parallel computing works

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Not Available

    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of manymore » computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.« less

  9. Graph Partitioning for Parallel Applications in Heterogeneous Grid Environments

    NASA Technical Reports Server (NTRS)

    Bisws, Rupak; Kumar, Shailendra; Das, Sajal K.; Biegel, Bryan (Technical Monitor)

    2002-01-01

    The problem of partitioning irregular graphs and meshes for parallel computations on homogeneous systems has been extensively studied. However, these partitioning schemes fail when the target system architecture exhibits heterogeneity in resource characteristics. With the emergence of technologies such as the Grid, it is imperative to study the partitioning problem taking into consideration the differing capabilities of such distributed heterogeneous systems. In our model, the heterogeneous system consists of processors with varying processing power and an underlying non-uniform communication network. We present in this paper a novel multilevel partitioning scheme for irregular graphs and meshes, that takes into account issues pertinent to Grid computing environments. Our partitioning algorithm, called MiniMax, generates and maps partitions onto a heterogeneous system with the objective of minimizing the maximum execution time of the parallel distributed application. For experimental performance study, we have considered both a realistic mesh problem from NASA as well as synthetic workloads. Simulation results demonstrate that MiniMax generates high quality partitions for various classes of applications targeted for parallel execution in a distributed heterogeneous environment.

  10. On the scalability of the Albany/FELIX first-order Stokes approximation ice sheet solver for large-scale simulations of the Greenland and Antarctic ice sheets

    DOE PAGES

    Tezaur, Irina K.; Tuminaro, Raymond S.; Perego, Mauro; ...

    2015-01-01

    We examine the scalability of the recently developed Albany/FELIX finite-element based code for the first-order Stokes momentum balance equations for ice flow. We focus our analysis on the performance of two possible preconditioners for the iterative solution of the sparse linear systems that arise from the discretization of the governing equations: (1) a preconditioner based on the incomplete LU (ILU) factorization, and (2) a recently-developed algebraic multigrid (AMG) preconditioner, constructed using the idea of semi-coarsening. A strong scalability study on a realistic, high resolution Greenland ice sheet problem reveals that, for a given number of processor cores, the AMG preconditionermore » results in faster linear solve times but the ILU preconditioner exhibits better scalability. In addition, a weak scalability study is performed on a realistic, moderate resolution Antarctic ice sheet problem, a substantial fraction of which contains floating ice shelves, making it fundamentally different from the Greenland ice sheet problem. We show that as the problem size increases, the performance of the ILU preconditioner deteriorates whereas the AMG preconditioner maintains scalability. This is because the linear systems are extremely ill-conditioned in the presence of floating ice shelves, and the ill-conditioning has a greater negative effect on the ILU preconditioner than on the AMG preconditioner.« less

  11. Advanced fabrication of Si nanowire FET structures by means of a parallel approach.

    PubMed

    Li, J; Pud, S; Mayer, D; Vitusevich, S

    2014-07-11

    In this paper we present fabricated Si nanowires (NWs) of different dimensions with enhanced electrical characteristics. The parallel fabrication process is based on nanoimprint lithography using high-quality molds, which facilitates the realization of 50 nm-wide NW field-effect transistors (FETs). The imprint molds were fabricated by using a wet chemical anisotropic etching process. The wet chemical etch results in well-defined vertical sidewalls with edge roughness (3σ) as small as 2 nm, which is about four times better compared with the roughness usually obtained for reactive-ion etching molds. The quality of the mold was studied using atomic force microscopy and scanning electron microscopy image data. The use of the high-quality mold leads to almost 100% yield during fabrication of Si NW FETs as well as to an exceptional quality of the surfaces of the devices produced. To characterize the Si NW FETs, we used noise spectroscopy as a powerful method for evaluating device performance and the reliability of structures with nanoscale dimensions. The Hooge parameter of fabricated FET structures exhibits an average value of 1.6 × 10(-3). This value reflects the high quality of Si NW FETs fabricated by means of a parallel approach that uses a nanoimprint mold and cost-efficient technology.

  12. Template based parallel checkpointing in a massively parallel computer system

    DOEpatents

    Archer, Charles Jens [Rochester, MN; Inglett, Todd Alan [Rochester, MN

    2009-01-13

    A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size.

  13. The application of parallel wells to support the use of groundwater for sustainable irrigation

    NASA Astrophysics Data System (ADS)

    Suhardi

    2018-05-01

    The use of groundwater as a source of irrigation is one alternative in meeting water needs of plants. Using groundwater for irrigation requires a high cost because of the discharge that can be taken is limited. In addition, the use of large groundwater can cause environmental damage and social conflict. To minimize costs, maintain quality of the environment and to prevent social conflicts, it is necessary to innovate in the groundwater taking system. The study was conducted with an innovation of using parallel wells. Performance is measured by comparing parallel wells with a single well. The results showed that the use of parallel wells to meet the water needs of rice plants and increase the pump discharge up to 100%. In addition, parallel wells can reduce the influence radius of taking of groundwater compared to single well so as to prevent social conflict. Thus, the use of parallel wells can support the achievement of the use of groundwater for sustainable irrigation.

  14. Domain Decomposition By the Advancing-Partition Method for Parallel Unstructured Grid Generation

    NASA Technical Reports Server (NTRS)

    Pirzadeh, Shahyar Z.; Zagaris, George

    2009-01-01

    A new method of domain decomposition has been developed for generating unstructured grids in subdomains either sequentially or using multiple computers in parallel. Domain decomposition is a crucial and challenging step for parallel grid generation. Prior methods are generally based on auxiliary, complex, and computationally intensive operations for defining partition interfaces and usually produce grids of lower quality than those generated in single domains. The new technique, referred to as "Advancing Partition," is based on the Advancing-Front method, which partitions a domain as part of the volume mesh generation in a consistent and "natural" way. The benefits of this approach are: 1) the process of domain decomposition is highly automated, 2) partitioning of domain does not compromise the quality of the generated grids, and 3) the computational overhead for domain decomposition is minimal. The new method has been implemented in NASA's unstructured grid generation code VGRID.

  15. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    As the I/O needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. The interface conceals the parallelism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. We discuss Galley's file structure and application interface, as well as an application that has been implemented using that interface.

  16. Parallel solution of the symmetric tridiagonal eigenproblem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jessup, E.R.

    1989-01-01

    This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed memory MIMD multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues, and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speedup, and accuracy. Experiments on an iPSC hyper-cube multiprocessor reveal that Cuppen's method is the most accuratemore » approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effects of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptations of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less

  17. Parallel digital forensics infrastructure.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liebrock, Lorie M.; Duggan, David Patrick

    2009-10-01

    This report documents the architecture and implementation of a Parallel Digital Forensics infrastructure. This infrastructure is necessary for supporting the design, implementation, and testing of new classes of parallel digital forensics tools. Digital Forensics has become extremely difficult with data sets of one terabyte and larger. The only way to overcome the processing time of these large sets is to identify and develop new parallel algorithms for performing the analysis. To support algorithm research, a flexible base infrastructure is required. A candidate architecture for this base infrastructure was designed, instantiated, and tested by this project, in collaboration with New Mexicomore » Tech. Previous infrastructures were not designed and built specifically for the development and testing of parallel algorithms. With the size of forensics data sets only expected to increase significantly, this type of infrastructure support is necessary for continued research in parallel digital forensics. This report documents the implementation of the parallel digital forensics (PDF) infrastructure architecture and implementation.« less

  18. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Lau, Sonie

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert system. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.

  19. Computer-Aided Parallelizer and Optimizer

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  20. Flexure mechanism-based parallelism measurements for chip-on-glass bonding

    NASA Astrophysics Data System (ADS)

    Jung, Seung Won; Yun, Won Soo; Jin, Songwan; Kim, Bo Sun; Jeong, Young Hun

    2011-08-01

    Recently, liquid crystal displays (LCDs) have played vital roles in a variety of electronic devices such as televisions, cellular phones, and desktop/laptop monitors because of their enhanced volume, performance, and functionality. However, there is still a need for thinner LCD panels due to the trend of miniaturization in electronic applications. Thus, chip-on-glass (COG) bonding has become one of the most important aspects in the LCD panel manufacturing process. In this study, a novel sensor was developed to measure the parallelism between the tooltip planes of the bonding head and the backup of the COG main bonder, which has previously been estimated by prescale pressure films in industry. The sensor developed in this study is based on a flexure mechanism, and it can measure the total pressing force and the inclination angles in two directions that satisfy the quantitative definition of parallelism. To improve the measurement accuracy, the sensor was calibrated based on the estimation of the total pressing force and the inclination angles using the least-squares method. To verify the accuracy of the sensor, the estimation results for parallelism were compared with those from prescale pressure film measurements. In addition, the influence of parallelism on the bonding quality was experimentally demonstrated. The sensor was successfully applied to the measurement of parallelism in the COG-bonding process with an accuracy of more than three times that of the conventional method using prescale pressure films.

  1. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-11-12

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.

  2. The remote sensing image segmentation mean shift algorithm parallel processing based on MapReduce

    NASA Astrophysics Data System (ADS)

    Chen, Xi; Zhou, Liqing

    2015-12-01

    With the development of satellite remote sensing technology and the remote sensing image data, traditional remote sensing image segmentation technology cannot meet the massive remote sensing image processing and storage requirements. This article put cloud computing and parallel computing technology in remote sensing image segmentation process, and build a cheap and efficient computer cluster system that uses parallel processing to achieve MeanShift algorithm of remote sensing image segmentation based on the MapReduce model, not only to ensure the quality of remote sensing image segmentation, improved split speed, and better meet the real-time requirements. The remote sensing image segmentation MeanShift algorithm parallel processing algorithm based on MapReduce shows certain significance and a realization of value.

  3. Parallel Algorithms and Patterns

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Robey, Robert W.

    2016-06-16

    This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.

  4. Application Portable Parallel Library

    NASA Technical Reports Server (NTRS)

    Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott

    1995-01-01

    Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.

  5. Parallel Logic Programming and Parallel Systems Software and Hardware

    DTIC Science & Technology

    1989-07-29

    Conference, Dallas TX. January 1985. (55) [Rous75] Roussel, P., "PROLOG: Manuel de Reference et d’Uilisation", Group d’ Intelligence Artificielle , Universite d...completed. Tools were provided for software development using artificial intelligence techniques. Al software for massively parallel architectures was...using artificial intelligence tech- niques. Al software for massively parallel architectures was started. 1. Introduction We describe research conducted

  6. Domain decomposition in time for PDE-constrained optimization

    DOE PAGES

    Barker, Andrew T.; Stoll, Martin

    2015-08-28

    Here, PDE-constrained optimization problems have a wide range of applications, but they lead to very large and ill-conditioned linear systems, especially if the problems are time dependent. In this paper we outline an approach for dealing with such problems by decomposing them in time and applying an additive Schwarz preconditioner in time, so that we can take advantage of parallel computers to deal with the very large linear systems. We then illustrate the performance of our method on a variety of problems.

  7. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R

    Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (`PAMI`) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single onemore » of the endpoints in the geometry, an instruction for the collective operation.« less

  8. QR-decomposition based SENSE reconstruction using parallel architecture.

    PubMed

    Ullah, Irfan; Nisar, Habab; Raza, Haseeb; Qasim, Malik; Inam, Omair; Omer, Hammad

    2018-04-01

    Magnetic Resonance Imaging (MRI) is a powerful medical imaging technique that provides essential clinical information about the human body. One major limitation of MRI is its long scan time. Implementation of advance MRI algorithms on a parallel architecture (to exploit inherent parallelism) has a great potential to reduce the scan time. Sensitivity Encoding (SENSE) is a Parallel Magnetic Resonance Imaging (pMRI) algorithm that utilizes receiver coil sensitivities to reconstruct MR images from the acquired under-sampled k-space data. At the heart of SENSE lies inversion of a rectangular encoding matrix. This work presents a novel implementation of GPU based SENSE algorithm, which employs QR decomposition for the inversion of the rectangular encoding matrix. For a fair comparison, the performance of the proposed GPU based SENSE reconstruction is evaluated against single and multicore CPU using openMP. Several experiments against various acceleration factors (AFs) are performed using multichannel (8, 12 and 30) phantom and in-vivo human head and cardiac datasets. Experimental results show that GPU significantly reduces the computation time of SENSE reconstruction as compared to multi-core CPU (approximately 12x speedup) and single-core CPU (approximately 53x speedup) without any degradation in the quality of the reconstructed images. Copyright © 2018 Elsevier Ltd. All rights reserved.

  9. Neural Parallel Engine: A toolbox for massively parallel neural signal processing.

    PubMed

    Tam, Wing-Kin; Yang, Zhi

    2018-05-01

    Large-scale neural recordings provide detailed information on neuronal activities and can help elicit the underlying neural mechanisms of the brain. However, the computational burden is also formidable when we try to process the huge data stream generated by such recordings. In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing such as spike detection and spike sorting, including advanced algorithms such as exponential-component-power-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compact operation. Our toolbox is able to offer a 5× to 110× speedup compared with its CPU counterparts depending on the algorithms. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows. Previous efforts on GPU neural signal processing only focus on a few rudimentary algorithms, are not well-optimized and often do not provide a user-friendly programming interface to fit into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing. A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels. Copyright © 2018 Elsevier B.V. All rights reserved.

  10. Robot-assisted ultrasound imaging: overview and development of a parallel telerobotic system.

    PubMed

    Monfaredi, Reza; Wilson, Emmanuel; Azizi Koutenaei, Bamshad; Labrecque, Brendan; Leroy, Kristen; Goldie, James; Louis, Eric; Swerdlow, Daniel; Cleary, Kevin

    2015-02-01

    Ultrasound imaging is frequently used in medicine. The quality of ultrasound images is often dependent on the skill of the sonographer. Several researchers have proposed robotic systems to aid in ultrasound image acquisition. In this paper we first provide a short overview of robot-assisted ultrasound imaging (US). We categorize robot-assisted US imaging systems into three approaches: autonomous US imaging, teleoperated US imaging, and human-robot cooperation. For each approach several systems are introduced and briefly discussed. We then describe a compact six degree of freedom parallel mechanism telerobotic system for ultrasound imaging developed by our research team. The long-term goal of this work is to enable remote ultrasound scanning through teleoperation. This parallel mechanism allows for both translation and rotation of an ultrasound probe mounted on the top plate along with force control. Our experimental results confirmed good mechanical system performance with a positioning error of < 1 mm. Phantom experiments by a radiologist showed promising results with good image quality.

  11. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2014-02-11

    Data communications in a parallel active messaging interface ('PAMI') or a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution of a compute node, including specification of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications instruction, the instruction characterized by instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance witht the instruction type, the transfer data from the origin endpoin to the target endpoint.

  12. Increasing the reach of forensic genetics with massively parallel sequencing.

    PubMed

    Budowle, Bruce; Schmedes, Sarah E; Wendt, Frank R

    2017-09-01

    The field of forensic genetics has made great strides in the analysis of biological evidence related to criminal and civil matters. More so, the discipline has set a standard of performance and quality in the forensic sciences. The advent of massively parallel sequencing will allow the field to expand its capabilities substantially. This review describes the salient features of massively parallel sequencing and how it can impact forensic genetics. The features of this technology offer increased number and types of genetic markers that can be analyzed, higher throughput of samples, and the capability of targeting different organisms, all by one unifying methodology. While there are many applications, three are described where massively parallel sequencing will have immediate impact: molecular autopsy, microbial forensics and differentiation of monozygotic twins. The intent of this review is to expose the forensic science community to the potential enhancements that have or are soon to arrive and demonstrate the continued expansion the field of forensic genetics and its service in the investigation of legal matters.

  13. Parallel design patterns for a low-power, software-defined compressed video encoder

    NASA Astrophysics Data System (ADS)

    Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar

    2011-06-01

    Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allow the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memorynetwork processor device.

  14. On optimal improvements of classical iterative schemes for Z-matrices

    NASA Astrophysics Data System (ADS)

    Noutsos, D.; Tzoumas, M.

    2006-04-01

    Many researchers have considered preconditioners, applied to linear systems, whose matrix coefficient is a Z- or an M-matrix, that make the associated Jacobi and Gauss-Seidel methods converge asymptotically faster than the unpreconditioned ones. Such preconditioners are chosen so that they eliminate the off-diagonal elements of the same column or the elements of the first upper diagonal [Milaszewicz, LAA 93 (1987) 161-170], Gunawardena et al. [LAA 154-156 (1991) 123-143]. In this work we generalize the previous preconditioners to obtain optimal methods. "Good" Jacobi and Gauss-Seidel algorithms are given and preconditioners, that eliminate more than one entry per row, are also proposed and analyzed. Moreover, the behavior of the above preconditioners to the Krylov subspace methods is studied.

  15. Parallels in History.

    ERIC Educational Resources Information Center

    Mugleston, William F.

    2000-01-01

    Believes that by focusing on the recurrent situations and problems, or parallels, throughout history, students will understand the relevance of history to their own times and lives. Provides suggestions for parallels in history that may be introduced within lectures or as a means to class discussions. (CMK)

  16. SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws

    NASA Technical Reports Server (NTRS)

    Cooke, Daniel; Rushton, Nelson

    2013-01-01

    costly than development of comparable parallel code. Moreover, SequenceL not only automatically parallelizes the code, but since it is based on CSP-NT, it is provably race free, thus eliminating the largest quality challenge the parallelized software developer faces.

  17. Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Choudhary, Alok Nidhi

    1989-01-01

    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.

  18. Integrated Task and Data Parallel Programming

    NASA Technical Reports Server (NTRS)

    Grimshaw, A. S.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities During the fall I collaborated

  19. A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Lim, Chieng-Fai

    1991-01-01

    The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools such as the SYLON synthesis system (X90), (CM89), (LM90) have been developed based on this method. A parallel implementation is presented of SYLON-XTRANS (XM89) on an eight processor Encore Multimax shared memory multiprocessor. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.

  20. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-10-29

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint.

  1. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2014-10-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.

  2. File concepts for parallel I/O

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1989-01-01

    The subject of input/output (I/O) was often neglected in the design of parallel computer systems, although for many problems I/O rates will limit the speedup attainable. The I/O problem is addressed by considering the role of files in parallel systems. The notion of parallel files is introduced. Parallel files provide for concurrent access by multiple processes, and utilize parallelism in the I/O system to improve performance. Parallel files can also be used conventionally by sequential programs. A set of standard parallel file organizations is proposed, organizations are suggested, using multiple storage devices. Problem areas are also identified and discussed.

  3. Non-Cartesian Parallel Imaging Reconstruction

    PubMed Central

    Wright, Katherine L.; Hamilton, Jesse I.; Griswold, Mark A.; Gulani, Vikas; Seiberlich, Nicole

    2014-01-01

    Non-Cartesian parallel imaging has played an important role in reducing data acquisition time in MRI. The use of non-Cartesian trajectories can enable more efficient coverage of k-space, which can be leveraged to reduce scan times. These trajectories can be undersampled to achieve even faster scan times, but the resulting images may contain aliasing artifacts. Just as Cartesian parallel imaging can be employed to reconstruct images from undersampled Cartesian data, non-Cartesian parallel imaging methods can mitigate aliasing artifacts by using additional spatial encoding information in the form of the non-homogeneous sensitivities of multi-coil phased arrays. This review will begin with an overview of non-Cartesian k-space trajectories and their sampling properties, followed by an in-depth discussion of several selected non-Cartesian parallel imaging algorithms. Three representative non-Cartesian parallel imaging methods will be described, including Conjugate Gradient SENSE (CG SENSE), non-Cartesian GRAPPA, and Iterative Self-Consistent Parallel Imaging Reconstruction (SPIRiT). After a discussion of these three techniques, several potential promising clinical applications of non-Cartesian parallel imaging will be covered. PMID:24408499

  4. Parallel processing in the honeybee olfactory pathway: structure, function, and evolution.

    PubMed

    Rössler, Wolfgang; Brill, Martin F

    2013-11-01

    Animals face highly complex and dynamic olfactory stimuli in their natural environments, which require fast and reliable olfactory processing. Parallel processing is a common principle of sensory systems supporting this task, for example in visual and auditory systems, but its role in olfaction remained unclear. Studies in the honeybee focused on a dual olfactory pathway. Two sets of projection neurons connect glomeruli in two antennal-lobe hemilobes via lateral and medial tracts in opposite sequence with the mushroom bodies and lateral horn. Comparative studies suggest that this dual-tract circuit represents a unique adaptation in Hymenoptera. Imaging studies indicate that glomeruli in both hemilobes receive redundant sensory input. Recent simultaneous multi-unit recordings from projection neurons of both tracts revealed widely overlapping response profiles strongly indicating parallel olfactory processing. Whereas lateral-tract neurons respond fast with broad (generalistic) profiles, medial-tract neurons are odorant specific and respond slower. In analogy to "what-" and "where" subsystems in visual pathways, this suggests two parallel olfactory subsystems providing "what-" (quality) and "when" (temporal) information. Temporal response properties may support across-tract coincidence coding in higher centers. Parallel olfactory processing likely enhances perception of complex odorant mixtures to decode the diverse and dynamic olfactory world of a social insect.

  5. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less

  6. Fully Parallel MHD Stability Analysis Tool

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2015-11-01

    Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fix boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.

  7. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    NASA Technical Reports Server (NTRS)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.

  8. Parallel STEPS: Large Scale Stochastic Spatial Reaction-Diffusion Simulation with High Performance Computers.

    PubMed

    Chen, Weiliang; De Schutter, Erik

    2017-01-01

    Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation.

  9. A Simple Application of Compressed Sensing to Further Accelerate Partially Parallel Imaging

    PubMed Central

    Miao, Jun; Guo, Weihong; Narayan, Sreenath; Wilson, David L.

    2012-01-01

    Compressed Sensing (CS) and partially parallel imaging (PPI) enable fast MR imaging by reducing the amount of k-space data required for reconstruction. Past attempts to combine these two have been limited by the incoherent sampling requirement of CS, since PPI routines typically sample on a regular (coherent) grid. Here, we developed a new method, “CS+GRAPPA,” to overcome this limitation. We decomposed sets of equidistant samples into multiple random subsets. Then, we reconstructed each subset using CS, and averaging the results to get a final CS k-space reconstruction. We used both a standard CS, and an edge and joint-sparsity guided CS reconstruction. We tested these intermediate results on both synthetic and real MR phantom data, and performed a human observer experiment to determine the effectiveness of decomposition, and to optimize the number of subsets. We then used these CS reconstructions to calibrate the GRAPPA complex coil weights. In vivo parallel MR brain and heart data sets were used. An objective image quality evaluation metric, Case-PDM, was used to quantify image quality. Coherent aliasing and noise artifacts were significantly reduced using two decompositions. More decompositions further reduced coherent aliasing and noise artifacts but introduced blurring. However, the blurring was effectively minimized using our new edge and joint-sparsity guided CS using two decompositions. Numerical results on parallel data demonstrated that the combined method greatly improved image quality as compared to standard GRAPPA, on average halving Case-PDM scores across a range of sampling rates. The proposed technique allowed the same Case-PDM scores as standard GRAPPA, using about half the number of samples. We conclude that the new method augments GRAPPA by combining it with CS, allowing CS to work even when the k-space sampling pattern is equidistant. PMID:22902065

  10. Hypergraph partitioning implementation for parallelizing matrix-vector multiplication using CUDA GPU-based parallel computing

    NASA Astrophysics Data System (ADS)

    Murni, Bustamam, A.; Ernastuti, Handhika, T.; Kerami, D.

    2017-07-01

    Calculation of the matrix-vector multiplication in the real-world problems often involves large matrix with arbitrary size. Therefore, parallelization is needed to speed up the calculation process that usually takes a long time. Graph partitioning techniques that have been discussed in the previous studies cannot be used to complete the parallelized calculation of matrix-vector multiplication with arbitrary size. This is due to the assumption of graph partitioning techniques that can only solve the square and symmetric matrix. Hypergraph partitioning techniques will overcome the shortcomings of the graph partitioning technique. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model that was created by NVIDIA and implemented by the GPU (graphics processing unit).

  11. Parallel seed-based approach to multiple protein structure similarities detection

    DOE PAGES

    Chapuis, Guillaume; Le Boudic-Jamin, Mathilde; Andonov, Rumen; ...

    2015-01-01

    Finding similarities between protein structures is a crucial task in molecular biology. Most of the existing tools require proteins to be aligned in order-preserving way and only find single alignments even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee—the returned alignments have both root mean squared deviations (coordinate-based as well as internal-distances based) lower than a given threshold, if such exist. We do not require the alignments to be order preserving (i.e., we consider nonsequential alignments), which makesmore » our algorithm suitable for detecting similar domains when comparing multidomain proteins as well as to detect structural repetitions within a single protein. Because the search space for nonsequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level concurrency as well as vector instructions.« less

  12. Two new modified Gauss-Seidel methods for linear system with M-matrices

    NASA Astrophysics Data System (ADS)

    Zheng, Bing; Miao, Shu-Xin

    2009-12-01

    In 2002, H. Kotakemori et al. proposed the modified Gauss-Seidel (MGS) method for solving the linear system with the preconditioner [H. Kotakemori, K. Harada, M. Morimoto, H. Niki, A comparison theorem for the iterative method with the preconditioner () J. Comput. Appl. Math. 145 (2002) 373-378]. Since this preconditioner is constructed by only the largest element on each row of the upper triangular part of the coefficient matrix, the preconditioning effect is not observed on the nth row. In the present paper, to deal with this drawback, we propose two new preconditioners. The convergence and comparison theorems of the modified Gauss-Seidel methods with these two preconditioners for solving the linear system are established. The convergence rates of the new proposed preconditioned methods are compared. In addition, numerical experiments are used to show the effectiveness of the new MGS methods.

  13. Parallel Computing Strategies for Irregular Algorithms

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

    2002-01-01

    Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.

  14. CRUNCH_PARALLEL

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shumaker, Dana E.; Steefel, Carl I.

    The code CRUNCH_PARALLEL is a parallel version of the CRUNCH code. CRUNCH code version 2.0 was previously released by LLNL, (UCRL-CODE-200063). Crunch is a general purpose reactive transport code developed by Carl Steefel and Yabusake (Steefel Yabsaki 1996). The code handles non-isothermal transport and reaction in one, two, and three dimensions. The reaction algorithm is generic in form, handling an arbitrary number of aqueous and surface complexation as well as mineral dissolution/precipitation. A standardized database is used containing thermodynamic and kinetic data. The code includes advective, dispersive, and diffusive transport.

  15. Parallel processing and expert systems

    NASA Technical Reports Server (NTRS)

    Lau, Sonie; Yan, Jerry C.

    1991-01-01

    Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.

  16. CSM parallel structural methods research

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1989-01-01

    Parallel structural methods, research team activities, advanced architecture computers for parallel computational structural mechanics (CSM) research, the FLEX/32 multicomputer, a parallel structural analyses testbed, blade-stiffened aluminum panel with a circular cutout and the dynamic characteristics of a 60 meter, 54-bay, 3-longeron deployable truss beam are among the topics discussed.

  17. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining

  18. Simplified Parallel Domain Traversal

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Erickson III, David J

    2011-01-01

    Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. In order to deliver both simplicity to users as well as scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep bymore » performing teleconnection analysis across ensemble runs of terascale atmospheric CO{sub 2} and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.« less

  19. The NAS parallel benchmarks

    NASA Technical Reports Server (NTRS)

    Bailey, David (Editor); Barton, John (Editor); Lasinski, Thomas (Editor); Simon, Horst (Editor)

    1993-01-01

    A new set of benchmarks was developed for the performance evaluation of highly parallel supercomputers. These benchmarks consist of a set of kernels, the 'Parallel Kernels,' and a simulated application benchmark. Together they mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification - all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

  20. Parallel flow diffusion battery

    DOEpatents

    Yeh, H.C.; Cheng, Y.S.

    1984-01-01

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  1. Parallel flow diffusion battery

    DOEpatents

    Yeh, Hsu-Chi; Cheng, Yung-Sung

    1984-08-07

    A parallel flow diffusion battery for determining the mass distribution of an aerosol has a plurality of diffusion cells mounted in parallel to an aerosol stream, each diffusion cell including a stack of mesh wire screens of different density.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Spotz, William F.

    PyTrilinos is a set of Python interfaces to compiled Trilinos packages. This collection supports serial and parallel dense linear algebra, serial and parallel sparse linear algebra, direct and iterative linear solution techniques, algebraic and multilevel preconditioners, nonlinear solvers and continuation algorithms, eigensolvers and partitioning algorithms. Also included are a variety of related utility functions and classes, including distributed I/O, coloring algorithms and matrix generation. PyTrilinos vector objects are compatible with the popular NumPy Python package. As a Python front end to compiled libraries, PyTrilinos takes advantage of the flexibility and ease of use of Python, and the efficiency of themore » underlying C++, C and Fortran numerical kernels. This paper covers recent, previously unpublished advances in the PyTrilinos package.« less

  3. Conjugate-gradient preconditioning methods for shift-variant PET image reconstruction.

    PubMed

    Fessler, J A; Booth, S D

    1999-01-01

    Gradient-based iterative methods often converge slowly for tomographic image reconstruction and image restoration problems, but can be accelerated by suitable preconditioners. Diagonal preconditioners offer some improvement in convergence rate, but do not incorporate the structure of the Hessian matrices in imaging problems. Circulant preconditioners can provide remarkable acceleration for inverse problems that are approximately shift-invariant, i.e., for those with approximately block-Toeplitz or block-circulant Hessians. However, in applications with nonuniform noise variance, such as arises from Poisson statistics in emission tomography and in quantum-limited optical imaging, the Hessian of the weighted least-squares objective function is quite shift-variant, and circulant preconditioners perform poorly. Additional shift-variance is caused by edge-preserving regularization methods based on nonquadratic penalty functions. This paper describes new preconditioners that approximate more accurately the Hessian matrices of shift-variant imaging problems. Compared to diagonal or circulant preconditioning, the new preconditioners lead to significantly faster convergence rates for the unconstrained conjugate-gradient (CG) iteration. We also propose a new efficient method for the line-search step required by CG methods. Applications to positron emission tomography (PET) illustrate the method.

  4. A parallel row-based algorithm with error control for standard-cell replacement on a hypercube multiprocessor

    NASA Technical Reports Server (NTRS)

    Sargent, Jeff Scott

    1988-01-01

    A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel new approaches to controlling error in parallel cell-placement algorithms; Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits and the performance is summarized of an implementation on the Intel iPSC/2 Hypercube. The runtime of this algorithm is 5 to 16 times faster than a previous program developed for the Hypercube, while producing equivalent quality placement. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.

  5. Parallel consistent labeling algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samal, A.; Henderson, T.

    Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms. Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms. In this paper, they give parallel algorithms for solving node and arc consistency. They show that any parallel algorithm for enforcing arc consistency in the worst case must have O(na) sequential steps, where n is number of nodes, and a is the number of labels per node.more » They give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.« less

  6. Shedding New Light on Exploding Stars: Terascale Simulations of Nuetrino-Dreiven Supernovas and Their Nucleosynthesis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dennis C. Smolarski, S.J.

    Project Abstract This project was a continuation of work begun under a subcontract issued off of TSI-DOE Grant 1528746, awarded to the University of Illinois Urbana-Champaign. Dr. Anthony Mezzacappa is the Principal Investigator on the Illinois award. A separate award was issued to Santa Clara University to continue the collaboration during the time period May 2003 ? 2004. Smolarski continued to work on preconditioner technology and its interface with various iterative methods. He worked primarily with F. Dough Swesty (SUNY-Stony Brook) in continuing software development started in the 2002-03 academic year. Special attention was paid to the development and testingmore » of difference sparse approximate inverse preconditioners and their use in the solution of linear systems arising from radiation transport equations. The target was a high performance platform on which efficient implementation is a critical component of the overall effort. Smolarski also focused on the integration of the adaptive iterative algorithm, Chebycode, developed by Tom Manteuffel and Steve Ashby and adapted by Ryan Szypowski for parallel platforms, into the radiation transport code being developed at SUNY-Stony Brook.« less

  7. Physics-Based Preconditioning of a Compressible Flow Solver for Large-Scale Simulations of Additive Manufacturing Processes

    NASA Astrophysics Data System (ADS)

    Weston, Brian; Nourgaliev, Robert; Delplanque, Jean-Pierre

    2017-11-01

    We present a new block-based Schur complement preconditioner for simulating all-speed compressible flow with phase change. The conservation equations are discretized with a reconstructed Discontinuous Galerkin method and integrated in time with fully implicit time discretization schemes. The resulting set of non-linear equations is converged using a robust Newton-Krylov framework. Due to the stiffness of the underlying physics associated with stiff acoustic waves and viscous material strength effects, we solve for the primitive-variables (pressure, velocity, and temperature). To enable convergence of the highly ill-conditioned linearized systems, we develop a physics-based preconditioner, utilizing approximate block factorization techniques to reduce the fully-coupled 3×3 system to a pair of reduced 2×2 systems. We demonstrate that our preconditioned Newton-Krylov framework converges on very stiff multi-physics problems, corresponding to large CFL and Fourier numbers, with excellent algorithmic and parallel scalability. Results are shown for the classic lid-driven cavity flow problem as well as for 3D laser-induced phase change. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

  8. Parallel STEPS: Large Scale Stochastic Spatial Reaction-Diffusion Simulation with High Performance Computers

    PubMed Central

    Chen, Weiliang; De Schutter, Erik

    2017-01-01

    Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of models and morphologies have exceeded the capacity of any serial implementation. This led to the development of parallel solutions that benefit from the boost in performance of modern supercomputers. In this paper, we describe an MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its application to real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2,000 processes runs more than 3,600 times faster than its serial SSA counterpart, and achieves more than 20-fold speedup relative to parallel simulation with 100 processes. In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA simulation. PMID:28239346

  9. Partitioning in parallel processing of production systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Oflazer, K.

    1987-01-01

    This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpretermore » with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.« less

  10. The NAS parallel benchmarks

    NASA Technical Reports Server (NTRS)

    Bailey, D. H.; Barszcz, E.; Barton, J. T.; Carter, R. L.; Lasinski, T. A.; Browning, D. S.; Dagum, L.; Fatoohi, R. A.; Frederickson, P. O.; Schreiber, R. S.

    1991-01-01

    A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers in the framework of the NASA Ames Numerical Aerodynamic Simulation (NAS) Program. These consist of five 'parallel kernel' benchmarks and three 'simulated application' benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their 'pencil and paper' specification-all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

  11. Sublattice parallel replica dynamics.

    PubMed

    Martínez, Enrique; Uberuaga, Blas P; Voter, Arthur F

    2014-06-01

    Exascale computing presents a challenge for the scientific community as new algorithms must be developed to take full advantage of the new computing paradigm. Atomistic simulation methods that offer full fidelity to the underlying potential, i.e., molecular dynamics (MD) and parallel replica dynamics, fail to use the whole machine speedup, leaving a region in time and sample size space that is unattainable with current algorithms. In this paper, we present an extension of the parallel replica dynamics algorithm [A. F. Voter, Phys. Rev. B 57, R13985 (1998)] by combining it with the synchronous sublattice approach of Shim and Amar [ and , Phys. Rev. B 71, 125432 (2005)], thereby exploiting event locality to improve the algorithm scalability. This algorithm is based on a domain decomposition in which events happen independently in different regions in the sample. We develop an analytical expression for the speedup given by this sublattice parallel replica dynamics algorithm and compare it with parallel MD and traditional parallel replica dynamics. We demonstrate how this algorithm, which introduces a slight additional approximation of event locality, enables the study of physical systems unreachable with traditional methodologies and promises to better utilize the resources of current high performance and future exascale computers.

  12. The Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.

  13. Parallel solution of the symmetric tridiagonal eigenproblem. Research report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jessup, E.R.

    1989-10-01

    This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed-memory Multiple Instruction, Multiple Data multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speed up, and accuracy. Experiments on an IPSC hypercube multiprocessor reveal that Cuppen's method ismore » the most accurate approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effect of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptions of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less

  14. Parallel computation with the force

    NASA Technical Reports Server (NTRS)

    Jordan, H. F.

    1985-01-01

    A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.

  15. Design considerations for parallel graphics libraries

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.

    1994-01-01

    Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.

  16. A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.; Markos, A. T.

    1975-01-01

    A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.

  17. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO (CAPtools (Computer Aided Parallelization Toolkit) OpenMP) parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report some results for several benchmark codes and one full application that have been parallelized using our system.

  18. A parallel finite element procedure for contact-impact problems using edge-based smooth triangular element and GPU

    NASA Astrophysics Data System (ADS)

    Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang

    2018-04-01

    The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.

  19. Parallelization of the FLAPW method

    NASA Astrophysics Data System (ADS)

    Canning, A.; Mannstadt, W.; Freeman, A. J.

    2000-08-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.

  20. Vortex phase diagram of the layered superconductor Cu0.03TaS2 for H \\parallel c

    NASA Astrophysics Data System (ADS)

    Zhu, X. D.; Lu, J. C.; Sun, Y. P.; Pi, L.; Qu, Z.; Ling, L. S.; Yang, Z. R.; Zhang, Y. H.

    2010-12-01

    The magnetization and anisotropic electrical transport properties have been measured in high quality Cu0.03TaS2 single crystals. A pronounced peak effect has been observed, indicating that high quality and homogeneity are vital to the peak effect. A kink has been observed in the magnetic field, H, dependence of the in-plane resistivity ρab for H\\parallel c , which corresponds to a transition from activated to diffusive behavior of the vortex liquid phase. In the diffusive regime of the vortex liquid phase, the in-plane resistivity ρab is proportional to H0.3, which does not follow the Bardeen-Stephen law for free flux flow. Finally, a simplified vortex phase diagram of Cu0.03TaS2 for H \\parallel c is given.

  1. Using Parallel Processing for Problem Solving.

    DTIC Science & Technology

    1979-12-01

    are the basic parallel proces- sing primitive . Different goals of the system can be pursued in parallel by placing them in separate activities...Language primitives are provided for manipulating running activities. Viewpoints are a generalization of context FOM -(over "*’ DD I FON 1473 ’EDITION OF I...arc the basic parallel processing primitive . Different goals of the system can be pursued in parallel by placing them in separate activities. Language

  2. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)

    2001-01-01

    This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs including the CAPTools Database which provides variable definition information across subroutines as well as array distribution information.

  3. Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.

    PubMed

    Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B

    2016-01-01

    Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.

  4. Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

    PubMed Central

    Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan

    2016-01-01

    Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230

  5. Turbomachinery CFD on parallel computers

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Milner, Edward J.; Quealy, Angela; Townsend, Scott E.

    1992-01-01

    The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized. Parallel turbomachinery CFD research at the NASA Lewis Research Center is then highlighted. These efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.

  6. Parallel CE/SE Computations via Domain Decomposition

    NASA Technical Reports Server (NTRS)

    Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung

    2000-01-01

    This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.

  7. Research of the effectiveness of parallel multithreaded realizations of interpolation methods for scaling raster images

    NASA Astrophysics Data System (ADS)

    Vnukov, A. A.; Shershnev, M. B.

    2018-01-01

    The aim of this work is the software implementation of three image scaling algorithms using parallel computations, as well as the development of an application with a graphical user interface for the Windows operating system to demonstrate the operation of algorithms and to study the relationship between system performance, algorithm execution time and the degree of parallelization of computations. Three methods of interpolation were studied, formalized and adapted to scale images. The result of the work is a program for scaling images by different methods. Comparison of the quality of scaling by different methods is given.

  8. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless test compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.

  9. Analysis of Monte Carlo accelerated iterative methods for sparse linear systems: Analysis of Monte Carlo accelerated iterative methods for sparse linear systems

    DOE PAGES

    Benzi, Michele; Evans, Thomas M.; Hamilton, Steven P.; ...

    2017-03-05

    Here, we consider hybrid deterministic-stochastic iterative algorithms for the solution of large, sparse linear systems. Starting from a convergent splitting of the coefficient matrix, we analyze various types of Monte Carlo acceleration schemes applied to the original preconditioned Richardson (stationary) iteration. We expect that these methods will have considerable potential for resiliency to faults when implemented on massively parallel machines. We also establish sufficient conditions for the convergence of the hybrid schemes, and we investigate different types of preconditioners including sparse approximate inverses. Numerical experiments on linear systems arising from the discretization of partial differential equations are presented.

  10. Flexbar 3.0 - SIMD and multicore parallelization.

    PubMed

    Roehr, Johannes T; Dieterich, Christoph; Reinert, Knut

    2017-09-15

    High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible. We present Flexbar 3.0, the successor of the popular program Flexbar. It employs now twofold parallelism: multi-threading and additionally SIMD vectorization. Both types of parallelism are used to speed-up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark. https://github.com/seqan/flexbar. johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  11. Parallel processing considerations for image recognition tasks

    NASA Astrophysics Data System (ADS)

    Simske, Steven J.

    2011-01-01

    Many image recognition tasks are well-suited to parallel processing. The most obvious example is that many imaging tasks require the analysis of multiple images. From this standpoint, then, parallel processing need be no more complicated than assigning individual images to individual processors. However, there are three less trivial categories of parallel processing that will be considered in this paper: parallel processing (1) by task; (2) by image region; and (3) by meta-algorithm. Parallel processing by task allows the assignment of multiple workflows-as diverse as optical character recognition [OCR], document classification and barcode reading-to parallel pipelines. This can substantially decrease time to completion for the document tasks. For this approach, each parallel pipeline is generally performing a different task. Parallel processing by image region allows a larger imaging task to be sub-divided into a set of parallel pipelines, each performing the same task but on a different data set. This type of image analysis is readily addressed by a map-reduce approach. Examples include document skew detection and multiple face detection and tracking. Finally, parallel processing by meta-algorithm allows different algorithms to be deployed on the same image simultaneously. This approach may result in improved accuracy.

  12. New NAS Parallel Benchmarks Results

    NASA Technical Reports Server (NTRS)

    Yarrow, Maurice; Saphir, William; VanderWijngaart, Rob; Woo, Alex; Kutler, Paul (Technical Monitor)

    1997-01-01

    NPB2 (NAS (NASA Advanced Supercomputing) Parallel Benchmarks 2) is an implementation, based on Fortran and the MPI (message passing interface) message passing standard, of the original NAS Parallel Benchmark specifications. NPB2 programs are run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB2 results complement, rather than replace, NPB results. Because they have not been optimized by vendors, NPB2 implementations approximate the performance a typical user can expect for a portable parallel program on distributed memory parallel computers. Together these results provide an insightful comparison of the real-world performance of high-performance computers. New NPB2 features: New implementation (CG), new workstation class problem sizes, new serial sample versions, more performance statistics.

  13. Expressing Parallelism with ROOT

    NASA Astrophysics Data System (ADS)

    Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

    2017-10-01

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  14. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  15. Parallel computing for probabilistic fatigue analysis

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

    1993-01-01

    This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.

  16. Exploiting Symmetry on Parallel Architectures.

    NASA Astrophysics Data System (ADS)

    Stiller, Lewis Benjamin

    1995-01-01

    This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry -exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered it a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.

  17. Architectures for reasoning in parallel

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.

    1989-01-01

    The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or less processors. Additionally, experiments have been run with a stripped down version of EMYCIN.

  18. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.

  19. An object-oriented approach to nested data parallelism

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Chatterjee, Siddhartha

    1994-01-01

    This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs sets of data called 'collections' and expresses parallelism as operations performed over the elements of a collection. When the elements of a collection are also collections, then there is the possibility for 'nested data parallelism.' Few current programming languages support nested data parallelism however. In an object-oriented framework, a collection is a single object. Its type defines the parallel operations that may be applied to it. Our goal is to design and build an object-oriented data-parallel programming environment supporting nested data parallelism. Our initial approach is built upon three fundamental additions to C++. We add new parallel base types by implementing them as classes, and add a new parallel collection type called a 'vector' that is implemented as a template. Only one new language feature is introduced: the 'foreach' construct, which is the basis for exploiting elementwise parallelism over collections. The strength of the method lies in the compilation strategy, which translates nested data-parallel C++ into ordinary C++. Extracting the potential parallelism in nested 'foreach' constructs is called 'flattening' nested parallelism. We show how to flatten 'foreach' constructs using a simple program transformation. Our prototype system produces vector code which has been successfully run on workstations, a CM-2, and a CM-5.

  20. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment.

    PubMed

    Lee, Wei-Po; Hsiao, Yu-Ting; Hwang, Wei-Che

    2014-01-16

    To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel computational framework, high

  1. Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment

    PubMed Central

    2014-01-01

    Background To improve the tedious task of reconstructing gene networks through testing experimentally the possible interactions between genes, it becomes a trend to adopt the automated reverse engineering procedure instead. Some evolutionary algorithms have been suggested for deriving network parameters. However, to infer large networks by the evolutionary algorithm, it is necessary to address two important issues: premature convergence and high computational cost. To tackle the former problem and to enhance the performance of traditional evolutionary algorithms, it is advisable to use parallel model evolutionary algorithms. To overcome the latter and to speed up the computation, it is advocated to adopt the mechanism of cloud computing as a promising solution: most popular is the method of MapReduce programming model, a fault-tolerant framework to implement parallel algorithms for inferring large gene networks. Results This work presents a practical framework to infer large gene networks, by developing and parallelizing a hybrid GA-PSO optimization method. Our parallel method is extended to work with the Hadoop MapReduce programming model and is executed in different cloud computing environments. To evaluate the proposed approach, we use a well-known open-source software GeneNetWeaver to create several yeast S. cerevisiae sub-networks and use them to produce gene profiles. Experiments have been conducted and the results have been analyzed. They show that our parallel approach can be successfully used to infer networks with desired behaviors and the computation time can be largely reduced. Conclusions Parallel population-based algorithms can effectively determine network parameters and they perform better than the widely-used sequential algorithms in gene network inference. These parallel algorithms can be distributed to the cloud computing environment to speed up the computation. By coupling the parallel model population-based optimization method and the parallel

  2. Performance Evaluation in Network-Based Parallel Computing

    NASA Technical Reports Server (NTRS)

    Dezhgosha, Kamyar

    1996-01-01

    Network-based parallel computing is emerging as a cost-effective alternative for solving many problems which require use of supercomputers or massively parallel computers. The primary objective of this project has been to conduct experimental research on performance evaluation for clustered parallel computing. First, a testbed was established by augmenting our existing SUNSPARCs' network with PVM (Parallel Virtual Machine) which is a software system for linking clusters of machines. Second, a set of three basic applications were selected. The applications consist of a parallel search, a parallel sort, a parallel matrix multiplication. These application programs were implemented in C programming language under PVM. Third, we conducted performance evaluation under various configurations and problem sizes. Alternative parallel computing models and workload allocations for application programs were explored. The performance metric was limited to elapsed time or response time which in the context of parallel computing can be expressed in terms of speedup. The results reveal that the overhead of communication latency between processes in many cases is the restricting factor to performance. That is, coarse-grain parallelism which requires less frequent communication between processes will result in higher performance in network-based computing. Finally, we are in the final stages of installing an Asynchronous Transfer Mode (ATM) switch and four ATM interfaces (each 155 Mbps) which will allow us to extend our study to newer applications, performance metrics, and configurations.

  3. Simulation Exploration through Immersive Parallel Planes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brunhart-Lupo, Nicholas J; Bush, Brian W; Gruchalla, Kenny M

    We present a visualization-driven simulation system that tightly couples systems dynamics simulations with an immersive virtual environment to allow analysts to rapidly develop and test hypotheses in a high-dimensional parameter space. To accomplish this, we generalize the two-dimensional parallel-coordinates statistical graphic as an immersive 'parallel-planes' visualization for multivariate time series emitted by simulations running in parallel with the visualization. In contrast to traditional parallel coordinate's mapping the multivariate dimensions onto coordinate axes represented by a series of parallel lines, we map pairs of the multivariate dimensions onto a series of parallel rectangles. As in the case of parallel coordinates, eachmore » individual observation in the dataset is mapped to a polyline whose vertices coincide with its coordinate values. Regions of the rectangles can be 'brushed' to highlight and select observations of interest: a 'slider' control allows the user to filter the observations by their time coordinate. In an immersive virtual environment, users interact with the parallel planes using a joystick that can select regions on the planes, manipulate selection, and filter time. The brushing and selection actions are used to both explore existing data as well as to launch additional simulations corresponding to the visually selected portions of the input parameter space. As soon as the new simulations complete, their resulting observations are displayed in the virtual environment. This tight feedback loop between simulation and immersive analytics accelerates users' realization of insights about the simulation and its output.« less

  4. Hypercluster Parallel Processor

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela

    1992-01-01

    Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.

  5. Parallel Implementation of a Frozen Flow Based Wavefront Reconstructor

    NASA Astrophysics Data System (ADS)

    Nagy, J.; Kelly, K.

    2013-09-01

    Obtaining high resolution images of space objects from ground based telescopes is challenging, often requiring the use of a multi-frame blind deconvolution (MFBD) algorithm to remove blur caused by atmospheric turbulence. In order for an MFBD algorithm to be effective, it is necessary to obtain a good initial estimate of the wavefront phase. Although wavefront sensors work well in low turbulence situations, they are less effective in high turbulence, such as when imaging in daylight, or when imaging objects that are close to the Earth's horizon. One promising approach, which has been shown to work very well in high turbulence settings, uses a frozen flow assumption on the atmosphere to capture the inherent temporal correlations present in consecutive frames of wavefront data. Exploiting these correlations can lead to more accurate estimation of the wavefront phase, and the associated PSF, which leads to more effective MFBD algorithms. However, with the current serial implementation, the approach can be prohibitively expensive in situations when it is necessary to use a large number of frames. In this poster we describe a parallel implementation that overcomes this constraint. The parallel implementation exploits sparse matrix computations, and uses the Trilinos package developed at Sandia National Laboratories. Trilinos provides a variety of core mathematical software for parallel architectures that have been designed using high quality software engineering practices, The package is open source, and portable to a variety of high-performance computing architectures.

  6. Neoclassical parallel flow calculation in the presence of external parallel momentum sources in Heliotron J

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nishioka, K.; Nakamura, Y.; Nishimura, S.

    A moment approach to calculate neoclassical transport in non-axisymmetric torus plasmas composed of multiple ion species is extended to include the external parallel momentum sources due to unbalanced tangential neutral beam injections (NBIs). The momentum sources that are included in the parallel momentum balance are calculated from the collision operators of background particles with fast ions. This method is applied for the clarification of the physical mechanism of the neoclassical parallel ion flows and the multi-ion species effect on them in Heliotron J NBI plasmas. It is found that parallel ion flow can be determined by the balance between themore » parallel viscosity and the external momentum source in the region where the external source is much larger than the thermodynamic force driven source in the collisional plasmas. This is because the friction between C{sup 6+} and D{sup +} prevents a large difference between C{sup 6+} and D{sup +} flow velocities in such plasmas. The C{sup 6+} flow velocities, which are measured by the charge exchange recombination spectroscopy system, are numerically evaluated with this method. It is shown that the experimentally measured C{sup 6+} impurity flow velocities do not contradict clearly with the neoclassical estimations, and the dependence of parallel flow velocities on the magnetic field ripples is consistent in both results.« less

  7. Construction, classification and parametrization of complex Hadamard matrices

    NASA Astrophysics Data System (ADS)

    Szöllősi, Ferenc

    To improve the design of nuclear systems, high-fidelity neutron fluxes are required. Leadership-class machines provide platforms on which very large problems can be solved. Computing such fluxes efficiently requires numerical methods with good convergence properties and algorithms that can scale to hundreds of thousands of cores. Many 3-D deterministic transport codes are decomposable in space and angle only, limiting them to tens of thousands of cores. Most codes rely on methods such as Gauss Seidel for fixed source problems and power iteration for eigenvalue problems, which can be slow to converge for challenging problems like those with highly scattering materials or high dominance ratios. Three methods have been added to the 3-D SN transport code Denovo that are designed to improve convergence and enable the full use of cutting-edge computers. The first is a multigroup Krylov solver that converges more quickly than Gauss Seidel and parallelizes the code in energy such that Denovo can use hundreds of thousand of cores effectively. The second is Rayleigh quotient iteration (RQI), an old method applied in a new context. This eigenvalue solver finds the dominant eigenvalue in a mathematically optimal way and should converge in fewer iterations than power iteration. RQI creates energy-block-dense equations that the new Krylov solver treats efficiently. However, RQI can have convergence problems because it creates poorly conditioned systems. This can be overcome with preconditioning. The third method is a multigrid-in-energy preconditioner. The preconditioner takes advantage of the new energy decomposition because the grids are in energy rather than space or angle. The preconditioner greatly reduces iteration count for many problem types and scales well in energy. It also allows RQI to be successful for problems it could not solve otherwise. The methods added to Denovo accomplish the goals of this work. They converge in fewer iterations than traditional methods and

  8. Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.

    PubMed

    Bhandarkar, S M; Chirravuri, S; Arnold, J

    1996-01-01

    Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.

  9. Fast Multilevel Solvers for a Class of Discrete Fourth Order Parabolic Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zheng, Bin; Chen, Luoping; Hu, Xiaozhe

    2016-03-05

    In this paper, we study fast iterative solvers for the solution of fourth order parabolic equations discretized by mixed finite element methods. We propose to use consistent mass matrix in the discretization and use lumped mass matrix to construct efficient preconditioners. We provide eigenvalue analysis for the preconditioned system and estimate the convergence rate of the preconditioned GMRes method. Furthermore, we show that these preconditioners only need to be solved inexactly by optimal multigrid algorithms. Our numerical examples indicate that the proposed preconditioners are very efficient and robust with respect to both discretization parameters and diffusion coefficients. We also investigatemore » the performance of multigrid algorithms with either collective smoothers or distributive smoothers when solving the preconditioner systems.« less

  10. Parallel Numerical Simulations of Water Reservoirs

    NASA Astrophysics Data System (ADS)

    Torres, Pedro; Mangiavacchi, Norberto

    2010-11-01

    The study of the water flow and scalar transport in water reservoirs is important for the determination of the water quality during the initial stages of the reservoir filling and during the life of the reservoir. For this scope, a parallel 2D finite element code for solving the incompressible Navier-Stokes equations coupled with scalar transport was implemented using the message-passing programming model, in order to perform simulations of hidropower water reservoirs in a computer cluster environment. The spatial discretization is based on the MINI element that satisfies the Babuska-Brezzi (BB) condition, which provides sufficient conditions for a stable mixed formulation. All the distributed data structures needed in the different stages of the code, such as preprocessing, solving and post processing, were implemented using the PETSc library. The resulting linear systems for the velocity and the pressure fields were solved using the projection method, implemented by an approximate block LU factorization. In order to increase the parallel performance in the solution of the linear systems, we employ the static condensation method for solving the intermediate velocity at vertex and centroid nodes separately. We compare performance results of the static condensation method with the approach of solving the complete system. In our tests the static condensation method shows better performance for large problems, at the cost of an increased memory usage. Performance results for other intensive parts of the code in a computer cluster are also presented.

  11. Long-time atomistic simulations with the Parallel Replica Dynamics method

    NASA Astrophysics Data System (ADS)

    Perez, Danny

    Molecular Dynamics (MD) -- the numerical integration of atomistic equations of motion -- is a workhorse of computational materials science. Indeed, MD can in principle be used to obtain any thermodynamic or kinetic quantity, without introducing any approximation or assumptions beyond the adequacy of the interaction potential. It is therefore an extremely powerful and flexible tool to study materials with atomistic spatio-temporal resolution. These enviable qualities however come at a steep computational price, hence limiting the system sizes and simulation times that can be achieved in practice. While the size limitation can be efficiently addressed with massively parallel implementations of MD based on spatial decomposition strategies, allowing for the simulation of trillions of atoms, the same approach usually cannot extend the timescales much beyond microseconds. In this article, we discuss an alternative parallel-in-time approach, the Parallel Replica Dynamics (ParRep) method, that aims at addressing the timescale limitation of MD for systems that evolve through rare state-to-state transitions. We review the formal underpinnings of the method and demonstrate that it can provide arbitrarily accurate results for any definition of the states. When an adequate definition of the states is available, ParRep can simulate trajectories with a parallel speedup approaching the number of replicas used. We demonstrate the usefulness of ParRep by presenting different examples of materials simulations where access to long timescales was essential to access the physical regime of interest and discuss practical considerations that must be addressed to carry out these simulations. Work supported by the United States Department of Energy (U.S. DOE), Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division.

  12. Parallel transmission RF pulse design for eddy current correction at ultra high field.

    PubMed

    Zheng, Hai; Zhao, Tiejun; Qian, Yongxian; Ibrahim, Tamer; Boada, Fernando

    2012-08-01

    Multidimensional spatially selective RF pulses have been used in MRI applications such as B₁ and B₀ inhomogeneities mitigation. However, the long pulse duration has limited their practical applications. Recently, theoretical and experimental studies have shown that parallel transmission can effectively shorten pulse duration without sacrificing the quality of the excitation pattern. Nonetheless, parallel transmission with accelerated pulses can be severely impeded by hardware and/or system imperfections. One of such imperfections is the effect of the eddy current field. In this paper, we first show the effects of the eddy current field on the excitation pattern and then report an RF pulse the design method to correct eddy current fields caused by the RF coil and the gradient system. Experimental results on a 7 T human eight-channel parallel transmit system show substantial improvements on excitation patterns with the use of eddy current correction. Moreover, the proposed model-based correction method not only demonstrates comparable excitation patterns as the trajectory measurement method, but also significantly improves time efficiency. Copyright © 2012. Published by Elsevier Inc.

  13. Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface of a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Archer, Charles J; Blocksome, Michael A; Cernohous, Bob R

    Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with themore » endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.« less

  14. Expressing Parallelism with ROOT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Piparo, D.; Tejedor, E.; Guiraud, E.

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module inmore » Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.« less

  15. Parallel programming of industrial applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heroux, M; Koniges, A; Simon, H

    1998-07-21

    In the introductory material, we overview the typical MPP environment for real application computing and the special tools available such as parallel debuggers and performance analyzers. Next, we draw from a series of real applications codes and discuss the specific challenges and problems that are encountered in parallelizing these individual applications. The application areas drawn from include biomedical sciences, materials processing and design, plasma and fluid dynamics, and others. We show how it was possible to get a particular application to run efficiently and what steps were necessary. Finally we end with a summary of the lessons learned from thesemore » applications and predictions for the future of industrial parallel computing. This tutorial is based on material from a forthcoming book entitled: "Industrial Strength Parallel Computing" to be published by Morgan Kaufmann Publishers (ISBN l-55860-54).« less

  16. Shift-and-invert parallel spectral transformation eigensolver: Massively parallel performance for density-functional based tight-binding

    DOE PAGES

    Zhang, Hong; Zapol, Peter; Dixon, David A.; ...

    2015-11-17

    The Shift-and-invert parallel spectral transformations (SIPs), a computational approach to solve sparse eigenvalue problems, is developed for massively parallel architectures with exceptional parallel scalability and robustness. The capabilities of SIPs are demonstrated by diagonalization of density-functional based tight-binding (DFTB) Hamiltonian and overlap matrices for single-wall metallic carbon nanotubes, diamond nanowires, and bulk diamond crystals. The largest (smallest) example studied is a 128,000 (2000) atom nanotube for which ~330,000 (~5600) eigenvalues and eigenfunctions are obtained in ~190 (~5) seconds when parallelized over 266,144 (16,384) Blue Gene/Q cores. Weak scaling and strong scaling of SIPs are analyzed and the performance of SIPsmore » is compared with other novel methods. Different matrix ordering methods are investigated to reduce the cost of the factorization step, which dominates the time-to-solution at the strong scaling limit. As a result, a parallel implementation of assembling the density matrix from the distributed eigenvectors is demonstrated.« less

  17. Shift-and-invert parallel spectral transformation eigensolver: Massively parallel performance for density-functional based tight-binding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Hong; Zapol, Peter; Dixon, David A.

    The Shift-and-invert parallel spectral transformations (SIPs), a computational approach to solve sparse eigenvalue problems, is developed for massively parallel architectures with exceptional parallel scalability and robustness. The capabilities of SIPs are demonstrated by diagonalization of density-functional based tight-binding (DFTB) Hamiltonian and overlap matrices for single-wall metallic carbon nanotubes, diamond nanowires, and bulk diamond crystals. The largest (smallest) example studied is a 128,000 (2000) atom nanotube for which ~330,000 (~5600) eigenvalues and eigenfunctions are obtained in ~190 (~5) seconds when parallelized over 266,144 (16,384) Blue Gene/Q cores. Weak scaling and strong scaling of SIPs are analyzed and the performance of SIPsmore » is compared with other novel methods. Different matrix ordering methods are investigated to reduce the cost of the factorization step, which dominates the time-to-solution at the strong scaling limit. As a result, a parallel implementation of assembling the density matrix from the distributed eigenvectors is demonstrated.« less

  18. Parallel Event Analysis Under Unix

    NASA Astrophysics Data System (ADS)

    Looney, S.; Nilsson, B. S.; Oest, T.; Pettersson, T.; Ranjard, F.; Thibonnier, J.-P.

    The ALEPH experiment at LEP, the CERN CN division and Digital Equipment Corp. have, in a joint project, developed a parallel event analysis system. The parallel physics code is identical to ALEPH's standard analysis code, ALPHA, only the organisation of input/output is changed. The user may switch between sequential and parallel processing by simply changing one input "card". The initial implementation runs on an 8-node DEC 3000/400 farm, using the PVM software, and exhibits a near-perfect speed-up linearity, reducing the turn-around time by a factor of 8.

  19. Collectively loading an application in a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.

    Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.

  20. On polynomial preconditioning for indefinite Hermitian matrices

    NASA Technical Reports Server (NTRS)

    Freund, Roland W.

    1989-01-01

    The minimal residual method is studied combined with polynomial preconditioning for solving large linear systems (Ax = b) with indefinite Hermitian coefficient matrices (A). The standard approach for choosing the polynomial preconditioners leads to preconditioned systems which are positive definite. Here, a different strategy is studied which leaves the preconditioned coefficient matrix indefinite. More precisely, the polynomial preconditioner is designed to cluster the positive, resp. negative eigenvalues of A around 1, resp. around some negative constant. In particular, it is shown that such indefinite polynomial preconditioners can be obtained as the optimal solutions of a certain two parameter family of Chebyshev approximation problems. Some basic results are established for these approximation problems and a Remez type algorithm is sketched for their numerical solution. The problem of selecting the parameters such that the resulting indefinite polynomial preconditioners speeds up the convergence of minimal residual method optimally is also addressed. An approach is proposed based on the concept of asymptotic convergence factors. Finally, some numerical examples of indefinite polynomial preconditioners are given.

  1. Parallel-In-Time For Moving Meshes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Falgout, R. D.; Manteuffel, T. A.; Southworth, B.

    2016-02-04

    With steadily growing computational resources available, scientists must develop e ective ways to utilize the increased resources. High performance, highly parallel software has be- come a standard. However until recent years parallelism has focused primarily on the spatial domain. When solving a space-time partial di erential equation (PDE), this leads to a sequential bottleneck in the temporal dimension, particularly when taking a large number of time steps. The XBraid parallel-in-time library was developed as a practical way to add temporal parallelism to existing se- quential codes with only minor modi cations. In this work, a rezoning-type moving mesh is appliedmore » to a di usion problem and formulated in a parallel-in-time framework. Tests and scaling studies are run using XBraid and demonstrate excellent results for the simple model problem considered herein.« less

  2. Integrated Task And Data Parallel Programming: Language Design

    NASA Technical Reports Server (NTRS)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated

  3. Performance of the Galley Parallel File System

    NASA Technical Reports Server (NTRS)

    Nieuwejaar, Nils; Kotz, David

    1996-01-01

    As the input/output (I/O) needs of parallel scientific applications increase, file systems for multiprocessors are being designed to provide applications with parallel access to multiple disks. Many parallel file systems present applications with a conventional Unix-like interface that allows the application to access multiple disks transparently. This interface conceals the parallism within the file system, which increases the ease of programmability, but makes it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. Furthermore, most current parallel file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic parallel workloads. Initial experiments, reported in this paper, indicate that Galley is capable of providing high-performance 1/O to applications the applications that rely on them. In Section 3 we describe that access data in patterns that have been observed to be common.

  4. Parallel computing on Unix workstation arrays

    NASA Astrophysics Data System (ADS)

    Reale, F.; Bocchino, F.; Sciortino, S.

    1994-12-01

    We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by exucuting it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.

  5. Massively Parallel Dantzig-Wolfe Decomposition Applied to Traffic Flow Scheduling

    NASA Technical Reports Server (NTRS)

    Rios, Joseph Lucio; Ross, Kevin

    2009-01-01

    Optimal scheduling of air traffic over the entire National Airspace System is a computationally difficult task. To speed computation, Dantzig-Wolfe decomposition is applied to a known linear integer programming approach for assigning delays to flights. The optimization model is proven to have the block-angular structure necessary for Dantzig-Wolfe decomposition. The subproblems for this decomposition are solved in parallel via independent computation threads. Experimental evidence suggests that as the number of subproblems/threads increases (and their respective sizes decrease), the solution quality, convergence, and runtime improve. A demonstration of this is provided by using one flight per subproblem, which is the finest possible decomposition. This results in thousands of subproblems and associated computation threads. This massively parallel approach is compared to one with few threads and to standard (non-decomposed) approaches in terms of solution quality and runtime. Since this method generally provides a non-integral (relaxed) solution to the original optimization problem, two heuristics are developed to generate an integral solution. Dantzig-Wolfe followed by these heuristics can provide a near-optimal (sometimes optimal) solution to the original problem hundreds of times faster than standard (non-decomposed) approaches. In addition, when massive decomposition is employed, the solution is shown to be more likely integral, which obviates the need for an integerization step. These results indicate that nationwide, real-time, high fidelity, optimal traffic flow scheduling is achievable for (at least) 3 hour planning horizons.

  6. Two variants of minimum discarded fill ordering

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    D'Azevedo, E.F.; Forsyth, P.A.; Tang, Wei-Pai

    1991-01-01

    It is well known that the ordering of the unknowns can have a significant effect on the convergence of Preconditioned Conjugate Gradient (PCG) methods. There has been considerable experimental work on the effects of ordering for regular finite difference problems. In many cases, good results have been obtained with preconditioners based on diagonal, spiral or natural row orderings. However, for finite element problems having unstructured grids or grids generated by a local refinement approach, it is difficult to define many of the orderings for more regular problems. A recently proposed Minimum Discarded Fill (MDF) ordering technique is effective in findingmore » high quality Incomplete LU (ILU) preconditioners, especially for problems arising from unstructured finite element grids. Testing indicates this algorithm can identify a rather complicated physical structure in an anisotropic problem and orders the unknowns in the preferred'' direction. The MDF technique may be viewed as the numerical analogue of the minimum deficiency algorithm in sparse matrix technology. At any stage of the partial elimination, the MDF technique chooses the next pivot node so as to minimize the amount of discarded fill. In this work, two efficient variants of the MDF technique are explored to produce cost-effective high-order ILU preconditioners. The Threshold MDF orderings combine MDF ideas with drop tolerance techniques to identify the sparsity pattern in the ILU preconditioners. These techniques identify an ordering that encourages fast decay of the entries in the ILU factorization. The Minimum Update Matrix (MUM) ordering technique is a simplification of the MDF ordering and is closely related to the minimum degree algorithm. The MUM ordering is especially for large problems arising from Navier-Stokes problems. Some interesting pictures of the orderings are presented using a visualization tool. 22 refs., 4 figs., 7 tabs.« less

  7. Revisiting Parallel Cyclic Reduction and Parallel Prefix-Based Algorithms for Block Tridiagonal System of Equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seal, Sudip K; Perumalla, Kalyan S; Hirshman, Steven Paul

    2013-01-01

    Simulations that require solutions of block tridiagonal systems of equations rely on fast parallel solvers for runtime efficiency. Leading parallel solvers that are highly effective for general systems of equations, dense or sparse, are limited in scalability when applied to block tridiagonal systems. This paper presents scalability results as well as detailed analyses of two parallel solvers that exploit the special structure of block tridiagonal matrices to deliver superior performance, often by orders of magnitude. A rigorous analysis of their relative parallel runtimes is shown to reveal the existence of a critical block size that separates the parameter space spannedmore » by the number of block rows, the block size and the processor count, into distinct regions that favor one or the other of the two solvers. Dependence of this critical block size on the above parameters as well as on machine-specific constants is established. These formal insights are supported by empirical results on up to 2,048 cores of a Cray XT4 system. To the best of our knowledge, this is the highest reported scalability for parallel block tridiagonal solvers to date.« less

  8. The role of grape seed extract in the remineralization of demineralized dentine: micromorphological and physical analyses.

    PubMed

    Tang, Cheng-fang; Fang, Ming; Liu, Rui-rui; Dou, Qi; Chai, Zhi-guo; Xiao, Yu-hong; Chen, Ji-hua

    2013-12-01

    Grape seed extract (GSE) is known to have a positive effect on the demineralization and/or remineralization of artificial root caries lesions. The present study aimed to investigate whether biomodification of caries-like acid-etched demineralized dentine, using proanthocyanidins-rich GSE, would promote its remineralization potential. Dentine specimens were acid-etched for 30s, then biomodified using proanthocyanidin-based preconditioners (at different concentrations and pH values) for 2min, followed by a 15-day artificial remineralization regimen. They were subsequently subjected to microhardness measurements, micromorphological evaluation and X-ray diffraction analyses. Stability of the preconditioners was also analyzed, spectrophotometrically. A concentration-dependent increase was observed in the microhardness of the specimens that were biomodified using GSE preconditioners, without pH adjustment. Field emission scanning electron microscopy revealed greater mineral deposition on their surfaces, which was further identified mainly as hydroxylapatite. The absorbances of preconditioner dilutions at pH 7.4 and pH 10.0 decreased at the two typical polyphenol bands. Transient GSE biomodification promoted remineralization on the surface of demineralized dentine, and this process was influenced by the concentration and pH value of the preconditioner. GSE preconditioner at a concentration of 15%, without pH adjustment, presented with the best results, and this may be attributed to its high polyphenolic content. Copyright © 2013 Elsevier Ltd. All rights reserved.

  9. Automatic Multilevel Parallelization Using OpenMP

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Yan, Jerry; Ayguade, Eduard; Gonzalez, Marc; Martorell, Xavier; Biegel, Bryan (Technical Monitor)

    2002-01-01

    In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.

  10. Massively parallel whole genome amplification for single-cell sequencing using droplet microfluidics.

    PubMed

    Hosokawa, Masahito; Nishikawa, Yohei; Kogawa, Masato; Takeyama, Haruko

    2017-07-12

    Massively parallel single-cell genome sequencing is required to further understand genetic diversities in complex biological systems. Whole genome amplification (WGA) is the first step for single-cell sequencing, but its throughput and accuracy are insufficient in conventional reaction platforms. Here, we introduce single droplet multiple displacement amplification (sd-MDA), a method that enables massively parallel amplification of single cell genomes while maintaining sequence accuracy and specificity. Tens of thousands of single cells are compartmentalized in millions of picoliter droplets and then subjected to lysis and WGA by passive droplet fusion in microfluidic channels. Because single cells are isolated in compartments, their genomes are amplified to saturation without contamination. This enables the high-throughput acquisition of contamination-free and cell specific sequence reads from single cells (21,000 single-cells/h), resulting in enhancement of the sequence data quality compared to conventional methods. This method allowed WGA of both single bacterial cells and human cancer cells. The obtained sequencing coverage rivals those of conventional techniques with superior sequence quality. In addition, we also demonstrate de novo assembly of uncultured soil bacteria and obtain draft genomes from single cell sequencing. This sd-MDA is promising for flexible and scalable use in single-cell sequencing.

  11. Improved parallel image reconstruction using feature refinement.

    PubMed

    Cheng, Jing; Jia, Sen; Ying, Leslie; Liu, Yuanyuan; Wang, Shanshan; Zhu, Yanjie; Li, Ye; Zou, Chao; Liu, Xin; Liang, Dong

    2018-07-01

    The aim of this study was to develop a novel feature refinement MR reconstruction method from highly undersampled multichannel acquisitions for improving the image quality and preserve more detail information. The feature refinement technique, which uses a feature descriptor to pick up useful features from residual image discarded by sparsity constrains, is applied to preserve the details of the image in compressed sensing and parallel imaging in MRI (CS-pMRI). The texture descriptor and structure descriptor recognizing different types of features are required for forming the feature descriptor. Feasibility of the feature refinement was validated using three different multicoil reconstruction methods on in vivo data. Experimental results show that reconstruction methods with feature refinement improve the quality of reconstructed image and restore the image details more accurately than the original methods, which is also verified by the lower values of the root mean square error and high frequency error norm. A simple and effective way to preserve more useful detailed information in CS-pMRI is proposed. This technique can effectively improve the reconstruction quality and has superior performance in terms of detail preservation compared with the original version without feature refinement. Magn Reson Med 80:211-223, 2018. © 2017 International Society for Magnetic Resonance in Medicine. © 2017 International Society for Magnetic Resonance in Medicine.

  12. A parallel variable metric optimization algorithm

    NASA Technical Reports Server (NTRS)

    Straeter, T. A.

    1973-01-01

    An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.

  13. Genetic Parallel Programming: design and implementation.

    PubMed

    Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

    2006-01-01

    This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.

  14. Parallel programming with Easy Java Simulations

    NASA Astrophysics Data System (ADS)

    Esquembre, F.; Christian, W.; Belloni, M.

    2018-01-01

    Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a java-based programming environment to treat problems in the usual undergraduate curriculum. We use the easy java simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.

  15. Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; Manipon, G.; Hua, H.

    2012-12-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within Sci

  16. Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud

    NASA Astrophysics Data System (ADS)

    Wilson, B.; Manipon, G.; Hua, H.

    2012-04-01

    NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within Sci

  17. Parallel auto-correlative statistics with VTK.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pebay, Philippe Pierre; Bennett, Janine Camille

    2013-08-01

    This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the autocorrelative statistics engine.

  18. Xyce parallel electronic simulator : users' guide.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.

    2011-05-01

    This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a

  19. Capabilities of Fully Parallelized MHD Stability Code MARS

    NASA Astrophysics Data System (ADS)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2016-10-01

    Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.

  20. When the lowest energy does not induce native structures: parallel minimization of multi-energy values by hybridizing searching intelligences.

    PubMed

    Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

    2012-01-01

    Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to obtain. A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed during the searching threads are running in parallel, while each thread is searching the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving PSP problem. This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise.

  1. When the Lowest Energy Does Not Induce Native Structures: Parallel Minimization of Multi-Energy Values by Hybridizing Searching Intelligences

    PubMed Central

    Lü, Qiang; Xia, Xiao-Yan; Chen, Rong; Miao, Da-Jun; Chen, Sha-Sha; Quan, Li-Jun; Li, Hai-Ou

    2012-01-01

    Background Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to obtain. Results A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed during the searching threads are running in parallel, while each thread is searching the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving PSP problem. Conclusions This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise. PMID:23028708

  2. Implementation and performance of parallel Prolog interpreter

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wei, S.; Kale, L.V.; Balkrishna, R.

    1988-01-01

    In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE--OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare-kernel--a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark pargrams on parallel machines including shared memory systems: an Alliant FX/8, Sequent and a MultiMax, and a non-shared memory systems: Intel iPSC/32 hypercube, in addition to its performance on a multiprocessor simulation system.

  3. Vortex phase diagram of the layered superconductor Cu0.03TaS2 for H is parallel to c.

    PubMed

    Zhu, X D; Lu, J C; Sun, Y P; Pi, L; Qu, Z; Ling, L S; Yang, Z R; Zhang, Y H

    2010-12-22

    The magnetization and anisotropic electrical transport properties have been measured in high quality Cu(0.03)TaS(2) single crystals. A pronounced peak effect has been observed, indicating that high quality and homogeneity are vital to the peak effect. A kink has been observed in the magnetic field, H, dependence of the in-plane resistivity ρ(ab) for H is parallel to c, which corresponds to a transition from activated to diffusive behavior of the vortex liquid phase. In the diffusive regime of the vortex liquid phase, the in-plane resistivity ρ(ab) is proportional to H(0.3), which does not follow the Bardeen-Stephen law for free flux flow. Finally, a simplified vortex phase diagram of Cu(0.03)TaS(2) for H is parallel to c is given.

  4. Force user's manual: A portable, parallel FORTRAN

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

    1990-01-01

    The use of Force, a parallel, portable FORTRAN on shared memory parallel computers is described. Force simplifies writing code for parallel computers and, once the parallel code is written, it is easily ported to computers on which Force is installed. Although Force is nearly the same for all computers, specific details are included for the Cray-2, Cray-YMP, Convex 220, Flex/32, Encore, Sequent, Alliant computers on which it is installed.

  5. Effective Parallel Algorithm Animation

    DTIC Science & Technology

    1994-03-01

    parallel computer. The system incorporates the 14 Parallel Processing System us" r User User UMe PMwuM Progra Propu Plropm ýData Dots Data Daft...that produce meaningful animations. The following sections outline characteristics 146 Animation 0 71 r 40 02 I 5 * *2! 4 Idle Bu~sy Send Recv 7...Event Simulation. Technical Report, Georgia Institute of Technology, 1992. 22. Garey, Michael R . and David S. Johnson. Computers and Intractability: A

  6. Multiradio Resource Management: Parallel Transmission for Higher Throughput?

    NASA Astrophysics Data System (ADS)

    Bazzi, Alessandro; Pasolini, Gianni; Andrisano, Oreste

    2008-12-01

    Mobile communication systems beyond the third generation will see the interconnection of heterogeneous radio access networks (UMTS, WiMax, wireless local area networks, etc.) in order to always provide the best quality of service (QoS) to users with multimode terminals. This scenario poses a number of critical issues, which have to be faced in order to get the best from the integrated access network. In this paper, we will investigate the issue of parallel transmission over multiple radio access technologies (RATs), focusing the attention on the QoS perceived by final users. We will show that the achievement of a real benefit from parallel transmission over multiple RATs is conditioned to the fulfilment of some requirements related to the kind of RATs, the multiradio resource management (MRRM) strategy, and the transport-level protocol behaviour. All these aspects will be carefully considered in our investigation, which will be carried out partly adopting an analytical approach and partly by means of simulations. In this paper, in particular, we will propose a simple but effective MRRM algorithm, whose performance will be investigated in IEEE802.11a-UMTS and IEEE802.11a-IEEE802.16e heterogeneous networks (adopted as case studies).

  7. Parallel Algorithms for the Exascale Era

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Robey, Robert W.

    New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this workmore » has been done by undergraduates and published in leading scientific journals.« less

  8. Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shadid, John Nicolas; Lin, Paul Tinphone

    2009-01-01

    This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling andmore » multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.« less

  9. Learning in Parallel: Using Parallel Corpora to Enhance Written Language Acquisition at the Beginning Level

    ERIC Educational Resources Information Center

    Bluemel, Brody

    2014-01-01

    This article illustrates the pedagogical value of incorporating parallel corpora in foreign language education. It explores the development of a Chinese/English parallel corpus designed specifically for pedagogical application. The corpus tool was created to aid language learners in reading comprehension and writing development by making foreign…

  10. Graph Embedding Techniques for Bounding Condition Numbers of Incomplete Factor Preconditioning

    NASA Technical Reports Server (NTRS)

    Guattery, Stephen

    1997-01-01

    We extend graph embedding techniques for bounding the spectral condition number of preconditioned systems involving symmetric, irreducibly diagonally dominant M-matrices to systems where the preconditioner is not diagonally dominant. In particular, this allows us to bound the spectral condition number when the preconditioner is based on an incomplete factorization. We provide a review of previous techniques, describe our extension, and give examples both of a bound for a model problem, and of ways in which our techniques give intuitive way of looking at incomplete factor preconditioners.

  11. Collisions between quasi-parallel shocks

    NASA Technical Reports Server (NTRS)

    Cargill, Peter J.

    1991-01-01

    The collision between pairs of quasi-parallel shocks is examined using hybrid numerical simulations. In the interaction, the two shocks are transmitted through each other leaving behind a hot plasma with a population of particles with energies in excess of 40 E0, where E0 is the kinetic energy of particles in the shock frame prior to the collision. The energization is more efficient for quasi-parallel shocks than parallel shocks. Collisions between shocks of equal strengths are more efficient than those that are unequal. The results are of importance for phenomena during the impulsive phase of solar flares, in the distant solar wind and at planetary bow shocks.

  12. Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism

    ERIC Educational Resources Information Center

    Agarwal, Mayank

    2009-01-01

    The shift of the microprocessor industry towards multicore architectures has placed a huge burden on the programmers by requiring explicit parallelization for performance. Implicit Parallelization is an alternative that could ease the burden on programmers by parallelizing applications "under the covers" while maintaining sequential semantics…

  13. Bayer image parallel decoding based on GPU

    NASA Astrophysics Data System (ADS)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.

  14. Impact on quality of life of a nursing intervention programme for patients with chronic non-cancer pain: an open, randomized controlled parallel study protocol.

    PubMed

    Morales-Fernandez, Angeles; Morales-Asencio, Jose Miguel; Canca-Sanchez, Jose Carlos; Moreno-Martin, Gabriel; Vergara-Romero, Manuel

    2016-05-01

    To determine the effect of a nurse-led intervention programme for patients with chronic non-cancer pain. Chronic non-cancer pain is a widespread health problem and one that is insufficiently controlled. Nurses can play a vital role in pain management, using best practices in the assessment and management of pain under a holistic approach where the patient plays a proactive role in addressing the disease process. Improving the quality of life, reducing disability, achieving acceptance of health status, coping and breaking the vicious circle of pain should be the prime objectives of our care management programme. Open randomized parallel controlled study. The experimental group will undertake one single initial session, followed by six group sessions led by nurses, aimed at empowering patients for the self-management of pain. Healthy behaviours will be encouraged, such as sleep and postural hygiene, promotion of physical activity and healthy eating. Educational interventions on self-esteem, pain-awareness, communication and relaxing techniques will be carried out. As primary end points, quality of life, perceived level of pain, anxiety and depression will be evaluated. Secondary end points will be coping and satisfaction. Follow-up will be performed at 12 and 24 weeks. The study was approved by the Ethics and Research Committee Costa del Sol. If significant effects were detected, impact on quality of life through a nurse-led programme would offer a complementary service to existing pain clinics for a group of patients with frequent unmet needs. © 2016 John Wiley & Sons Ltd.

  15. Renal magnetic resonance angiography at 3.0 Tesla using a 32-element phased-array coil system and parallel imaging in 2 directions.

    PubMed

    Fenchel, Michael; Nael, Kambiz; Deshpande, Vibhas S; Finn, J Paul; Kramer, Ulrich; Miller, Stephan; Ruehm, Stefan; Laub, Gerhard

    2006-09-01

    The aim of the present study was to assess the feasibility of renal magnetic resonance angiography at 3.0 T using a phased-array coil system with 32-coil elements. Specifically, high parallel imaging factors were used for an increased spatial resolution and anatomic coverage of the whole abdomen. Signal-to-noise values and the g-factor distribution of the 32 element coil were examined in phantom studies for the magnetic resonance angiography (MRA) sequence. Eleven volunteers (6 men, median age of 30.0 years) were examined on a 3.0-T MR scanner (Magnetom Trio, Siemens Medical Solutions, Malvern, PA) using a 32-element phased-array coil (prototype from In vivo Corp.). Contrast-enhanced 3D-MRA (TR 2.95 milliseconds, TE 1.12 milliseconds, flip angle 25-30 degrees , bandwidth 650 Hz/pixel) was acquired with integrated generalized autocalibrating partially parallel acquisition (GRAPPA), in both phase- and slice-encoding direction. Images were assessed by 2 independent observers with regard to image quality, noise and presence of artifacts. Signal-to-noise levels of 22.2 +/- 22.0 and 57.9 +/- 49.0 were measured with (GRAPPAx6) and without parallel-imaging, respectively. The mean g-factor of the 32-element coil for GRAPPA with an acceleration of 3 and 2 in the phase-encoding and slice-encoding direction, respectively, was 1.61. High image quality was found in 9 of 11 volunteers (2.6 +/- 0.8) with good overall interobserver agreement (k = 0.87). Relatively low image quality with higher noise levels were encountered in 2 volunteers. MRA at 3.0 T using a 32-element phased-array coil is feasible in healthy volunteers. High diagnostic image quality and extended anatomic coverage could be achieved with application of high parallel imaging factors.

  16. Parallel machine architecture and compiler design facilities

    NASA Technical Reports Server (NTRS)

    Kuck, David J.; Yew, Pen-Chung; Padua, David; Sameh, Ahmed; Veidenbaum, Alex

    1990-01-01

    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role.

  17. A Tutorial on Parallel and Concurrent Programming in Haskell

    NASA Astrophysics Data System (ADS)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.

  18. Parallel Computing Using Web Servers and "Servlets".

    ERIC Educational Resources Information Center

    Lo, Alfred; Bloor, Chris; Choi, Y. K.

    2000-01-01

    Describes parallel computing and presents inexpensive ways to implement a virtual parallel computer with multiple Web servers. Highlights include performance measurement of parallel systems; models for using Java and intranet technology including single server, multiple clients and multiple servers, single client; and a comparison of CGI (common…

  19. Smoldyn on graphics processing units: massively parallel Brownian dynamics simulations.

    PubMed

    Dematté, Lorenzo

    2012-01-01

    Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms able to cope with space is becoming more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transportation phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation could be time consuming, especially if we want to capture the systems behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver the promise done by systems biology to be able to understand a system as whole, we need to scale up the size of models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely diffused algorithm for stochastic simulation of chemical reactions with spatial resolution and single molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computational demanding steps (computation of diffusion, unimolecular, and bimolecular reaction, as well as the most common cases of molecule-surface interaction) on the GPU, computing them in parallel on each molecule of the system. The implementation offers good speed-ups and real time, high quality graphics output

  20. Array-based Hierarchical Mesh Generation in Parallel

    DOE PAGES

    Ray, Navamita; Grindeanu, Iulian; Zhao, Xinglin; ...

    2015-11-03

    In this paper, we describe an array-based hierarchical mesh generation capability through uniform refinement of unstructured meshes for efficient solution of PDE's using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial mesh that can be used for a number of purposes such as multi-level methods to generating large meshes. The capability is developed under the parallel mesh framework “Mesh Oriented dAtaBase” a.k.a MOAB. We describe the underlying data structures and algorithms to generate such hierarchies and present numerical results for computational efficiency and mesh quality. Inmore » conclusion, we also present results to demonstrate the applicability of the developed capability to a multigrid finite-element solver.« less

  1. Parallel closure theory for toroidally confined plasmas

    NASA Astrophysics Data System (ADS)

    Ji, Jeong-Young; Held, Eric D.

    2017-10-01

    We solve a system of general moment equations to obtain parallel closures for electrons and ions in an axisymmetric toroidal magnetic field. Magnetic field gradient terms are kept and treated using the Fourier series method. Assuming lowest order density (pressure) and temperature to be flux labels, the parallel heat flow, friction, and viscosity are expressed in terms of radial gradients of the lowest-order temperature and pressure, parallel gradients of temperature and parallel flow, and the relative electron-ion parallel flow velocity. Convergence of closure quantities is demonstrated as the number of moments and Fourier modes are increased. Properties of the moment equations in the collisionless limit are also discussed. Combining closures with fluid equations parallel mass flow and electric current are also obtained. Work in collaboration with the PSI Center and supported by the U.S. DOE under Grant Nos. DE-SC0014033, DE-SC0016256, and DE-FG02-04ER54746.

  2. Linearly exact parallel closures for slab geometry

    NASA Astrophysics Data System (ADS)

    Ji, Jeong-Young; Held, Eric D.; Jhang, Hogun

    2013-08-01

    Parallel closures are obtained by solving a linearized kinetic equation with a model collision operator using the Fourier transform method. The closures expressed in wave number space are exact for time-dependent linear problems to within the limits of the model collision operator. In the adiabatic, collisionless limit, an inverse Fourier transform is performed to obtain integral (nonlocal) parallel closures in real space; parallel heat flow and viscosity closures for density, temperature, and flow velocity equations replace Braginskii's parallel closure relations, and parallel flow velocity and heat flow closures for density and temperature equations replace Spitzer's parallel transport relations. It is verified that the closures reproduce the exact linear response function of Hammett and Perkins [Phys. Rev. Lett. 64, 3019 (1990)] for Landau damping given a temperature gradient. In contrast to their approximate closures where the vanishing viscosity coefficient numerically gives an exact response, our closures relate the heat flow and nonvanishing viscosity to temperature and flow velocity (gradients).

  3. Implementing Shared Memory Parallelism in MCBEND

    NASA Astrophysics Data System (ADS)

    Bird, Adam; Long, David; Dobson, Geoff

    2017-09-01

    MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.

  4. Configuration affects parallel stent grafting results.

    PubMed

    Tanious, Adam; Wooster, Mathew; Armstrong, Paul A; Zwiebel, Bruce; Grundy, Shane; Back, Martin R; Shames, Murray L

    2018-05-01

    A number of adjunctive "off-the-shelf" procedures have been described to treat complex aortic diseases. Our goal was to evaluate parallel stent graft configurations and to determine an optimal formula for these procedures. This is a retrospective review of all patients at a single medical center treated with parallel stent grafts from January 2010 to September 2015. Outcomes were evaluated on the basis of parallel graft orientation, type, and main body device. Primary end points included parallel stent graft compromise and overall endovascular aneurysm repair (EVAR) compromise. There were 78 patients treated with a total of 144 parallel stents for a variety of pathologic processes. There was a significant correlation between main body oversizing and snorkel compromise (P = .0195) and overall procedural complication (P = .0019) but not with endoleak rates. Patients were organized into the following oversizing groups for further analysis: 0% to 10%, 10% to 20%, and >20%. Those oversized into the 0% to 10% group had the highest rate of overall EVAR complication (73%; P = .0003). There were no significant correlations between any one particular configuration and overall procedural complication. There was also no significant correlation between total number of parallel stents employed and overall complication. Composite EVAR configuration had no significant correlation with individual snorkel compromise, endoleak, or overall EVAR or procedural complication. The configuration most prone to individual snorkel compromise and overall EVAR complication was a four-stent configuration with two stents in an antegrade position and two stents in a retrograde position (60% complication rate). The configuration most prone to endoleak was one or two stents in retrograde position (33% endoleak rate), followed by three stents in an all-antegrade position (25%). There was a significant correlation between individual stent configuration and stent compromise (P = .0385), with 31

  5. Summer Proceedings 2016: The Center for Computing Research at Sandia National Laboratories

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carleton, James Brian; Parks, Michael L.

    Solving sparse linear systems from the discretization of elliptic partial differential equations (PDEs) is an important building block in many engineering applications. Sparse direct solvers can solve general linear systems, but are usually slower and use much more memory than effective iterative solvers. To overcome these two disadvantages, a hierarchical solver (LoRaSp) based on H2-matrices was introduced in [22]. Here, we have developed a parallel version of the algorithm in LoRaSp to solve large sparse matrices on distributed memory machines. On a single processor, the factorization time of our parallel solver scales almost linearly with the problem size for three-dimensionalmore » problems, as opposed to the quadratic scalability of many existing sparse direct solvers. Moreover, our solver leads to almost constant numbers of iterations, when used as a preconditioner for Poisson problems. On more than one processor, our algorithm has significant speedups compared to sequential runs. With this parallel algorithm, we are able to solve large problems much faster than many existing packages as demonstrated by the numerical experiments.« less

  6. Matching pursuit parallel decomposition of seismic data

    NASA Astrophysics Data System (ADS)

    Li, Chuanhui; Zhang, Fanchang

    2017-07-01

    In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.

  7. Parallel tempering for the traveling salesman problem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Percus, Allon; Wang, Richard; Hyman, Jeffrey

    We explore the potential of parallel tempering as a combinatorial optimization method, applying it to the traveling salesman problem. We compare simulation results of parallel tempering with a benchmark implementation of simulated annealing, and study how different choices of parameters affect the relative performance of the two methods. We find that a straightforward implementation of parallel tempering can outperform simulated annealing in several crucial respects. When parameters are chosen appropriately, both methods yield close approximation to the actual minimum distance for an instance with 200 nodes. However, parallel tempering yields more consistently accurate results when a series of independent simulationsmore » are performed. Our results suggest that parallel tempering might offer a simple but powerful alternative to simulated annealing for combinatorial optimization problems.« less

  8. Parallel evolutionary computation in bioinformatics applications.

    PubMed

    Pinho, Jorge; Sobral, João Luis; Rocha, Miguel

    2013-05-01

    A large number of optimization problems within the field of Bioinformatics require methods able to handle its inherent complexity (e.g. NP-hard problems) and also demand increased computational efforts. In this context, the use of parallel architectures is a necessity. In this work, we propose ParJECoLi, a Java based library that offers a large set of metaheuristic methods (such as Evolutionary Algorithms) and also addresses the issue of its efficient execution on a wide range of parallel architectures. The proposed approach focuses on the easiness of use, making the adaptation to distinct parallel environments (multicore, cluster, grid) transparent to the user. Indeed, this work shows how the development of the optimization library can proceed independently of its adaptation for several architectures, making use of Aspect-Oriented Programming. The pluggable nature of parallelism related modules allows the user to easily configure its environment, adding parallelism modules to the base source code when needed. The performance of the platform is validated with two case studies within biological model optimization. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  9. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2001-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify the program execution without changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  10. Relative Debugging of Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Hood, Robert; Biegel, Bryan (Technical Monitor)

    2002-01-01

    We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the computations begin to differ. If the original serial code is correct, errors due to parallelization will be isolated by the comparison. One of the primary goals of the system is to minimize the effort required of the user. To that end, the debugging system uses information produced by the parallelization tool to drive the comparison process. In particular, the debugging system relies on the parallelization tool to provide information about where variables may have been modified and how arrays are distributed across multiple processes. User effort is also reduced through the use of dynamic instrumentation. This allows us to modify, the program execution with out changing the way the user builds the executable. The use of dynamic instrumentation also permits us to compare the executions in a fine-grained fashion and only involve the debugger when a difference has been detected. This reduces the overhead of executing instrumentation.

  11. High-speed parallel implementation of a modified PBR algorithm on DSP-based EH topology

    NASA Astrophysics Data System (ADS)

    Rajan, K.; Patnaik, L. M.; Ramakrishna, J.

    1997-08-01

    Algebraic Reconstruction Technique (ART) is an age-old method used for solving the problem of three-dimensional (3-D) reconstruction from projections in electron microscopy and radiology. In medical applications, direct 3-D reconstruction is at the forefront of investigation. The simultaneous iterative reconstruction technique (SIRT) is an ART-type algorithm with the potential of generating in a few iterations tomographic images of a quality comparable to that of convolution backprojection (CBP) methods. Pixel-based reconstruction (PBR) is similar to SIRT reconstruction, and it has been shown that PBR algorithms give better quality pictures compared to those produced by SIRT algorithms. In this work, we propose a few modifications to the PBR algorithms. The modified algorithms are shown to give better quality pictures compared to PBR algorithms. The PBR algorithm and the modified PBR algorithms are highly compute intensive, Not many attempts have been made to reconstruct objects in the true 3-D sense because of the high computational overhead. In this study, we have developed parallel two-dimensional (2-D) and 3-D reconstruction algorithms based on modified PBR. We attempt to solve the two problems encountered by the PBR and modified PBR algorithms, i.e., the long computational time and the large memory requirements, by parallelizing the algorithm on a multiprocessor system. We investigate the possible task and data partitioning schemes by exploiting the potential parallelism in the PBR algorithm subject to minimizing the memory requirement. We have implemented an extended hypercube (EH) architecture for the high-speed execution of the 3-D reconstruction algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) and dual-port random access memories (DPR) as channels between the PEs. We discuss and compare the performances of the PBR algorithm on an IBM 6000 RISC workstation, on a Silicon

  12. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Gryphon, Coranth D.; Miller, Mark D.

    1991-01-01

    PCLIPS (Parallel CLIPS) is a set of extensions to the C Language Integrated Production System (CLIPS) expert system language. PCLIPS is intended to provide an environment for the development of more complex, extensive expert systems. Multiple CLIPS expert systems are now capable of running simultaneously on separate processors, or separate machines, thus dramatically increasing the scope of solvable tasks within the expert systems. As a tool for parallel processing, PCLIPS allows for an expert system to add to its fact-base information generated by other expert systems, thus allowing systems to assist each other in solving a complex problem. This allows individual expert systems to be more compact and efficient, and thus run faster or on smaller machines.

  13. Multitasking TORT under UNICOS: Parallel performance models and measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barnett, A.; Azmy, Y.Y.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  14. Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Azmy, Y.Y.; Barnett, D.A.

    1999-09-27

    The existing parallel algorithms in the TORT discrete ordinates were updated to function in a UNI-COS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.

  15. Parallel Processing at the High School Level.

    ERIC Educational Resources Information Center

    Sheary, Kathryn Anne

    This study investigated the ability of high school students to cognitively understand and implement parallel processing. Data indicates that most parallel processing is being taught at the university level. Instructional modules on C, Linux, and the parallel processing language, P4, were designed to show that high school students are highly…

  16. Simulation Exploration through Immersive Parallel Planes: Preprint

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brunhart-Lupo, Nicholas; Bush, Brian W.; Gruchalla, Kenny

    We present a visualization-driven simulation system that tightly couples systems dynamics simulations with an immersive virtual environment to allow analysts to rapidly develop and test hypotheses in a high-dimensional parameter space. To accomplish this, we generalize the two-dimensional parallel-coordinates statistical graphic as an immersive 'parallel-planes' visualization for multivariate time series emitted by simulations running in parallel with the visualization. In contrast to traditional parallel coordinate's mapping the multivariate dimensions onto coordinate axes represented by a series of parallel lines, we map pairs of the multivariate dimensions onto a series of parallel rectangles. As in the case of parallel coordinates, eachmore » individual observation in the dataset is mapped to a polyline whose vertices coincide with its coordinate values. Regions of the rectangles can be 'brushed' to highlight and select observations of interest: a 'slider' control allows the user to filter the observations by their time coordinate. In an immersive virtual environment, users interact with the parallel planes using a joystick that can select regions on the planes, manipulate selection, and filter time. The brushing and selection actions are used to both explore existing data as well as to launch additional simulations corresponding to the visually selected portions of the input parameter space. As soon as the new simulations complete, their resulting observations are displayed in the virtual environment. This tight feedback loop between simulation and immersive analytics accelerates users' realization of insights about the simulation and its output.« less

  17. Sub-domain decomposition methods and computational controls for multibody dynamical systems. [of spacecraft structures

    NASA Technical Reports Server (NTRS)

    Menon, R. G.; Kurdila, A. J.

    1992-01-01

    This paper presents a concurrent methodology to simulate the dynamics of flexible multibody systems with a large number of degrees of freedom. A general class of open-loop structures is treated and a redundant coordinate formulation is adopted. A range space method is used in which the constraint forces are calculated using a preconditioned conjugate gradient method. By using a preconditioner motivated by the regular ordering of the directed graph of the structures, it is shown that the method is order N in the total number of coordinates of the system. The overall formulation has the advantage that it permits fine parallelization and does not rely on system topology to induce concurrency. It can be efficiently implemented on the present generation of parallel computers with a large number of processors. Validation of the method is presented via numerical simulations of space structures incorporating large number of flexible degrees of freedom.

  18. Parallel integrated frame synchronizer chip

    NASA Technical Reports Server (NTRS)

    Solomon, Jeffrey Michael (Inventor); Ghuman, Parminder Singh (Inventor); Bennett, Toby Dennis (Inventor)

    2000-01-01

    A parallel integrated frame synchronizer which implements a sequential pipeline process wherein serial data in the form of telemetry data or weather satellite data enters the synchronizer by means of a front-end subsystem and passes to a parallel correlator subsystem or a weather satellite data processing subsystem. When in a CCSDS mode, data from the parallel correlator subsystem passes through a window subsystem, then to a data alignment subsystem and then to a bit transition density (BTD)/cyclical redundancy check (CRC) decoding subsystem. Data from the BTD/CRC decoding subsystem or data from the weather satellite data processing subsystem is then fed to an output subsystem where it is output from a data output port.

  19. Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

  20. Massively parallel multicanonical simulations

    NASA Astrophysics Data System (ADS)

    Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard

    2018-03-01

    Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.

  1. Introducing parallelism to histogramming functions for GEM systems

    NASA Astrophysics Data System (ADS)

    Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech

    2015-09-01

    This article is an assessment of potential parallelization of histogramming algorithms in GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. Preliminary implementation of parallel strip histogramming resulted in speedup. Analysis of algorithms parallelizability is presented. Overview of potential hardware and software support to implement parallel algorithm is discussed.

  2. SU-F-SPS-09: Parallel MC Kernel Calculations for VMAT Plan Improvement

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chamberlain, S; Roswell Park Cancer Institute, Buffalo, NY; French, S

    Purpose: Adding kernels (small perturbations in leaf positions) to the existing apertures of VMAT control points may improve plan quality. We investigate the calculation of kernel doses using a parallelized Monte Carlo (MC) method. Methods: A clinical prostate VMAT DICOM plan was exported from Eclipse. An arbitrary control point and leaf were chosen, and a modified MLC file was created, corresponding to the leaf position offset by 0.5cm. The additional dose produced by this 0.5 cm × 0.5 cm kernel was calculated using the DOSXYZnrc component module of BEAMnrc. A range of particle history counts were run (varying from 3more » × 10{sup 6} to 3 × 10{sup 7}); each job was split among 1, 10, or 100 parallel processes. A particle count of 3 × 10{sup 6} was established as the lower range because it provided the minimal accuracy level. Results: As expected, an increase in particle counts linearly increases run time. For the lowest particle count, the time varied from 30 hours for the single-processor run, to 0.30 hours for the 100-processor run. Conclusion: Parallel processing of MC calculations in the EGS framework significantly decreases time necessary for each kernel dose calculation. Particle counts lower than 1 × 10{sup 6} have too large of an error to output accurate dose for a Monte Carlo kernel calculation. Future work will investigate increasing the number of parallel processes and optimizing run times for multiple kernel calculations.« less

  3. Code Parallelization with CAPO: A User Manual

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry; Biegel, Bryan (Technical Monitor)

    2001-01-01

    A software tool has been developed to assist the parallelization of scientific codes. This tool, CAPO, extends an existing parallelization toolkit, CAPTools developed at the University of Greenwich, to generate OpenMP parallel codes for shared memory architectures. This is an interactive toolkit to transform a serial Fortran application code to an equivalent parallel version of the software - in a small fraction of the time normally required for a manual parallelization. We first discuss the way in which loop types are categorized and how efficient OpenMP directives can be defined and inserted into the existing code using the in-depth interprocedural analysis. The use of the toolkit on a number of application codes ranging from benchmark to real-world application codes is presented. This will demonstrate the great potential of using the toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of toolkit to quickly parallelize serial programs as well as the good performance achievable on a large number of processors. The second part of the document gives references to the parameters and the graphic user interface implemented in the toolkit. Finally a set of tutorials is included for hands-on experiences with this toolkit.

  4. Parallelizing Timed Petri Net simulations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1993-01-01

    The possibility of using parallel processing to accelerate the simulation of Timed Petri Nets (TPN's) was studied. It was recognized that complex system development tools often transform system descriptions into TPN's or TPN-like models, which are then simulated to obtain information about system behavior. Viewed this way, it was important that the parallelization of TPN's be as automatic as possible, to admit the possibility of the parallelization being embedded in the system design tool. Later years of the grant were devoted to examining the problem of joint performance and reliability analysis, to explore whether both types of analysis could be accomplished within a single framework. In this final report, the results of our studies are summarized. We believe that the problem of parallelizing TPN's automatically for MIMD architectures has been almost completely solved for a large and important class of problems. Our initial investigations into joint performance/reliability analysis are two-fold; it was shown that Monte Carlo simulation, with importance sampling, offers promise of joint analysis in the context of a single tool, and methods for the parallel simulation of general Continuous Time Markov Chains, a model framework within which joint performance/reliability models can be cast, were developed. However, very much more work is needed to determine the scope and generality of these approaches. The results obtained in our two studies, future directions for this type of work, and a list of publications are included.

  5. Parallel fabrication of macroporous scaffolds.

    PubMed

    Dobos, Andrew; Grandhi, Taraka Sai Pavan; Godeshala, Sudhakar; Meldrum, Deirdre R; Rege, Kaushal

    2018-07-01

    Scaffolds generated from naturally occurring and synthetic polymers have been investigated in several applications because of their biocompatibility and tunable chemo-mechanical properties. Existing methods for generation of 3D polymeric scaffolds typically cannot be parallelized, suffer from low throughputs, and do not allow for quick and easy removal of the fragile structures that are formed. Current molds used in hydrogel and scaffold fabrication using solvent casting and porogen leaching are often single-use and do not facilitate 3D scaffold formation in parallel. Here, we describe a simple device and related approaches for the parallel fabrication of macroporous scaffolds. This approach was employed for the generation of macroporous and non-macroporous materials in parallel, in higher throughput and allowed for easy retrieval of these 3D scaffolds once formed. In addition, macroporous scaffolds with interconnected as well as non-interconnected pores were generated, and the versatility of this approach was employed for the generation of 3D scaffolds from diverse materials including an aminoglycoside-derived cationic hydrogel ("Amikagel"), poly(lactic-co-glycolic acid) or PLGA, and collagen. Macroporous scaffolds generated using the device were investigated for plasmid DNA binding and cell loading, indicating the use of this approach for developing materials for different applications in biotechnology. Our results demonstrate that the device-based approach is a simple technology for generating scaffolds in parallel, which can enhance the toolbox of current fabrication techniques. © 2018 Wiley Periodicals, Inc.

  6. Parallel solution of sparse one-dimensional dynamic programming problems

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1989-01-01

    Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.

  7. Adaptive parallel logic networks

    NASA Technical Reports Server (NTRS)

    Martinez, Tony R.; Vidal, Jacques J.

    1988-01-01

    Adaptive, self-organizing concurrent systems (ASOCS) that combine self-organization with massive parallelism for such applications as adaptive logic devices, robotics, process control, and system malfunction management, are presently discussed. In ASOCS, an adaptive network composed of many simple computing elements operating in combinational and asynchronous fashion is used and problems are specified by presenting if-then rules to the system in the form of Boolean conjunctions. During data processing, which is a different operational phase from adaptation, the network acts as a parallel hardware circuit.

  8. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    NASA Astrophysics Data System (ADS)

    Nash, Thomas

    1989-12-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.

  9. Cosmic Microwave Background Mapmaking with a Messenger Field

    NASA Astrophysics Data System (ADS)

    Huffenberger, Kevin M.; Næss, Sigurd K.

    2018-01-01

    We apply a messenger field method to solve the linear minimum-variance mapmaking equation in the context of Cosmic Microwave Background (CMB) observations. In simulations, the method produces sky maps that converge significantly faster than those from a conjugate gradient descent algorithm with a diagonal preconditioner, even though the computational cost per iteration is similar. The messenger method recovers large scales in the map better than conjugate gradient descent, and yields a lower overall χ2. In the single, pencil beam approximation, each iteration of the messenger mapmaking procedure produces an unbiased map, and the iterations become more optimal as they proceed. A variant of the method can handle differential data or perform deconvolution mapmaking. The messenger method requires no preconditioner, but a high-quality solution needs a cooling parameter to control the convergence. We study the convergence properties of this new method and discuss how the algorithm is feasible for the large data sets of current and future CMB experiments.

  10. Increasing processor utilization during parallel computation rundown

    NASA Technical Reports Server (NTRS)

    Jones, W. H.

    1986-01-01

    Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.

  11. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  12. Parallel Distributed Processing and Lexical-Semantic Effects in Visual Word Recognition: Are a Few Stages Necessary?

    ERIC Educational Resources Information Center

    Borowsky, Ron; Besner, Derek

    2006-01-01

    D. C. Plaut and J. R. Booth presented a parallel distributed processing model that purports to simulate human lexical decision performance. This model (and D. C. Plaut, 1995) offers a single mechanism account of the pattern of factor effects on reaction time (RT) between semantic priming, word frequency, and stimulus quality without requiring a…

  13. Externally Calibrated Parallel Imaging for 3D Multispectral Imaging Near Metallic Implants Using Broadband Ultrashort Echo Time Imaging

    PubMed Central

    Wiens, Curtis N.; Artz, Nathan S.; Jang, Hyungseok; McMillan, Alan B.; Reeder, Scott B.

    2017-01-01

    Purpose To develop an externally calibrated parallel imaging technique for three-dimensional multispectral imaging (3D-MSI) in the presence of metallic implants. Theory and Methods A fast, ultrashort echo time (UTE) calibration acquisition is proposed to enable externally calibrated parallel imaging techniques near metallic implants. The proposed calibration acquisition uses a broadband radiofrequency (RF) pulse to excite the off-resonance induced by the metallic implant, fully phase-encoded imaging to prevent in-plane distortions, and UTE to capture rapidly decaying signal. The performance of the externally calibrated parallel imaging reconstructions was assessed using phantoms and in vivo examples. Results Phantom and in vivo comparisons to self-calibrated parallel imaging acquisitions show that significant reductions in acquisition times can be achieved using externally calibrated parallel imaging with comparable image quality. Acquisition time reductions are particularly large for fully phase-encoded methods such as spectrally resolved fully phase-encoded three-dimensional (3D) fast spin-echo (SR-FPE), in which scan time reductions of up to 8 min were obtained. Conclusion A fully phase-encoded acquisition with broadband excitation and UTE enabled externally calibrated parallel imaging for 3D-MSI, eliminating the need for repeated calibration regions at each frequency offset. Significant reductions in acquisition time can be achieved, particularly for fully phase-encoded methods like SR-FPE. PMID:27403613

  14. Parallel computations and control of adaptive structures

    NASA Technical Reports Server (NTRS)

    Park, K. C.; Alvin, Kenneth F.; Belvin, W. Keith; Chong, K. P. (Editor); Liu, S. C. (Editor); Li, J. C. (Editor)

    1991-01-01

    The equations of motion for structures with adaptive elements for vibration control are presented for parallel computations to be used as a software package for real-time control of flexible space structures. A brief introduction of the state-of-the-art parallel computational capability is also presented. Time marching strategies are developed for an effective use of massive parallel mapping, partitioning, and the necessary arithmetic operations. An example is offered for the simulation of control-structure interaction on a parallel computer and the impact of the approach presented for applications in other disciplines than aerospace industry is assessed.

  15. A parallel adaptive mesh refinement algorithm

    NASA Technical Reports Server (NTRS)

    Quirk, James J.; Hanebutte, Ulf R.

    1993-01-01

    Over recent years, Adaptive Mesh Refinement (AMR) algorithms which dynamically match the local resolution of the computational grid to the numerical solution being sought have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. This method is built upon just a few simple message passing routines and so it may be implemented across a broad class of MIMD machines. Moreover, the method of parallelization is such that the original serial code is left virtually intact, and so we are left with just a single product to support. The importance of this fact should not be underestimated given the size and complexity of the original algorithm.

  16. Equalizer: a scalable parallel rendering framework.

    PubMed

    Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato

    2009-01-01

    Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand for flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.

  17. Broadcasting a message in a parallel computer

    DOEpatents

    Berg, Jeremy E [Rochester, MN; Faraj, Ahmad A [Rochester, MN

    2011-08-02

    Methods, systems, and products are disclosed for broadcasting a message in a parallel computer. The parallel computer includes a plurality of compute nodes connected together using a data communications network. The data communications network optimized for point to point data communications and is characterized by at least two dimensions. The compute nodes are organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer. One compute node of the operational group assigned to be a logical root. Broadcasting a message in a parallel computer includes: establishing a Hamiltonian path along all of the compute nodes in at least one plane of the data communications network and in the operational group; and broadcasting, by the logical root to the remaining compute nodes, the logical root's message along the established Hamiltonian path.

  18. A Newton-Krylov method with an approximate analytical Jacobian for implicit solution of Navier-Stokes equations on staggered overset-curvilinear grids with immersed boundaries.

    PubMed

    Asgharzadeh, Hafez; Borazjani, Iman

    2017-02-15

    diagonal of the Jacobian further improves the performance by 42 - 74% compared to the full Jacobian. The NKM with an analytical Jacobian showed better performance than the fixed point Runge-Kutta because it converged with higher time steps and in approximately 30% less iterations even when the grid was stretched and the Reynold number was increased. In fact, stretching the grid decreased the performance of all methods, but the fixed-point Runge-Kutta performance decreased 4.57 and 2.26 times more than NKM with a diagonal Jacobian when the stretching factor was increased, respectively. The NKM with a diagonal analytical Jacobian and matrix-free method with an analytical preconditioner are the fastest methods and the superiority of one to another depends on the flow problem. Furthermore, the implemented methods are fully parallelized with parallel efficiency of 80-90% on the problems tested. The NKM with the analytical Jacobian can guide building preconditioners for other techniques to improve their performance in the future.

  19. A Newton–Krylov method with an approximate analytical Jacobian for implicit solution of Navier–Stokes equations on staggered overset-curvilinear grids with immersed boundaries

    PubMed Central

    Asgharzadeh, Hafez; Borazjani, Iman

    2016-01-01

    diagonal of the Jacobian further improves the performance by 42 – 74% compared to the full Jacobian. The NKM with an analytical Jacobian showed better performance than the fixed point Runge-Kutta because it converged with higher time steps and in approximately 30% less iterations even when the grid was stretched and the Reynold number was increased. In fact, stretching the grid decreased the performance of all methods, but the fixed-point Runge-Kutta performance decreased 4.57 and 2.26 times more than NKM with a diagonal Jacobian when the stretching factor was increased, respectively. The NKM with a diagonal analytical Jacobian and matrix-free method with an analytical preconditioner are the fastest methods and the superiority of one to another depends on the flow problem. Furthermore, the implemented methods are fully parallelized with parallel efficiency of 80–90% on the problems tested. The NKM with the analytical Jacobian can guide building preconditioners for other techniques to improve their performance in the future. PMID:28042172

  20. A Newton-Krylov method with an approximate analytical Jacobian for implicit solution of Navier-Stokes equations on staggered overset-curvilinear grids with immersed boundaries

    NASA Astrophysics Data System (ADS)

    Asgharzadeh, Hafez; Borazjani, Iman

    2017-02-01

    diagonal of the Jacobian further improves the performance by 42-74% compared to the full Jacobian. The NKM with an analytical Jacobian showed better performance than the fixed point Runge-Kutta because it converged with higher time steps and in approximately 30% less iterations even when the grid was stretched and the Reynold number was increased. In fact, stretching the grid decreased the performance of all methods, but the fixed-point Runge-Kutta performance decreased 4.57 and 2.26 times more than NKM with a diagonal and full Jacobian, respectivley, when the stretching factor was increased. The NKM with a diagonal analytical Jacobian and matrix-free method with an analytical preconditioner are the fastest methods and the superiority of one to another depends on the flow problem. Furthermore, the implemented methods are fully parallelized with parallel efficiency of 80-90% on the problems tested. The NKM with the analytical Jacobian can guide building preconditioners for other techniques to improve their performance in the future.

  1. Parallel experimental design and multivariate analysis provides efficient screening of cell culture media supplements to improve biosimilar product quality.

    PubMed

    Brühlmann, David; Sokolov, Michael; Butté, Alessandro; Sauer, Markus; Hemberger, Jürgen; Souquet, Jonathan; Broly, Hervé; Jordan, Martin

    2017-07-01

    Rational and high-throughput optimization of mammalian cell culture media has a great potential to modulate recombinant protein product quality. We present a process design method based on parallel design-of-experiment (DoE) of CHO fed-batch cultures in 96-deepwell plates to modulate monoclonal antibody (mAb) glycosylation using medium supplements. To reduce the risk of losing valuable information in an intricate joint screening, 17 compounds were separated into five different groups, considering their mode of biological action. The concentration ranges of the medium supplements were defined according to information encountered in the literature and in-house experience. The screening experiments produced wide glycosylation pattern ranges. Multivariate analysis including principal component analysis and decision trees was used to select the best performing glycosylation modulators. Subsequent D-optimal quadratic design with four factors (three promising compounds and temperature shift) in shake tubes confirmed the outcome of the selection process and provided a solid basis for sequential process development at a larger scale. The glycosylation profile with respect to the specifications for biosimilarity was greatly improved in shake tube experiments: 75% of the conditions were equally close or closer to the specifications for biosimilarity than the best 25% in 96-deepwell plates. Biotechnol. Bioeng. 2017;114: 1448-1458. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.

  2. The AIS-5000 parallel processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schmitt, L.A.; Wilson, S.S.

    1988-05-01

    The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has superior cost/performance characteristics than two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In thismore » paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance bench marks are given for certain simple and complex functions.« less

  3. Fast I/O for Massively Parallel Applications

    NASA Technical Reports Server (NTRS)

    OKeefe, Matthew T.

    1996-01-01

    The two primary goals for this report were the design, contruction and modeling of parallel disk arrays for scientific visualization and animation, and a study of the IO requirements of highly parallel applications. In addition, further work in parallel display systems required to project and animate the very high-resolution frames resulting from our supercomputing simulations in ocean circulation and compressible gas dynamics.

  4. Parallel Algorithms for Groebner-Basis Reduction

    DTIC Science & Technology

    1987-09-25

    22209 ELEMENT NO. NO. NO. ACCESSION NO. 11. TITLE (Include Security Classification) * PARALLEL ALGORITHMS FOR GROEBNER -BASIS REDUCTION 12. PERSONAL...All other editions are obsolete. Productivity Engineering in the UNIXt Environment p Parallel Algorithms for Groebner -Basis Reduction Technical Report

  5. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  6. Local multiplicative Schwarz algorithms for convection-diffusion equations

    NASA Technical Reports Server (NTRS)

    Cai, Xiao-Chuan; Sarkis, Marcus

    1995-01-01

    We develop a new class of overlapping Schwarz type algorithms for solving scalar convection-diffusion equations discretized by finite element or finite difference methods. The preconditioners consist of two components, namely, the usual two-level additive Schwarz preconditioner and the sum of some quadratic terms constructed by using products of ordered neighboring subdomain preconditioners. The ordering of the subdomain preconditioners is determined by considering the direction of the flow. We prove that the algorithms are optimal in the sense that the convergence rates are independent of the mesh size, as well as the number of subdomains. We show by numerical examples that the new algorithms are less sensitive to the direction of the flow than either the classical multiplicative Schwarz algorithms, and converge faster than the additive Schwarz algorithms. Thus, the new algorithms are more suitable for fluid flow applications than the classical additive or multiplicative Schwarz algorithms.

  7. Solving Graph Laplacian Systems Through Recursive Bisections and Two-Grid Preconditioning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ponce, Colin; Vassilevski, Panayot S.

    2016-02-18

    We present a parallelizable direct method for computing the solution to graph Laplacian-based linear systems derived from graphs that can be hierarchically bipartitioned with small edge cuts. For a graph of size n with constant-size edge cuts, our method decomposes a graph Laplacian in time O(n log n), and then uses that decomposition to perform a linear solve in time O(n log n). We then use the developed technique to design a preconditioner for graph Laplacians that do not have this property. Finally, we augment this preconditioner with a two-grid method that accounts for much of the preconditioner's weaknesses. Wemore » present an analysis of this method, as well as a general theorem for the condition number of a general class of two-grid support graph-based preconditioners. Numerical experiments illustrate the performance of the studied methods.« less

  8. Address tracing for parallel machines

    NASA Technical Reports Server (NTRS)

    Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent

    1991-01-01

    Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.

  9. A Parallel Genetic Algorithm for Automated Electronic Circuit Design

    NASA Technical Reports Server (NTRS)

    Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris

    2000-01-01

    Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and Os, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency

  10. Parallel adaptive wavelet collocation method for PDEs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nejadmalayeri, Alireza, E-mail: Alireza.Nejadmalayeri@gmail.com; Vezolainen, Alexei, E-mail: Alexei.Vezolainen@Colorado.edu; Brown-Dymkoski, Eric, E-mail: Eric.Browndymkoski@Colorado.edu

    2015-10-01

    A parallel adaptive wavelet collocation method for solving a large class of Partial Differential Equations is presented. The parallelization is achieved by developing an asynchronous parallel wavelet transform, which allows one to perform parallel wavelet transform and derivative calculations with only one data synchronization at the highest level of resolution. The data are stored using tree-like structure with tree roots starting at a priori defined level of resolution. Both static and dynamic domain partitioning approaches are developed. For the dynamic domain partitioning, trees are considered to be the minimum quanta of data to be migrated between the processes. This allowsmore » fully automated and efficient handling of non-simply connected partitioning of a computational domain. Dynamic load balancing is achieved via domain repartitioning during the grid adaptation step and reassigning trees to the appropriate processes to ensure approximately the same number of grid points on each process. The parallel efficiency of the approach is discussed based on parallel adaptive wavelet-based Coherent Vortex Simulations of homogeneous turbulence with linear forcing at effective non-adaptive resolutions up to 2048{sup 3} using as many as 2048 CPU cores.« less

  11. Implementation of a Fully-Balanced Periodic Tridiagonal Solver on a Parallel Distributed Memory Architecture

    DTIC Science & Technology

    1994-05-01

    PARALLEL DISTRIBUTED MEMORY ARCHITECTURE LTJh T. M. Eidson 0 - 8 l 9 5 " G. Erlebacher _ _ _. _ DTIe QUALITY INSPECTED a Contract NAS I - 19480 May 1994...DISTRIBUTED MEMORY ARCHITECTURE T.M. Eidson * High Technology Corporation Hampton, VA 23665 G. Erlebachert Institute for Computer Applications in Science and...developed and evaluated. Simple model calculations as well as timing results are pres.nted to evaluate the various strategies. The particular

  12. Parallel multigrid smoothing: polynomial versus Gauss-Seidel

    NASA Astrophysics Data System (ADS)

    Adams, Mark; Brezina, Marian; Hu, Jonathan; Tuminaro, Ray

    2003-07-01

    Gauss-Seidel is often the smoother of choice within multigrid applications. In the context of unstructured meshes, however, maintaining good parallel efficiency is difficult with multiplicative iterative methods such as Gauss-Seidel. This leads us to consider alternative smoothers. We discuss the computational advantages of polynomial smoothers within parallel multigrid algorithms for positive definite symmetric systems. Two particular polynomials are considered: Chebyshev and a multilevel specific polynomial. The advantages of polynomial smoothing over traditional smoothers such as Gauss-Seidel are illustrated on several applications: Poisson's equation, thin-body elasticity, and eddy current approximations to Maxwell's equations. While parallelizing the Gauss-Seidel method typically involves a compromise between a scalable convergence rate and maintaining high flop rates, polynomial smoothers achieve parallel scalable multigrid convergence rates without sacrificing flop rates. We show that, although parallel computers are the main motivation, polynomial smoothers are often surprisingly competitive with Gauss-Seidel smoothers on serial machines.

  13. Iterative methods for 3D implicit finite-difference migration using the complex Padé approximation

    NASA Astrophysics Data System (ADS)

    Costa, Carlos A. N.; Campos, Itamara S.; Costa, Jessé C.; Neto, Francisco A.; Schleicher, Jörg; Novais, Amélia

    2013-08-01

    Conventional implementations of 3D finite-difference (FD) migration use splitting techniques to accelerate performance and save computational cost. However, such techniques are plagued with numerical anisotropy that jeopardises the correct positioning of dipping reflectors in the directions not used for the operator splitting. We implement 3D downward continuation FD migration without splitting using a complex Padé approximation. In this way, the numerical anisotropy is eliminated at the expense of a computationally more intensive solution of a large-band linear system. We compare the performance of the iterative stabilized biconjugate gradient (BICGSTAB) and that of the multifrontal massively parallel direct solver (MUMPS). It turns out that the use of the complex Padé approximation not only stabilizes the solution, but also acts as an effective preconditioner for the BICGSTAB algorithm, reducing the number of iterations as compared to the implementation using the real Padé expansion. As a consequence, the iterative BICGSTAB method is more efficient than the direct MUMPS method when solving a single term in the Padé expansion. The results of both algorithms, here evaluated by computing the migration impulse response in the SEG/EAGE salt model, are of comparable quality.

  14. Parallel Activation in Bilingual Phonological Processing

    ERIC Educational Resources Information Center

    Lee, Su-Yeon

    2011-01-01

    In bilingual language processing, the parallel activation hypothesis suggests that bilinguals activate their two languages simultaneously during language processing. Support for the parallel activation mainly comes from studies of lexical (word-form) processing, with relatively less attention to phonological (sound) processing. According to…

  15. PARAVT: Parallel Voronoi tessellation code

    NASA Astrophysics Data System (ADS)

    González, R. E.

    2016-10-01

    In this study, we present a new open source code for massive parallel computation of Voronoi tessellations (VT hereafter) in large data sets. The code is focused for astrophysical purposes where VT densities and neighbors are widely used. There are several serial Voronoi tessellation codes, however no open source and parallel implementations are available to handle the large number of particles/galaxies in current N-body simulations and sky surveys. Parallelization is implemented under MPI and VT using Qhull library. Domain decomposition takes into account consistent boundary computation between tasks, and includes periodic conditions. In addition, the code computes neighbors list, Voronoi density, Voronoi cell volume, density gradient for each particle, and densities on a regular grid. Code implementation and user guide are publicly available at https://github.com/regonzar/paravt.

  16. Parallelization and automatic data distribution for nuclear reactor simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liebrock, L.M.

    1997-07-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directlymore » affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.« less

  17. Synchronization Of Parallel Discrete Event Simulations

    NASA Technical Reports Server (NTRS)

    Steinman, Jeffrey S.

    1992-01-01

    Adaptive, parallel, discrete-event-simulation-synchronization algorithm, Breathing Time Buckets, developed in Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) operating system. Algorithm allows parallel simulations to process events optimistically in fluctuating time cycles that naturally adapt while simulation in progress. Combines best of optimistic and conservative synchronization strategies while avoiding major disadvantages. Algorithm processes events optimistically in time cycles adapting while simulation in progress. Well suited for modeling communication networks, for large-scale war games, for simulated flights of aircraft, for simulations of computer equipment, for mathematical modeling, for interactive engineering simulations, and for depictions of flows of information.

  18. Distributed parallel messaging for multiprocessor systems

    DOEpatents

    Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka

    2013-06-04

    A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.

  19. Knowledge representation into Ada parallel processing

    NASA Technical Reports Server (NTRS)

    Masotto, Tom; Babikyan, Carol; Harper, Richard

    1990-01-01

    The Knowledge Representation into Ada Parallel Processing project is a joint NASA and Air Force funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated - a portion of the adaptive tactical navigator and a real time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations, results of performance analyses showing speedup due to parallelism and initial efficiency improvements are detailed and further areas for performance improvements are suggested.

  20. Implementations of BLAST for parallel computers.

    PubMed

    Jülich, A

    1995-02-01

    The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.

  1. Studying Air Quality with Data from the Internet.

    ERIC Educational Resources Information Center

    Salter, Leo; Parsons, Barbara

    2000-01-01

    Explains how the internet can be used between institutions for parallel research opportunities. Uses air quality data to examine the relationship between traffic flow and atmospheric particulate matter (PM) values. (Author/YDS)

  2. Un-collided-flux preconditioning for the first order transport equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rigley, M.; Koebbe, J.; Drumm, C.

    2013-07-01

    Two codes were tested for the first order neutron transport equation using finite element methods. The un-collided-flux solution is used as a preconditioner for each of these methods. These codes include a least squares finite element method and a discontinuous finite element method. The performance of each code is shown on problems in one and two dimensions. The un-collided-flux preconditioner shows good speedup on each of the given methods. The un-collided-flux preconditioner has been used on the second-order equation, and here we extend those results to the first order equation. (authors)

  3. An iterative solver for the 3D Helmholtz equation

    NASA Astrophysics Data System (ADS)

    Belonosov, Mikhail; Dmitriev, Maxim; Kostin, Victor; Neklyudov, Dmitry; Tcheverda, Vladimir

    2017-09-01

    We develop a frequency-domain iterative solver for numerical simulation of acoustic waves in 3D heterogeneous media. It is based on the application of a unique preconditioner to the Helmholtz equation that ensures convergence for Krylov subspace iteration methods. Effective inversion of the preconditioner involves the Fast Fourier Transform (FFT) and numerical solution of a series of boundary value problems for ordinary differential equations. Matrix-by-vector multiplication for iterative inversion of the preconditioned matrix involves inversion of the preconditioner and pointwise multiplication of grid functions. Our solver has been verified by benchmarking against exact solutions and a time-domain solver.

  4. Parallel Algorithms for Image Analysis.

    DTIC Science & Technology

    1982-06-01

    8217 _ _ _ _ _ _ _ 4. TITLE (aid Subtitle) S. TYPE OF REPORT & PERIOD COVERED PARALLEL ALGORITHMS FOR IMAGE ANALYSIS TECHNICAL 6. PERFORMING O4G. REPORT NUMBER TR-1180...Continue on reverse side it neceesary aid Identlfy by block number) Image processing; image analysis ; parallel processing; cellular computers. 20... IMAGE ANALYSIS TECHNICAL 6. PERFORMING ONG. REPORT NUMBER TR-1180 - 7. AUTHOR(&) S. CONTRACT OR GRANT NUMBER(s) Azriel Rosenfeld AFOSR-77-3271 9

  5. Speeding up parallel processing

    NASA Technical Reports Server (NTRS)

    Denning, Peter J.

    1988-01-01

    In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.

  6. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA AMES Research Center's NAS Division have demonstrated that the new generation of NUMA based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine and coarse grained level, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.

  7. National Combustion Code: Parallel Implementation and Performance

    NASA Technical Reports Server (NTRS)

    Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.

    2000-01-01

    The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.

  8. Automatic Management of Parallel and Distributed System Resources

    NASA Technical Reports Server (NTRS)

    Yan, Jerry; Ngai, Tin Fook; Lundstrom, Stephen F.

    1990-01-01

    Viewgraphs on automatic management of parallel and distributed system resources are presented. Topics covered include: parallel applications; intelligent management of multiprocessing systems; performance evaluation of parallel architecture; dynamic concurrent programs; compiler-directed system approach; lattice gaseous cellular automata; and sparse matrix Cholesky factorization.

  9. A Quantitative Quality Control Model for Parallel and Distributed Crowdsourcing Tasks

    ERIC Educational Resources Information Center

    Zhu, Shaojian

    2014-01-01

    Crowdsourcing is an emerging research area that has experienced rapid growth in the past few years. Although crowdsourcing has demonstrated its potential in numerous domains, several key challenges continue to hinder its application. One of the major challenges is quality control. How can crowdsourcing requesters effectively control the quality…

  10. A Domain Decomposition Parallelization of the Fast Marching Method

    NASA Technical Reports Server (NTRS)

    Herrmann, M.

    2003-01-01

    In this paper, the first domain decomposition parallelization of the Fast Marching Method for level sets has been presented. Parallel speedup has been demonstrated in both the optimal and non-optimal domain decomposition case. The parallel performance of the proposed method is strongly dependent on load balancing separately the number of nodes on each side of the interface. A load imbalance of nodes on either side of the domain leads to an increase in communication and rollback operations. Furthermore, the amount of inter-domain communication can be reduced by aligning the inter-domain boundaries with the interface normal vectors. In the case of optimal load balancing and aligned inter-domain boundaries, the proposed parallel FMM algorithm is highly efficient, reaching efficiency factors of up to 0.98. Future work will focus on the extension of the proposed parallel algorithm to higher order accuracy. Also, to further enhance parallel performance, the coupling of the domain decomposition parallelization to the G(sub 0)-based parallelization will be investigated.

  11. 17 CFR 12.24 - Parallel proceedings.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ...) Definition. For purposes of this section, a parallel proceeding shall include: (1) An arbitration proceeding... the receivership includes the resolution of claims made by customers; or (3) A petition filed under... any of the foregoing with knowledge of a parallel proceeding shall promptly notify the Commission, by...

  12. A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images.

    PubMed

    Du, Xiaogang; Dang, Jianwu; Wang, Yangping; Wang, Song; Lei, Tao

    2016-01-01

    The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).

  13. A Parallel Nonrigid Registration Algorithm Based on B-Spline for Medical Images

    PubMed Central

    Wang, Yangping; Wang, Song

    2016-01-01

    The nonrigid registration algorithm based on B-spline Free-Form Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on B-spline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the B-spline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the B-spline registration algorithm. As a result, the computing time of three time-consuming steps including B-splines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the B-spline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the single-threaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU). PMID:28053653

  14. Parallel Implementation of the Discontinuous Galerkin Method

    NASA Technical Reports Server (NTRS)

    Baggag, Abdalkader; Atkins, Harold; Keyes, David

    1999-01-01

    This paper describes a parallel implementation of the discontinuous Galerkin method. Discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in all object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects.

  15. Data parallel sorting for particle simulation

    NASA Technical Reports Server (NTRS)

    Dagum, Leonardo

    1992-01-01

    Sorting on a parallel architecture is a communications intensive event which can incur a high penalty in applications where it is required. In the case of particle simulation, only integer sorting is necessary, and sequential implementations easily attain the minimum performance bound of O (N) for N particles. Parallel implementations, however, have to cope with the parallel sorting problem which, in addition to incurring a heavy communications cost, can make the minimun performance bound difficult to attain. This paper demonstrates how the sorting problem in a particle simulation can be reduced to a merging problem, and describes an efficient data parallel algorithm to solve this merging problem in a particle simulation. The new algorithm is shown to be optimal under conditions usual for particle simulation, and its fieldwise implementation on the Connection Machine is analyzed in detail. The new algorithm is about four times faster than a fieldwise implementation of radix sort on the Connection Machine.

  16. Parallel/distributed direct method for solving linear systems

    NASA Technical Reports Server (NTRS)

    Lin, Avi

    1990-01-01

    A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.

  17. Parallel algorithm of real-time infrared image restoration based on total variation theory

    NASA Astrophysics Data System (ADS)

    Zhu, Ran; Li, Miao; Long, Yunli; Zeng, Yaoyuan; An, Wei

    2015-10-01

    Image restoration is a necessary preprocessing step for infrared remote sensing applications. Traditional methods allow us to remove the noise but penalize too much the gradients corresponding to edges. Image restoration techniques based on variational approaches can solve this over-smoothing problem for the merits of their well-defined mathematical modeling of the restore procedure. The total variation (TV) of infrared image is introduced as a L1 regularization term added to the objective energy functional. It converts the restoration process to an optimization problem of functional involving a fidelity term to the image data plus a regularization term. Infrared image restoration technology with TV-L1 model exploits the remote sensing data obtained sufficiently and preserves information at edges caused by clouds. Numerical implementation algorithm is presented in detail. Analysis indicates that the structure of this algorithm can be easily implemented in parallelization. Therefore a parallel implementation of the TV-L1 filter based on multicore architecture with shared memory is proposed for infrared real-time remote sensing systems. Massive computation of image data is performed in parallel by cooperating threads running simultaneously on multiple cores. Several groups of synthetic infrared image data are used to validate the feasibility and effectiveness of the proposed parallel algorithm. Quantitative analysis of measuring the restored image quality compared to input image is presented. Experiment results show that the TV-L1 filter can restore the varying background image reasonably, and that its performance can achieve the requirement of real-time image processing.

  18. Externally calibrated parallel imaging for 3D multispectral imaging near metallic implants using broadband ultrashort echo time imaging.

    PubMed

    Wiens, Curtis N; Artz, Nathan S; Jang, Hyungseok; McMillan, Alan B; Reeder, Scott B

    2017-06-01

    To develop an externally calibrated parallel imaging technique for three-dimensional multispectral imaging (3D-MSI) in the presence of metallic implants. A fast, ultrashort echo time (UTE) calibration acquisition is proposed to enable externally calibrated parallel imaging techniques near metallic implants. The proposed calibration acquisition uses a broadband radiofrequency (RF) pulse to excite the off-resonance induced by the metallic implant, fully phase-encoded imaging to prevent in-plane distortions, and UTE to capture rapidly decaying signal. The performance of the externally calibrated parallel imaging reconstructions was assessed using phantoms and in vivo examples. Phantom and in vivo comparisons to self-calibrated parallel imaging acquisitions show that significant reductions in acquisition times can be achieved using externally calibrated parallel imaging with comparable image quality. Acquisition time reductions are particularly large for fully phase-encoded methods such as spectrally resolved fully phase-encoded three-dimensional (3D) fast spin-echo (SR-FPE), in which scan time reductions of up to 8 min were obtained. A fully phase-encoded acquisition with broadband excitation and UTE enabled externally calibrated parallel imaging for 3D-MSI, eliminating the need for repeated calibration regions at each frequency offset. Significant reductions in acquisition time can be achieved, particularly for fully phase-encoded methods like SR-FPE. Magn Reson Med 77:2303-2309, 2017. © 2016 International Society for Magnetic Resonance in Medicine. © 2016 International Society for Magnetic Resonance in Medicine.

  19. Highly accelerated cardiac cine parallel MRI using low-rank matrix completion and partial separability model

    NASA Astrophysics Data System (ADS)

    Lyu, Jingyuan; Nakarmi, Ukash; Zhang, Chaoyi; Ying, Leslie

    2016-05-01

    This paper presents a new approach to highly accelerated dynamic parallel MRI using low rank matrix completion, partial separability (PS) model. In data acquisition, k-space data is moderately randomly undersampled at the center kspace navigator locations, but highly undersampled at the outer k-space for each temporal frame. In reconstruction, the navigator data is reconstructed from undersampled data using structured low-rank matrix completion. After all the unacquired navigator data is estimated, the partial separable model is used to obtain partial k-t data. Then the parallel imaging method is used to acquire the entire dynamic image series from highly undersampled data. The proposed method has shown to achieve high quality reconstructions with reduction factors up to 31, and temporal resolution of 29ms, when the conventional PS method fails.

  20. Computational efficiency of parallel combinatorial OR-tree searches

    NASA Technical Reports Server (NTRS)

    Li, Guo-Jie; Wah, Benjamin W.

    1990-01-01

    The performance of parallel combinatorial OR-tree searches is analytically evaluated. This performance depends on the complexity of the problem to be solved, the error allowance function, the dominance relation, and the search strategies. The exact performance may be difficult to predict due to the nondeterminism and anomalies of parallelism. The authors derive the performance bounds of parallel OR-tree searches with respect to the best-first, depth-first, and breadth-first strategies, and verify these bounds by simulation. They show that a near-linear speedup can be achieved with respect to a large number of processors for parallel OR-tree searches. Using the bounds developed, the authors derive sufficient conditions for assuring that parallelism will not degrade performance and necessary conditions for allowing parallelism to have a speedup greater than the ratio of the numbers of processors. These bounds and conditions provide the theoretical foundation for determining the number of processors required to assure a near-linear speedup.

  1. The Parallel Axiom

    ERIC Educational Resources Information Center

    Rogers, Pat

    1972-01-01

    Criteria for a reasonable axiomatic system are discussed. A discussion of the historical attempts to prove the independence of Euclids parallel postulate introduces non-Euclidean geometries. Poincare's model for a non-Euclidean geometry is defined and analyzed. (LS)

  2. Parallel processing in finite element structural analysis

    NASA Technical Reports Server (NTRS)

    Noor, Ahmed K.

    1987-01-01

    A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).

  3. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  4. Parallel-Processing Test Bed For Simulation Software

    NASA Technical Reports Server (NTRS)

    Blech, Richard; Cole, Gary; Townsend, Scott

    1996-01-01

    Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).

  5. Rapid indirect trajectory optimization on highly parallel computing architectures

    NASA Astrophysics Data System (ADS)

    Antony, Thomas

    Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical

  6. [Three-dimensional parallel collagen scaffold promotes tendon extracellular matrix formation].

    PubMed

    Zheng, Zefeng; Shen, Weiliang; Le, Huihui; Dai, Xuesong; Ouyang, Hongwei; Chen, Weishan

    2016-03-01

    To investigate the effects of three-dimensional parallel collagen scaffold on the cell shape, arrangement and extracellular matrix formation of tendon stem cells. Parallel collagen scaffold was fabricated by unidirectional freezing technique, while random collagen scaffold was fabricated by freeze-drying technique. The effects of two scaffolds on cell shape and extracellular matrix formation were investigated in vitro by seeding tendon stem/progenitor cells and in vivo by ectopic implantation. Parallel and random collagen scaffolds were produced successfully. Parallel collagen scaffold was more akin to tendon than random collagen scaffold. Tendon stem/progenitor cells were spindle-shaped and unified orientated in parallel collagen scaffold, while cells on random collagen scaffold had disorder orientation. Two weeks after ectopic implantation, cells had nearly the same orientation with the collagen substance. In parallel collagen scaffold, cells had parallel arrangement, and more spindly cells were observed. By contrast, cells in random collagen scaffold were disorder. Parallel collagen scaffold can induce cells to be in spindly and parallel arrangement, and promote parallel extracellular matrix formation; while random collagen scaffold can induce cells in random arrangement. The results indicate that parallel collagen scaffold is an ideal structure to promote tendon repairing.

  7. Preconditioner Circuit Analysis

    DTIC Science & Technology

    2011-09-01

    S) Matthew J. Nye 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School Monterey, CA 939435–000 8. PERFORMING... ORGANIZATION REPORT NUMBER 9. SPONSORING /MONITORING AGENCY NAME(S) AND ADDRESS(ES) N/A 10. SPONSORING/MONITORING AGENCY REPORT NUMBER 11...of the simulations and the theoretical computations. D. THESIS ORGANIZATION This thesis is organized into four chapters. The theoretical

  8. Graphics applications utilizing parallel processing

    NASA Technical Reports Server (NTRS)

    Rice, John R.

    1990-01-01

    The results are presented of research conducted to develop a parallel graphic application algorithm to depict the numerical solution of the 1-D wave equation, the vibrating string. The research was conducted on a Flexible Flex/32 multiprocessor and a Sequent Balance 21000 multiprocessor. The wave equation is implemented using the finite difference method. The synchronization issues that arose from the parallel implementation and the strategies used to alleviate the effects of the synchronization overhead are discussed.

  9. Visualizing Parallel Computer System Performance

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.

    1988-01-01

    Parallel computer systems are among the most complex of man's creations, making satisfactory performance characterization difficult. Despite this complexity, there are strong, indeed, almost irresistible, incentives to quantify parallel system performance using a single metric. The fallacy lies in succumbing to such temptations. A complete performance characterization requires not only an analysis of the system's constituent levels, it also requires both static and dynamic characterizations. Static or average behavior analysis may mask transients that dramatically alter system performance. Although the human visual system is remarkedly adept at interpreting and identifying anomalies in false color data, the importance of dynamic, visual scientific data presentation has only recently been recognized Large, complex parallel system pose equally vexing performance interpretation problems. Data from hardware and software performance monitors must be presented in ways that emphasize important events while eluding irrelevant details. Design approaches and tools for performance visualization are the subject of this paper.

  10. Fast parallel algorithm for slicing STL based on pipeline

    NASA Astrophysics Data System (ADS)

    Ma, Xulong; Lin, Feng; Yao, Bo

    2016-05-01

    In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.

  11. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.

  12. A possibility of parallel and anti-parallel diffraction measurements on neu- tron diffractometer employing bent perfect crystal monochromator at the monochromatic focusing condition

    NASA Astrophysics Data System (ADS)

    Choi, Yong Nam; Kim, Shin Ae; Kim, Sung Kyu; Kim, Sung Baek; Lee, Chang-Hee; Mikula, Pavel

    2004-07-01

    In a conventional diffractometer having single monochromator, only one position, parallel position, is used for the diffraction experiment (i.e. detection) because the resolution property of the other one, anti-parallel position, is very poor. However, a bent perfect crystal (BPC) monochromator at monochromatic focusing condition can provide a quite flat and equal resolution property at both parallel and anti-parallel positions and thus one can have a chance to use both sides for the diffraction experiment. From the data of the FWHM and the Delta d/d measured on three diffraction geometries (symmetric, asymmetric compression and asymmetric expansion), we can conclude that the simultaneous diffraction measurement in both parallel and anti-parallel positions can be achieved.

  13. Quality Content in Distance Education

    ERIC Educational Resources Information Center

    Yildiz, Ezgi Pelin; Isman, Aytekin

    2016-01-01

    In parallel with technological advances in today's world of education activities can be conducted without the constraints of time and space. One of the most important of these activities is distance education. The success of the distance education is possible with content quality. The proliferation of e-learning environment has brought a need for…

  14. Preconditioned conjugate gradient methods for the compressible Navier-Stokes equations

    NASA Technical Reports Server (NTRS)

    Venkatakrishnan, V.

    1990-01-01

    The compressible Navier-Stokes equations are solved for a variety of two-dimensional inviscid and viscous problems by preconditioned conjugate gradient-like algorithms. Roe's flux difference splitting technique is used to discretize the inviscid fluxes. The viscous terms are discretized by using central differences. An algebraic turbulence model is also incorporated. The system of linear equations which arises out of the linearization of a fully implicit scheme is solved iteratively by the well known methods of GMRES (Generalized Minimum Residual technique) and Chebyschev iteration. Incomplete LU factorization and block diagonal factorization are used as preconditioners. The resulting algorithm is competitive with the best current schemes, but has wide applications in parallel computing and unstructured mesh computations.

  15. Domain decomposition methods in computational fluid dynamics

    NASA Technical Reports Server (NTRS)

    Gropp, William D.; Keyes, David E.

    1991-01-01

    The divide-and-conquer paradigm of iterative domain decomposition, or substructuring, has become a practical tool in computational fluid dynamic applications because of its flexibility in accommodating adaptive refinement through locally uniform (or quasi-uniform) grids, its ability to exploit multiple discretizations of the operator equations, and the modular pathway it provides towards parallelism. These features are illustrated on the classic model problem of flow over a backstep using Newton's method as the nonlinear iteration. Multiple discretizations (second-order in the operator and first-order in the preconditioner) and locally uniform mesh refinement pay dividends separately, and they can be combined synergistically. Sample performance results are included from an Intel iPSC/860 hypercube implementation.

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Collins, Benjamin S.

    The Futility package contains the following: 1) Definition of the size of integers and real numbers; 2) A generic Unit test harness; 3) Definitions for some basic extensions to the Fortran language: arbitrary length strings, a parameter list construct, exception handlers, command line processor, timers; 4) Geometry definitions: point, line, plane, box, cylinder, polyhedron; 5) File wrapper functions: standard Fortran input/output files, Fortran binary files, HDF5 files; 6) Parallel wrapper functions: MPI, and Open MP abstraction layers, partitioning algorithms; 7) Math utilities: BLAS, Matrix and Vector definitions, Linear Solver methods and wrappers for other TPLs (PETSC, MKL, etc), preconditioner classes;more » 8) Misc: random number generator, water saturation properties, sorting algorithms.« less

  17. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad [Rochester, MN

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  18. Parallel, Distributed Scripting with Python

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Miller, P J

    2002-05-24

    Parallel computers used to be, for the most part, one-of-a-kind systems which were extremely difficult to program portably. With SMP architectures, the advent of the POSIX thread API and OpenMP gave developers ways to portably exploit on-the-box shared memory parallelism. Since these architectures didn't scale cost-effectively, distributed memory clusters were developed. The associated MPI message passing libraries gave these systems a portable paradigm too. Having programmers effectively use this paradigm is a somewhat different question. Distributed data has to be explicitly transported via the messaging system in order for it to be useful. In high level languages, the MPI librarymore » gives access to data distribution routines in C, C++, and FORTRAN. But we need more than that. Many reasonable and common tasks are best done in (or as extensions to) scripting languages. Consider sysadm tools such as password crackers, file purgers, etc ... These are simple to write in a scripting language such as Python (an open source, portable, and freely available interpreter). But these tasks beg to be done in parallel. Consider the a password checker that checks an encrypted password against a 25,000 word dictionary. This can take around 10 seconds in Python (6 seconds in C). It is trivial to parallelize if you can distribute the information and co-ordinate the work.« less

  19. Mouse-tracking evidence for parallel anticipatory option evaluation.

    PubMed

    Cranford, Edward A; Moss, Jarrod

    2017-12-23

    In fast-paced, dynamic tasks, the ability to anticipate the future outcome of a sequence of events is crucial to quickly selecting an appropriate course of action among multiple alternative options. There are two classes of theories that describe how anticipation occurs. Serial theories assume options are generated and evaluated one at a time, in order of quality, whereas parallel theories assume simultaneous generation and evaluation. The present research examined the option evaluation process during a task designed to be analogous to prior anticipation tasks, but within the domain of narrative text comprehension. Prior research has relied on indirect, off-line measurement of the option evaluation process during anticipation tasks. Because the movement of the hand can provide a window into underlying cognitive processes, online metrics such as continuous mouse tracking provide more fine-grained measurements of cognitive processing as it occurs in real time. In this study, participants listened to three-sentence stories and predicted the protagonists' final action by moving a mouse toward one of three possible options. Each story was presented with either one (control condition) or two (distractor condition) plausible ending options. Results seem most consistent with a parallel option evaluation process because initial mouse trajectories deviated further from the best option in the distractor condition compared to the control condition. It is difficult to completely rule out all possible serial processing accounts, although the results do place constraints on the time frame in which a serial processing explanation must operate.

  20. Parallel optoelectronic trinary signed-digit division

    NASA Astrophysics Data System (ADS)

    Alam, Mohammad S.

    1999-03-01

    The trinary signed-digit (TSD) number system has been found to be very useful for parallel addition and subtraction of any arbitrary length operands in constant time. Using the TSD addition and multiplication modules as the basic building blocks, we develop an efficient algorithm for performing parallel TSD division in constant time. The proposed division technique uses one TSD subtraction and two TSD multiplication steps. An optoelectronic correlator based architecture is suggested for implementation of the proposed TSD division algorithm, which fully exploits the parallelism and high processing speed of optics. An efficient spatial encoding scheme is used to ensure better utilization of space bandwidth product of the spatial light modulators used in the optoelectronic implementation.

  1. Parallel grid population

    DOEpatents

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.

  2. A high-speed linear algebra library with automatic parallelism

    NASA Technical Reports Server (NTRS)

    Boucher, Michael L.

    1994-01-01

    Parallel or distributed processing is key to getting highest performance workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely small even though there are numerous computationally demanding programs that would significantly benefit from application of parallel processing. This paper describes DSSLIB, which is a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.

  3. Parallel Digital Phase-Locked Loops

    NASA Technical Reports Server (NTRS)

    Sadr, Ramin; Shah, Biren N.; Hinedi, Sami M.

    1995-01-01

    Wide-band microwave receivers of proposed type include digital phase-locked loops in which band-pass filtering and down-conversion of input signals implemented by banks of multirate digital filters operating in parallel. Called "parallel digital phase-locked loops" to distinguish them from other digital phase-locked loops. Systems conceived as cost-effective solution to problem of filtering signals at high sampling rates needed to accommodate wide input frequency bands. Each of M filters process 1/M of spectrum of signal.

  4. Fringe Capacitance of a Parallel-Plate Capacitor.

    ERIC Educational Resources Information Center

    Hale, D. P.

    1978-01-01

    Describes an experiment designed to measure the forces between charged parallel plates, and determines the relationship among the effective electrode area, the measured capacitance values, and the electrode spacing of a parallel plate capacitor. (GA)

  5. Support for Debugging Automatically Parallelized Programs

    NASA Technical Reports Server (NTRS)

    Hood, Robert; Jost, Gabriele

    2001-01-01

    This viewgraph presentation provides information on support sources available for the automatic parallelization of computer program. CAPTools, a support tool developed at the University of Greenwich, transforms, with user guidance, existing sequential Fortran code into parallel message passing code. Comparison routines are then run for debugging purposes, in essence, ensuring that the code transformation was accurate.

  6. Power transformation for enhancing responsiveness of quality of life questionnaire.

    PubMed

    Zhou, YanYan Ange

    2015-01-01

    We investigate the effect of power transformation of raw scores on the responsiveness of quality of life survey. The procedure maximizes the paired t-test value on the power transformed data to obtain an optimal power range. The parallel between the Box-Cox transformation is also investigated for the quality of life data.

  7. Parallel community climate model: Description and user`s guide

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain intomore » geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.« less

  8. Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.

    Processing data communications events in a parallel active messaging interface (`PAMI`) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for themore » context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context.« less

  9. Programming Probabilistic Structural Analysis for Parallel Processing Computer

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Chen, Heh-Chyun; Twisdale, Lawrence A.; Chamis, Christos C.; Murthy, Pappu L. N.

    1991-01-01

    The ultimate goal of this research program is to make Probabilistic Structural Analysis (PSA) computationally efficient and hence practical for the design environment by achieving large scale parallelism. The paper identifies the multiple levels of parallelism in PSA, identifies methodologies for exploiting this parallelism, describes the development of a parallel stochastic finite element code, and presents results of two example applications. It is demonstrated that speeds within five percent of those theoretically possible can be achieved. A special-purpose numerical technique, the stochastic preconditioned conjugate gradient method, is also presented and demonstrated to be extremely efficient for certain classes of PSA problems.

  10. Parallel processing of genomics data

    NASA Astrophysics Data System (ADS)

    Agapito, Giuseppe; Guzzi, Pietro Hiram; Cannataro, Mario

    2016-10-01

    The availability of high-throughput experimental platforms for the analysis of biological samples, such as mass spectrometry, microarrays and Next Generation Sequencing, have made possible to analyze a whole genome in a single experiment. Such platforms produce an enormous volume of data per single experiment, thus the analysis of this enormous flow of data poses several challenges in term of data storage, preprocessing, and analysis. To face those issues, efficient, possibly parallel, bioinformatics software needs to be used to preprocess and analyze data, for instance to highlight genetic variation associated with complex diseases. In this paper we present a parallel algorithm for the parallel preprocessing and statistical analysis of genomics data, able to face high dimension of data and resulting in good response time. The proposed system is able to find statistically significant biological markers able to discriminate classes of patients that respond to drugs in different ways. Experiments performed on real and synthetic genomic datasets show good speed-up and scalability.

  11. Highly parallel sparse Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.

  12. Parallel Adaptive Mesh Refinement Library

    NASA Technical Reports Server (NTRS)

    Mac-Neice, Peter; Olson, Kevin

    2005-01-01

    Parallel Adaptive Mesh Refinement Library (PARAMESH) is a package of Fortran 90 subroutines designed to provide a computer programmer with an easy route to extension of (1) a previously written serial code that uses a logically Cartesian structured mesh into (2) a parallel code with adaptive mesh refinement (AMR). Alternatively, in its simplest use, and with minimal effort, PARAMESH can operate as a domain-decomposition tool for users who want to parallelize their serial codes but who do not wish to utilize adaptivity. The package builds a hierarchy of sub-grids to cover the computational domain of a given application program, with spatial resolution varying to satisfy the demands of the application. The sub-grid blocks form the nodes of a tree data structure (a quad-tree in two or an oct-tree in three dimensions). Each grid block has a logically Cartesian mesh. The package supports one-, two- and three-dimensional models.

  13. Single-agent parallel window search

    NASA Technical Reports Server (NTRS)

    Powley, Curt; Korf, Richard E.

    1991-01-01

    Parallel window search is applied to single-agent problems by having different processes simultaneously perform iterations of Iterative-Deepening-A(asterisk) (IDA-asterisk) on the same problem but with different cost thresholds. This approach is limited by the time to perform the goal iteration. To overcome this disadvantage, the authors consider node ordering. They discuss how global node ordering by minimum h among nodes with equal f = g + h values can reduce the time complexity of serial IDA-asterisk by reducing the time to perform the iterations prior to the goal iteration. Finally, the two ideas of parallel window search and node ordering are combined to eliminate the weaknesses of each approach while retaining the strengths. The resulting approach, called simply parallel window search, can be used to find a near-optimal solution quickly, improve the solution until it is optimal, and then finally guarantee optimality, depending on the amount of time available.

  14. Large-scale parallel genome assembler over cloud computing environment.

    PubMed

    Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong

    2017-06-01

    The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.

  15. Parallel Algorithms for Switching Edges in Heterogeneous Graphs.

    PubMed

    Bhuiyan, Hasanuzzaman; Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

    2017-06-01

    An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors.

  16. Parallel Algorithms for Switching Edges in Heterogeneous Graphs☆

    PubMed Central

    Khan, Maleq; Chen, Jiangzhuo; Marathe, Madhav

    2017-01-01

    An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices are swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. PMID:28757680

  17. Parallel software support for computational structural mechanics

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1987-01-01

    The application of the parallel programming methodology known as the Force was conducted. Two application issues were addressed. The first involves the efficiency of the implementation and its completeness in terms of satisfying the needs of other researchers implementing parallel algorithms. Support for, and interaction with, other Computational Structural Mechanics (CSM) researchers using the Force was the main issue, but some independent investigation of the Barrier construct, which is extremely important to overall performance, was also undertaken. Another efficiency issue which was addressed was that of relaxing the strong synchronization condition imposed on the self-scheduled parallel DO loop. The Force was extended by the addition of logical conditions to the cases of a parallel case construct and by the inclusion of a self-scheduled version of this construct. The second issue involved applying the Force to the parallelization of finite element codes such as those found in the NICE/SPAR testbed system. One of the more difficult problems encountered is the determination of what information in COMMON blocks is actually used outside of a subroutine and when a subroutine uses a COMMON block merely as scratch storage for internal temporary results.

  18. Efficient solvers for coupled models in respiratory mechanics.

    PubMed

    Verdugo, Francesc; Roth, Christian J; Yoshihara, Lena; Wall, Wolfgang A

    2017-02-01

    We present efficient preconditioners for one of the most physiologically relevant pulmonary models currently available. Our underlying motivation is to enable the efficient simulation of such a lung model on high-performance computing platforms in order to assess mechanical ventilation strategies and contributing to design more protective patient-specific ventilation treatments. The system of linear equations to be solved using the proposed preconditioners is essentially the monolithic system arising in fluid-structure interaction (FSI) extended by additional algebraic constraints. The introduction of these constraints leads to a saddle point problem that cannot be solved with usual FSI preconditioners available in the literature. The key ingredient in this work is to use the idea of the semi-implicit method for pressure-linked equations (SIMPLE) for getting rid of the saddle point structure, resulting in a standard FSI problem that can be treated with available techniques. The numerical examples show that the resulting preconditioners approach the optimal performance of multigrid methods, even though the lung model is a complex multiphysics problem. Moreover, the preconditioners are robust enough to deal with physiologically relevant simulations involving complex real-world patient-specific lung geometries. The same approach is applicable to other challenging biomedical applications where coupling between flow and tissue deformations is modeled with additional algebraic constraints. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.

  19. The Complexity of Parallel Algorithms,

    DTIC Science & Technology

    1985-11-01

    programns have been written for se(luiential coiipn ters. Many p~eop~le want coimp ~ilers dihal. will c(nimpile t he, code for parallel machines, to avoid...between two vertices. We also rely on parallel algorithms for maintaining data structures and manipulating graphs. We do not go into the details of these...Jpatlis and maintain connected coimp ~onents. The routine is: - 35 .- ExtendPath(r, Q, V) begin P +-0; s 4- while there is a path in V - P from s to a vertex

  20. Supercomputing on massively parallel bit-serial architectures

    NASA Technical Reports Server (NTRS)

    Iobst, Ken

    1985-01-01

    Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

  1. Implementation of parallel moment equations in NIMROD

    NASA Astrophysics Data System (ADS)

    Lee, Hankyu Q.; Held, Eric D.; Ji, Jeong-Young

    2017-10-01

    As collisionality is low (the Knudsen number is large) in many plasma applications, kinetic effects become important, particularly in parallel dynamics for magnetized plasmas. Fluid models can capture some kinetic effects when integral parallel closures are adopted. The adiabatic and linear approximations are used in solving general moment equations to obtain the integral closures. In this work, we present an effort to incorporate non-adiabatic (time-dependent) and nonlinear effects into parallel closures. Instead of analytically solving the approximate moment system, we implement exact parallel moment equations in the NIMROD fluid code. The moment code is expected to provide a natural convergence scheme by increasing the number of moments. Work in collaboration with the PSI Center and supported by the U.S. DOE under Grant Nos. DE-SC0014033, DE-SC0016256, and DE-FG02-04ER54746.

  2. Visualization and Tracking of Parallel CFD Simulations

    NASA Technical Reports Server (NTRS)

    Vaziri, Arsi; Kremenetsky, Mark

    1995-01-01

    We describe a system for interactive visualization and tracking of a 3-D unsteady computational fluid dynamics (CFD) simulation on a parallel computer. CM/AVS, a distributed, parallel implementation of a visualization environment (AVS) runs on the CM-5 parallel supercomputer. A CFD solver is run as a CM/AVS module on the CM-5. Data communication between the solver, other parallel visualization modules, and a graphics workstation, which is running AVS, are handled by CM/AVS. Partitioning of the visualization task, between CM-5 and the workstation, can be done interactively in the visual programming environment provided by AVS. Flow solver parameters can also be altered by programmable interactive widgets. This system partially removes the requirement of storing large solution files at frequent time steps, a characteristic of the traditional 'simulate (yields) store (yields) visualize' post-processing approach.

  3. Runtime support for parallelizing data mining algorithms

    NASA Astrophysics Data System (ADS)

    Jin, Ruoming; Agrawal, Gagan

    2002-03-01

    With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the technique we have developed starting from a common specification of the algorithm.

  4. Parallel approach in RDF query processing

    NASA Astrophysics Data System (ADS)

    Vajgl, Marek; Parenica, Jan

    2017-07-01

    Parallel approach is nowadays a very cheap solution to increase computational power due to possibility of usage of multithreaded computational units. This hardware became typical part of nowadays personal computers or notebooks and is widely spread. This contribution deals with experiments how evaluation of computational complex algorithm of the inference over RDF data can be parallelized over graphical cards to decrease computational time.

  5. Parallelizing alternating direction implicit solver on GPUs

    USDA-ARS?s Scientific Manuscript database

    We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource con...

  6. National Combustion Code Parallel Performance Enhancements

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Benyo, Theresa (Technical Monitor)

    2002-01-01

    The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. The unstructured grid, reacting flow code uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC code to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This report describes recent parallel processing modifications to NCC that have improved the parallel scalability of the code, enabling a two hour turnaround for a 1.3 million element fully reacting combustion simulation on an SGI Origin 2000.

  7. Quality of Education, Comparability, and Assessment Choice in Developing Countries

    ERIC Educational Resources Information Center

    Wagner, Daniel A.

    2010-01-01

    Over the past decade, international development agencies have begun to emphasize the improvement of the quality (rather than simply quantity) of education in developing countries. This new focus has been paralleled by a significant increase in the use of educational assessments as a way to measure gains and losses in quality. As this interest in…

  8. A rapid parallelization of cone-beam projection and back-projection operator based on texture fetching interpolation

    NASA Astrophysics Data System (ADS)

    Xie, Lizhe; Hu, Yining; Chen, Yang; Shi, Luyao

    2015-03-01

    Projection and back-projection are the most computational consuming parts in Computed Tomography (CT) reconstruction. Parallelization strategies using GPU computing techniques have been introduced. We in this paper present a new parallelization scheme for both projection and back-projection. The proposed method is based on CUDA technology carried out by NVIDIA Corporation. Instead of build complex model, we aimed on optimizing the existing algorithm and make it suitable for CUDA implementation so as to gain fast computation speed. Besides making use of texture fetching operation which helps gain faster interpolation speed, we fixed sampling numbers in the computation of projection, to ensure the synchronization of blocks and threads, thus prevents the latency caused by inconsistent computation complexity. Experiment results have proven the computational efficiency and imaging quality of the proposed method.

  9. The BLAZE language: A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.

  10. The BLAZE language - A parallel language for scientific programming

    NASA Technical Reports Server (NTRS)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  11. Tolerant (parallel) Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. For implementing F-Nets, Cooperative Data Sharing (CDS) is a subroutine package for implementing communication efficiently in a large number of environments (e.g. shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g. data abstraction, array operations). Finally, L2(sup 3) is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.

  12. MULTIOBJECTIVE PARALLEL GENETIC ALGORITHM FOR WASTE MINIMIZATION

    EPA Science Inventory

    In this research we have developed an efficient multiobjective parallel genetic algorithm (MOPGA) for waste minimization problems. This MOPGA integrates PGAPack (Levine, 1996) and NSGA-II (Deb, 2000) with novel modifications. PGAPack is a master-slave parallel implementation of a...

  13. Partitioning problems in parallel, pipelined and distributed computing

    NASA Technical Reports Server (NTRS)

    Bokhari, S.

    1985-01-01

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.

  14. Parallelized Three-Dimensional Resistivity Inversion Using Finite Elements And Adjoint State Methods

    NASA Astrophysics Data System (ADS)

    Schaa, Ralf; Gross, Lutz; Du Plessis, Jaco

    2015-04-01

    resistivity. The Hessian of the regularization term is used as preconditioner which requires an additional PDE solution in each iteration step. As it turns out, the relevant PDEs are naturally formulated in the finite element framework. Using the domain decomposition method provided in Escript, the inversion scheme has been parallelized for distributed memory computers with multi-core shared memory nodes. We show numerical examples from simple layered models to complex 3D models and compare with the results from other methods. The inversion scheme is furthermore tested on a field data example to characterise localised freshwater discharge in a coastal environment.. References: L. Gross and C. Kemp (2013) Large Scale Joint Inversion of Geophysical Data using the Finite Element Method in escript. ASEG Extended Abstracts 2013, http://dx.doi.org/10.1071/ASEG2013ab306

  15. Optimisation of a parallel ocean general circulation model

    NASA Astrophysics Data System (ADS)

    Beare, M. I.; Stevens, D. P.

    1997-10-01

    This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing routines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel efficiency of the model is adversely affected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers.

  16. Parallel computation and the basis system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Smith, G.R.

    1993-05-01

    A software package has been written that can facilitate efforts to develop powerful, flexible, and easy-to use programs that can run in single-processor, massively parallel, and distributed computing environments. Particular attention has been given to the difficulties posed by a program consisting of many science packages that represent subsystems of a complicated, coupled system. Methods have been found to maintain independence of the packages by hiding data structures without increasing the communications costs in a parallel computing environment. Concepts developed in this work are demonstrated by a prototype program that uses library routines from two existing software systems, Basis andmore » Parallel Virtual Machine (PVM). Most of the details of these libraries have been encapsulated in routines and macros that could be rewritten for alternative libraries that possess certain minimum capabilities. The prototype software uses a flexible master-and-slaves paradigm for parallel computation and supports domain decomposition with message passing for partitioning work among slaves. Facilities are provided for accessing variables that are distributed among the memories of slaves assigned to subdomains. The software is named PROTOPAR.« less

  17. The science of computing - Parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1985-01-01

    Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.

  18. Symmetry Breaking by Parallel Flow Shear

    NASA Astrophysics Data System (ADS)

    Li, Jiacong; Diamond, Patrick

    2015-11-01

    Plasma rotation is important in reducing turbulent transport, suppressing MHD instabilities, and is beneficial to confinement. Intrinsic rotation without an external momentum input is of interest for its plausible application on ITER. k∥ spectrum asymmetry is required for residual Reynolds stress that drives the intrinsic rotation. Parallel flows are reported in linear devices without magnetic shear. In CSDX, parallel flows are mostly peaked in the core [Thakur et al., 2014]; more robust flows and reversed profiles are seen in PANTA [Oldenburger, et al. 2012]. A novel mechanism for symmetry breaking in momentum transport is proposed. Magnetic shear or mean flow profile are not required. A seed parallel flow shear (PFS) sets the sign of residual stress by selecting certain modes to grow faster. The resulted spectrum imbalance leads to a nonzero residual stress, which further drives a parallel flow with ∇n as the free energy source, adding to the shear until saturated by diffusion. Balanced flow gradient is set by Π∥Res /χϕ . Residual stress is calculated for ITG turbulence and collisional drift wave turbulence where electron-ion and electron-neutral collisions are discussed and compared. Numerical simulation is proposed for testing the effect of PFS.

  19. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.

    2003-01-01

    Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.

  20. Rubus: A compiler for seamless and extensible parallelism.

    PubMed

    Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor

    2017-01-01

    Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been

  1. Rubus: A compiler for seamless and extensible parallelism

    PubMed Central

    Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

    2017-01-01

    Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been

  2. 3D printed soft parallel actuator

    NASA Astrophysics Data System (ADS)

    Zolfagharian, Ali; Kouzani, Abbas Z.; Khoo, Sui Yang; Noshadi, Amin; Kaynak, Akif

    2018-04-01

    This paper presents a 3-dimensional (3D) printed soft parallel contactless actuator for the first time. The actuator involves an electro-responsive parallel mechanism made of two segments namely active chain and passive chain both 3D printed. The active chain is attached to the ground from one end and constitutes two actuator links made of responsive hydrogel. The passive chain, on the other hand, is attached to the active chain from one end and consists of two rigid links made of polymer. The actuator links are printed using an extrusion-based 3D-Bioplotter with polyelectrolyte hydrogel as printer ink. The rigid links are also printed by a 3D fused deposition modelling (FDM) printer with acrylonitrile butadiene styrene (ABS) as print material. The kinematics model of the soft parallel actuator is derived via transformation matrices notations to simulate and determine the workspace of the actuator. The printed soft parallel actuator is then immersed into NaOH solution with specific voltage applied to it via two contactless electrodes. The experimental data is then collected and used to develop a parametric model to estimate the end-effector position and regulate kinematics model in response to specific input voltage over time. It is observed that the electroactive actuator demonstrates expected behaviour according to the simulation of its kinematics model. The use of 3D printing for the fabrication of parallel soft actuators opens a new chapter in manufacturing sophisticated soft actuators with high dexterity and mechanical robustness for biomedical applications such as cell manipulation and drug release.

  3. MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program

    NASA Astrophysics Data System (ADS)

    Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.

    2018-02-01

    We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.

  4. Aerodynamic simulation on massively parallel systems

    NASA Technical Reports Server (NTRS)

    Haeuser, Jochem; Simon, Horst D.

    1992-01-01

    This paper briefly addresses the computational requirements for the analysis of complete configurations of aircraft and spacecraft currently under design to be used for advanced transportation in commercial applications as well as in space flight. The discussion clearly shows that massively parallel systems are the only alternative which is both cost effective and on the other hand can provide the necessary TeraFlops, needed to satisfy the narrow design margins of modern vehicles. It is assumed that the solution of the governing physical equations, i.e., the Navier-Stokes equations which may be complemented by chemistry and turbulence models, is done on multiblock grids. This technique is situated between the fully structured approach of classical boundary fitted grids and the fully unstructured tetrahedra grids. A fully structured grid best represents the flow physics, while the unstructured grid gives best geometrical flexibility. The multiblock grid employed is structured within a block, but completely unstructured on the block level. While a completely unstructured grid is not straightforward to parallelize, the above mentioned multiblock grid is inherently parallel, in particular for multiple instruction multiple datastream (MIMD) machines. In this paper guidelines are provided for setting up or modifying an existing sequential code so that a direct parallelization on a massively parallel system is possible. Results are presented for three parallel systems, namely the Intel hypercube, the Ncube hypercube, and the FPS 500 system. Some preliminary results for an 8K CM2 machine will also be mentioned. The code run is the two dimensional grid generation module of Grid, which is a general two dimensional and three dimensional grid generation code for complex geometries. A system of nonlinear Poisson equations is solved. This code is also a good testcase for complex fluid dynamics codes, since the same datastructures are used. All systems provided good speedups, but

  5. Parallel/Vector Integration Methods for Dynamical Astronomy

    NASA Astrophysics Data System (ADS)

    Fukushima, Toshio

    1999-01-01

    This paper reviews three recent works on the numerical methods to integrate ordinary differential equations (ODE), which are specially designed for parallel, vector, and/or multi-processor-unit(PU) computers. The first is the Picard-Chebyshev method (Fukushima, 1997a). It obtains a global solution of ODE in the form of Chebyshev polynomial of large (> 1000) degree by applying the Picard iteration repeatedly. The iteration converges for smooth problems and/or perturbed dynamics. The method runs around 100-1000 times faster in the vector mode than in the scalar mode of a certain computer with vector processors (Fukushima, 1997b). The second is a parallelization of a symplectic integrator (Saha et al., 1997). It regards the implicit midpoint rules covering thousands of timesteps as large-scale nonlinear equations and solves them by the fixed-point iteration. The method is applicable to Hamiltonian systems and is expected to lead an acceleration factor of around 50 in parallel computers with more than 1000 PUs. The last is a parallelization of the extrapolation method (Ito and Fukushima, 1997). It performs trial integrations in parallel. Also the trial integrations are further accelerated by balancing computational load among PUs by the technique of folding. The method is all-purpose and achieves an acceleration factor of around 3.5 by using several PUs. Finally, we give a perspective on the parallelization of some implicit integrators which require multiple corrections in solving implicit formulas like the implicit Hermitian integrators (Makino and Aarseth, 1992), (Hut et al., 1995) or the implicit symmetric multistep methods (Fukushima, 1998), (Fukushima, 1999).

  6. Modelling parallel programs and multiprocessor architectures with AXE

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Fineman, Charles E.

    1991-01-01

    AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate for parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior is described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.

  7. Parallel Narrative Structure in Paul Harding's "Tinkers"

    ERIC Educational Resources Information Center

    Çirakli, Mustafa Zeki

    2014-01-01

    The present paper explores the implications of parallel narrative structure in Paul Harding's "Tinkers" (2009). Besides primarily recounting the two sets of parallel narratives, "Tinkers" also comprises of seemingly unrelated fragments such as excerpts from clock repair manuals and diaries. The main stories, however, told…

  8. NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism Correctness

    DTIC Science & Technology

    2011-12-16

    other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a...Lab affiliates National Instruments, NEC, Nokia , NVIDIA, and Samsung. NDetermin: Inferring Nondeterministic Sequential Specifications for Parallelism...concurrently update x, some of these CAS’s will fail and those parallel loop iterations will recompute their updates to x and try again. Consider the parallel

  9. Parallel Algorithms for Least Squares and Related Computations.

    DTIC Science & Technology

    1991-03-22

    for dense computations in linear algebra . The work has recently been published in a general reference book on parallel algorithms by SIAM. AFO SR...written his Ph.D. dissertation with the principal investigator. (See publication 6.) • Parallel Algorithms for Dense Linear Algebra Computations. Our...and describe and to put into perspective a selection of the more important parallel algorithms for numerical linear algebra . We give a major new

  10. Improving operating room productivity via parallel anesthesia processing.

    PubMed

    Brown, Michael J; Subramanian, Arun; Curry, Timothy B; Kor, Daryl J; Moran, Steven L; Rohleder, Thomas R

    2014-01-01

    Parallel processing of regional anesthesia may improve operating room (OR) efficiency in patients undergoes upper extremity surgical procedures. The purpose of this paper is to evaluate whether performing regional anesthesia outside the OR in parallel increases total cases per day, improve efficiency and productivity. Data from all adult patients who underwent regional anesthesia as their primary anesthetic for upper extremity surgery over a one-year period were used to develop a simulation model. The model evaluated pure operating modes of regional anesthesia performed within and outside the OR in a parallel manner. The scenarios were used to evaluate how many surgeries could be completed in a standard work day (555 minutes) and assuming a standard three cases per day, what was the predicted end-of-day time overtime. Modeling results show that parallel processing of regional anesthesia increases the average cases per day for all surgeons included in the study. The average increase was 0.42 surgeries per day. Where it was assumed that three cases per day would be performed by all surgeons, the days going to overtime was reduced by 43 percent with parallel block. The overtime with parallel anesthesia was also projected to be 40 minutes less per day per surgeon. Key limitations include the assumption that all cases used regional anesthesia in the comparisons. Many days may have both regional and general anesthesia. Also, as a case study, single-center research may limit generalizability. Perioperative care providers should consider parallel administration of regional anesthesia where there is a desire to increase daily upper extremity surgical case capacity. Where there are sufficient resources to do parallel anesthesia processing, efficiency and productivity can be significantly improved. Simulation modeling can be an effective tool to show practice change effects at a system-wide level.

  11. Line-drawing algorithms for parallel machines

    NASA Technical Reports Server (NTRS)

    Pang, Alex T.

    1990-01-01

    The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.

  12. Genetic algorithms using SISAL parallel programming language

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tejada, S.

    1994-05-06

    Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will l discuss the implementation and performance of parallel genetic algorithms in SISAL.

  13. On the parallel solution of parabolic equations

    NASA Technical Reports Server (NTRS)

    Gallopoulos, E.; Saad, Youcef

    1989-01-01

    Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on using polynomial approximation to the exponential. It does not require solving any linear systems and is highly parallelizable. The two other methods proposed are based on Pade and Chebyshev approximations to the matrix exponential. The parallelization of these methods is achieved by using partial fraction decomposition techniques to solve the resulting systems and thus offers the potential for increased time parallelism in time dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.

  14. Nemesis I: Parallel Enhancements to ExodusII

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hennigan, Gary L.; John, Matthew S.; Shadid, John N.

    2006-03-28

    NEMESIS I is an enhancement to the EXODUS II finite element database model used to store and retrieve data for unstructured parallel finite element analyses. NEMESIS I adds data structures which facilitate the partitioning of a scalar (standard serial) EXODUS II file onto parallel disk systems found on many parallel computers. Since the NEMESIS I application programming interface (APl)can be used to append information to an existing EXODUS II files can be used on files which contain NEMESIS I information. The NEMESIS I information is written and read via C or C++ callable functions which compromise the NEMESIS I API.

  15. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  16. PISCES: An environment for parallel scientific computation

    NASA Technical Reports Server (NTRS)

    Pratt, T. W.

    1985-01-01

    The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is in providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.

  17. Run-time parallelization and scheduling of loops

    NASA Technical Reports Server (NTRS)

    Saltz, Joel H.; Mirchandaney, Ravi; Baxter, Doug

    1988-01-01

    The class of problems that can be effectively compiled by parallelizing compilers is discussed. This is accomplished with the doconsider construct which would allow these compilers to parallelize many problems in which substantial loop-level parallelism is available but cannot be detected by standard compile-time analysis. We describe and experimentally analyze mechanisms used to parallelize the work required for these types of loops. In each of these methods, a new loop structure is produced by modifying the loop to be parallelized. We also present the rules by which these loop transformations may be automated in order that they be included in language compilers. The main application area of the research involves problems in scientific computations and engineering. The workload used in our experiment includes a mixture of real problems as well as synthetically generated inputs. From our extensive tests on the Encore Multimax/320, we have reached the conclusion that for the types of workloads we have investigated, self-execution almost always performs better than pre-scheduling. Further, the improvement in performance that accrues as a result of global topological sorting of indices as opposed to the less expensive local sorting, is not very significant in the case of self-execution.

  18. Efficiency and flexibility using implicit methods within atmosphere dycores

    NASA Astrophysics Data System (ADS)

    Evans, K. J.; Archibald, R.; Norman, M. R.; Gardner, D. J.; Woodward, C. S.; Worley, P.; Taylor, M.

    2016-12-01

    A suite of explicit and implicit methods are evaluated for a range of configurations of the shallow water dynamical core within the spectral-element Community Atmosphere Model (CAM-SE) to explore their relative computational performance. The configurations are designed to explore the attributes of each method under different but relevant model usage scenarios including varied spectral order within an element, static regional refinement, and scaling to large problem sizes. The limitations and benefits of using explicit versus implicit, with different discretizations and parameters, are discussed in light of trade-offs such as MPI communication, memory, and inherent efficiency bottlenecks. For the regionally refined shallow water configurations, the implicit BDF2 method is about the same efficiency as an explicit Runge-Kutta method, without including a preconditioner. Performance of the implicit methods with the residual function executed on a GPU is also presented; there is speed up for the residual relative to a CPU, but overwhelming transfer costs motivate moving more of the solver to the device. Given the performance behavior of implicit methods within the shallow water dynamical core, the recommendation for future work using implicit solvers is conditional based on scale separation and the stiffness of the problem. The strong growth of linear iterations with increasing resolution or time step size is the main bottleneck to computational efficiency. Within the hydrostatic dynamical core, of CAM-SE, we present results utilizing approximate block factorization preconditioners implemented using the Trilinos library of solvers. They reduce the cost of linear system solves and improve parallel scalability. We provide a summary of the remaining efficiency considerations within the preconditioner and utilization of the GPU, as well as a discussion about the benefits of a time stepping method that provides converged and stable solutions for a much wider range of time

  19. Thread concept for automatic task parallelization in image analysis

    NASA Astrophysics Data System (ADS)

    Lueckenhaus, Maximilian; Eckstein, Wolfgang

    1998-09-01

    Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when changing the hardware. Therefore it is highly desirable to do the parallelization automatically. For this we have developed a special kind of thread concept for image analysis tasks. Threads derivated from one subtask may share objects and run in the same context but may process different threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as basis of an automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads for generating and processing parallel programs by taking into account the available hardware. The tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable to speed up image processing by an automatic parallelization of image analysis tasks.

  20. Methods of parallel computation applied on granular simulations

    NASA Astrophysics Data System (ADS)

    Martins, Gustavo H. B.; Atman, Allbens P. F.

    2017-06-01

    Every year, parallel computing has becoming cheaper and more accessible. As consequence, applications were spreading over all research areas. Granular materials is a promising area for parallel computing. To prove this statement we study the impact of parallel computing in simulations of the BNE (Brazil Nut Effect). This property is due the remarkable arising of an intruder confined to a granular media when vertically shaken against gravity. By means of DEM (Discrete Element Methods) simulations, we study the code performance testing different methods to improve clock time. A comparison between serial and parallel algorithms, using OpenMP® is also shown. The best improvement was obtained by optimizing the function that find contacts using Verlet's cells.

  1. Parallel computation using boundary elements in solid mechanics

    NASA Technical Reports Server (NTRS)

    Chien, L. S.; Sun, C. T.

    1990-01-01

    The inherent parallelism of the boundary element method is shown. The boundary element is formulated by assuming the linear variation of displacements and tractions within a line element. Moreover, MACSYMA symbolic program is employed to obtain the analytical results for influence coefficients. Three computational components are parallelized in this method to show the speedup and efficiency in computation. The global coefficient matrix is first formed concurrently. Then, the parallel Gaussian elimination solution scheme is applied to solve the resulting system of equations. Finally, and more importantly, the domain solutions of a given boundary value problem are calculated simultaneously. The linear speedups and high efficiencies are shown for solving a demonstrated problem on Sequent Symmetry S81 parallel computing system.

  2. A scalable block-preconditioning strategy for divergence-conforming B-spline discretizations of the Stokes problem

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cortes, Adriano M.; Dalcin, Lisandro; Sarmiento, Adel F.

    The recently introduced divergence-conforming B-spline discretizations allow the construction of smooth discrete velocity–pressure pairs for viscous incompressible flows that are at the same time inf–sup stable and pointwise divergence-free. When applied to the discretized Stokes problem, these spaces generate a symmetric and indefinite saddle-point linear system. The iterative method of choice to solve such system is the Generalized Minimum Residual Method. This method lacks robustness, and one remedy is to use preconditioners. For linear systems of saddle-point type, a large family of preconditioners can be obtained by using a block factorization of the system. In this paper, we show howmore » the nesting of “black-box” solvers and preconditioners can be put together in a block triangular strategy to build a scalable block preconditioner for the Stokes system discretized by divergence-conforming B-splines. Lastly, besides the known cavity flow problem, we used for benchmark flows defined on complex geometries: an eccentric annulus and hollow torus of an eccentric annular cross-section.« less

  3. A scalable block-preconditioning strategy for divergence-conforming B-spline discretizations of the Stokes problem

    DOE PAGES

    Cortes, Adriano M.; Dalcin, Lisandro; Sarmiento, Adel F.; ...

    2016-10-19

    The recently introduced divergence-conforming B-spline discretizations allow the construction of smooth discrete velocity–pressure pairs for viscous incompressible flows that are at the same time inf–sup stable and pointwise divergence-free. When applied to the discretized Stokes problem, these spaces generate a symmetric and indefinite saddle-point linear system. The iterative method of choice to solve such system is the Generalized Minimum Residual Method. This method lacks robustness, and one remedy is to use preconditioners. For linear systems of saddle-point type, a large family of preconditioners can be obtained by using a block factorization of the system. In this paper, we show howmore » the nesting of “black-box” solvers and preconditioners can be put together in a block triangular strategy to build a scalable block preconditioner for the Stokes system discretized by divergence-conforming B-splines. Lastly, besides the known cavity flow problem, we used for benchmark flows defined on complex geometries: an eccentric annulus and hollow torus of an eccentric annular cross-section.« less

  4. Repeated Red-Black ordering

    NASA Astrophysics Data System (ADS)

    Ciarlet, P.

    1994-09-01

    Hereafter, we describe and analyze, from both a theoretical and a numerical point of view, an iterative method for efficiently solving symmetric elliptic problems with possibly discontinuous coefficients. In the following, we use the Preconditioned Conjugate Gradient method to solve the symmetric positive definite linear systems which arise from the finite element discretization of the problems. We focus our interest on sparse and efficient preconditioners. In order to define the preconditioners, we perform two steps: first we reorder the unknowns and then we carry out a (modified) incomplete factorization of the original matrix. We study numerically and theoretically two preconditioners, the second preconditioner corresponding to the one investigated by Brand and Heinemann [2]. We prove convergence results about the Poisson equation with either Dirichlet or periodic boundary conditions. For a meshsizeh, Brand proved that the condition number of the preconditioned system is bounded byO(h-1/2) for Dirichlet boundary conditions. By slightly modifying the preconditioning process, we prove that the condition number is bounded byO(h-1/3).

  5. Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes

    PubMed Central

    Farreras, Montse

    2014-01-01

    Abstract The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also the methodologies required to analyze these data have become more complex everyday, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. PMID:24597675

  6. Series and parallel arc-fault circuit interrupter tests.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Johnson, Jay Dean; Fresquez, Armando J.; Gudgel, Bob

    2013-07-01

    While the 2011 National Electrical Codeª (NEC) only requires series arc-fault protection, some arc-fault circuit interrupter (AFCI) manufacturers are designing products to detect and mitigate both series and parallel arc-faults. Sandia National Laboratories (SNL) has extensively investigated the electrical differences of series and parallel arc-faults and has offered possible classification and mitigation solutions. As part of this effort, Sandia National Laboratories has collaborated with MidNite Solar to create and test a 24-string combiner box with an AFCI which detects, differentiates, and de-energizes series and parallel arc-faults. In the case of the MidNite AFCI prototype, series arc-faults are mitigated by openingmore » the PV strings, whereas parallel arc-faults are mitigated by shorting the array. A range of different experimental series and parallel arc-fault tests with the MidNite combiner box were performed at the Distributed Energy Technologies Laboratory (DETL) at SNL in Albuquerque, NM. In all the tests, the prototype de-energized the arc-faults in the time period required by the arc-fault circuit interrupt testing standard, UL 1699B. The experimental tests confirm series and parallel arc-faults can be successfully mitigated with a combiner box-integrated solution.« less

  7. Parallelization of ARC3D with Computer-Aided Tools

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; Hribar, Michelle; Yan, Jerry; Saini, Subhash (Technical Monitor)

    1998-01-01

    A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SCSI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the computer-aided tools, Cesspools. Steps of parallelizing this code and requirements of achieving better performance are discussed. The generated parallel version has achieved reasonably well performance, for example, having a speedup of 30 for 36 Cray T3E processors. However, this performance could not be obtained without modification of the original serial code. It is suggested that in many cases improving serial code and performing necessary code transformations are important parts for the automated parallelization process although user intervention in many of these parts are still necessary. Nevertheless, development and improvement of useful software tools, such as Cesspools, can help trim down many tedious parallelization details and improve the processing efficiency.

  8. Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks

    NASA Technical Reports Server (NTRS)

    Jin, Haoqiang; VanderWijngaart, Rob F.

    2003-01-01

    We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in bench-marks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.

  9. Graphical Representation of Parallel Algorithmic Processes

    DTIC Science & Technology

    1990-12-01

    interface with the AAARF main process . The source code for the AAARF class-common library is in the common subdi- rectory and consists of the following files... for public release; distribution unlimited AFIT/GCE/ENG/90D-07 Graphical Representation of Parallel Algorithmic Processes THESIS Presented to the...goal of this study is to develop an algorithm animation facility for parallel processes executing on different architectures, from multiprocessor

  10. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    NASA Technical Reports Server (NTRS)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  11. Seismic imaging using finite-differences and parallel computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ober, C.C.

    1997-12-31

    A key to reducing the risks and costs of associated with oil and gas exploration is the fast, accurate imaging of complex geologies, such as salt domes in the Gulf of Mexico and overthrust regions in US onshore regions. Prestack depth migration generally yields the most accurate images, and one approach to this is to solve the scalar wave equation using finite differences. As part of an ongoing ACTI project funded by the US Department of Energy, a finite difference, 3-D prestack, depth migration code has been developed. The goal of this work is to demonstrate that massively parallel computersmore » can be used efficiently for seismic imaging, and that sufficient computing power exists (or soon will exist) to make finite difference, prestack, depth migration practical for oil and gas exploration. Several problems had to be addressed to get an efficient code for the Intel Paragon. These include efficient I/O, efficient parallel tridiagonal solves, and high single-node performance. Furthermore, to provide portable code the author has been restricted to the use of high-level programming languages (C and Fortran) and interprocessor communications using MPI. He has been using the SUNMOS operating system, which has affected many of his programming decisions. He will present images created from two verification datasets (the Marmousi Model and the SEG/EAEG 3D Salt Model). Also, he will show recent images from real datasets, and point out locations of improved imaging. Finally, he will discuss areas of current research which will hopefully improve the image quality and reduce computational costs.« less

  12. An Efficient Objective Analysis System for Parallel Computers

    NASA Technical Reports Server (NTRS)

    Stobie, James G.

    1999-01-01

    A new objective analysis system designed for parallel computers will be described. The system can produce a global analysis (on a 2 x 2.5 lat-lon grid with 20 levels of heights and winds and 10 levels of moisture) using 120,000 observations in less than 3 minutes on 32 CPUs (SGI Origin 2000). No special parallel code is needed (e.g. MPI or multitasking) and the 32 CPUs do not have to be on the same platform. The system Ls totally portable and can run on -several different architectures at once. In addition, the system can easily scale up to 100 or more CPUS. This will allow for much higher resolution and significant increases in input data. The system scales linearly as the number of observations and the number of grid points. The cost overhead in going from I to 32 CPus is 18%. in addition, the analysis results are identical regardless of the number of processors used. T'his system has all the characteristics of optimal interpolation, combining detailed instrument and first guess error statistics to produce the best estimate of the atmospheric state. It also includes a new quality control (buddy check) system. Static tests with the system showed it's analysis increments are comparable to the latest NASA operational system including maintenance of mass-wind balance. Results from a 2-month cycling test in the Goddard EOS Data Assimilation System (GEOS DAS) show this new analysis retains the same level of agreement between the first guess and observations (0-F statistics) throughout the entire two months.

  13. Desiccant outdoor air preconditioners maximize heat recovery ventilation potentials

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meckler, M.

    1995-12-31

    Microorganisms are well protected indoors by the moisture surrounding them if the relative humidity is above 70%. They can cause many acute diseases, infections, and allergies. Humidity also has an effect on air cleanliness and causes the building structure and its contents to deteriorate. Therefore, controlling humidity is a very important factor to human health and comfort and the structural longevity of a building. To date, a great deal of research has been done, and is continuing, in the use of both solid and liquid desiccants. This paper introduces a desiccant-assisted system that combines dehumidification and mechanical refrigeration by meansmore » of a desiccant preconditioning module that can serve two or more conventional air-conditioning units. It will be demonstrated that the proposed system, also having indirect evaporative cooling within the preconditioning module, can reduce energy consumption and provide significant cost savings, independent humidity and temperature control, and, therefore, improved indoor air quality and enhanced occupant comfort.« less

  14. Parallel Architecture, Parallel Acquisition Cross-Linguistic Evidence from Nominal and Verbal Domains

    ERIC Educational Resources Information Center

    Sutton, Brett R.

    2017-01-01

    This dissertation explores parallels between Complementizer Phrase (CP) and Determiner Phrase (DP) semantics, syntax, and morphology--including similarities in case-assignment, subject-verb and possessor-possessum agreement, subject and possessor semantics, and overall syntactic structure--in first language acquisition. Applying theoretical…

  15. A novel approach combining self-organizing map and parallel factor analysis for monitoring water quality of watersheds under non-point source pollution

    PubMed Central

    Zhang, Yixiang; Liang, Xinqiang; Wang, Zhibo; Xu, Lixian

    2015-01-01

    High content of organic matter in the downstream of watersheds underscored the severity of non-point source (NPS) pollution. The major objectives of this study were to characterize and quantify dissolved organic matter (DOM) in watersheds affected by NPS pollution, and to apply self-organizing map (SOM) and parallel factor analysis (PARAFAC) to assess fluorescence properties as proxy indicators for NPS pollution and labor-intensive routine water quality indicators. Water from upstreams and downstreams was sampled to measure dissolved organic carbon (DOC) concentrations and excitation-emission matrix (EEM). Five fluorescence components were modeled with PARAFAC. The regression analysis between PARAFAC intensities (Fmax) and raw EEM measurements indicated that several raw fluorescence measurements at target excitation-emission wavelength region could provide similar DOM information to massive EEM measurements combined with PARAFAC. Regression analysis between DOC concentration and raw EEM measurements suggested that some regions in raw EEM could be used as surrogates for labor-intensive routine indicators. SOM can be used to visualize the occurrence of pollution. Relationship between DOC concentration and PARAFAC components analyzed with SOM suggested that PARAFAC component 2 might be the major part of bulk DOC and could be recognized as a proxy indicator to predict the DOC concentration. PMID:26526140

  16. Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.

    Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patternsmore » (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.« less

  17. Parallel heuristics for scalable community detection

    DOE PAGES

    Lu, Hao; Halappanavar, Mahantesh; Kalyanaraman, Ananth

    2015-08-14

    Community detection has become a fundamental operation in numerous graph-theoretic applications. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method ismore » also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains. Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number or fewer iterations, while providing real speedups of up to 16x using 32 threads.« less

  18. Simultaneous digital quantification and fluorescence-based size characterization of massively parallel sequencing libraries.

    PubMed

    Laurie, Matthew T; Bertout, Jessica A; Taylor, Sean D; Burton, Joshua N; Shendure, Jay A; Bielas, Jason H

    2013-08-01

    Due to the high cost of failed runs and suboptimal data yields, quantification and determination of fragment size range are crucial steps in the library preparation process for massively parallel sequencing (or next-generation sequencing). Current library quality control methods commonly involve quantification using real-time quantitative PCR and size determination using gel or capillary electrophoresis. These methods are laborious and subject to a number of significant limitations that can make library calibration unreliable. Herein, we propose and test an alternative method for quality control of sequencing libraries using droplet digital PCR (ddPCR). By exploiting a correlation we have discovered between droplet fluorescence and amplicon size, we achieve the joint quantification and size determination of target DNA with a single ddPCR assay. We demonstrate the accuracy and precision of applying this method to the preparation of sequencing libraries.

  19. Hierarchical Parallelism in Finite Difference Analysis of Heat Conduction

    NASA Technical Reports Server (NTRS)

    Padovan, Joseph; Krishna, Lala; Gute, Douglas

    1997-01-01

    Based on the concept of hierarchical parallelism, this research effort resulted in highly efficient parallel solution strategies for very large scale heat conduction problems. Overall, the method of hierarchical parallelism involves the partitioning of thermal models into several substructured levels wherein an optimal balance into various associated bandwidths is achieved. The details are described in this report. Overall, the report is organized into two parts. Part 1 describes the parallel modelling methodology and associated multilevel direct, iterative and mixed solution schemes. Part 2 establishes both the formal and computational properties of the scheme.

  20. Image Processing Using a Parallel Architecture.

    DTIC Science & Technology

    1987-12-01

    ENG/87D-25 Abstract This study developed a set o± low level image processing tools on a parallel computer that allows concurrent processing of images...environment, the set of tools offers a significant reduction in the time required to perform some commonly used image processing operations. vI IMAGE...step toward developing these systems, a structured set of image processing tools was implemented using a parallel computer. More important than