Iterative algorithms for large sparse linear systems on parallel computers
NASA Technical Reports Server (NTRS)
Adams, L. M.
1982-01-01
Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Zhao, Jing; Zong, Haili
2018-01-01
In this paper, we propose parallel and cyclic iterative algorithms for solving the multiple-set split equality common fixed-point problem of firmly quasi-nonexpansive operators. We also combine the process of cyclic and parallel iterative methods and propose two mixed iterative algorithms. Our several algorithms do not need any prior information about the operator norms. Under mild assumptions, we prove weak convergence of the proposed iterative sequences in Hilbert spaces. As applications, we obtain several iterative algorithms to solve the multiple-set split equality problem.
Parallel conjugate gradient algorithms for manipulator dynamic simulation
NASA Technical Reports Server (NTRS)
Fijany, Amir; Scheld, Robert E.
1989-01-01
Parallel conjugate gradient algorithms for the computation of multibody dynamics are developed for the specialized case of a robot manipulator. For an n-dimensional positive-definite linear system, the Classical Conjugate Gradient (CCG) algorithms are guaranteed to converge in n iterations, each with a computation cost of O(n); this leads to a total computational cost of O(n sq) on a serial processor. A conjugate gradient algorithms is presented that provide greater efficiency using a preconditioner, which reduces the number of iterations required, and by exploiting parallelism, which reduces the cost of each iteration. Two Preconditioned Conjugate Gradient (PCG) algorithms are proposed which respectively use a diagonal and a tridiagonal matrix, composed of the diagonal and tridiagonal elements of the mass matrix, as preconditioners. Parallel algorithms are developed to compute the preconditioners and their inversions in O(log sub 2 n) steps using n processors. A parallel algorithm is also presented which, on the same architecture, achieves the computational time of O(log sub 2 n) for each iteration. Simulation results for a seven degree-of-freedom manipulator are presented. Variants of the proposed algorithms are also developed which can be efficiently implemented on the Robot Mathematics Processor (RMP).
NASA Astrophysics Data System (ADS)
Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh
2015-07-01
This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.
NASA Astrophysics Data System (ADS)
Qin, Cheng-Zhi; Zhan, Lijun
2012-06-01
As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations
NASA Technical Reports Server (NTRS)
Venter, Gerhard; Sobieszczanski-Sobieski, Jaroslaw
2005-01-01
A parallel Particle Swarm Optimization (PSO) algorithm is presented. Particle swarm optimization is a fairly recent addition to the family of non-gradient based, probabilistic search algorithms that is based on a simplified social model and is closely tied to swarming theory. Although PSO algorithms present several attractive properties to the designer, they are plagued by high computational cost as measured by elapsed time. One approach to reduce the elapsed time is to make use of coarse-grained parallelization to evaluate the design points. Previous parallel PSO algorithms were mostly implemented in a synchronous manner, where all design points within a design iteration are evaluated before the next iteration is started. This approach leads to poor parallel speedup in cases where a heterogeneous parallel environment is used and/or where the analysis time depends on the design point being analyzed. This paper introduces an asynchronous parallel PSO algorithm that greatly improves the parallel e ciency. The asynchronous algorithm is benchmarked on a cluster assembled of Apple Macintosh G5 desktop computers, using the multi-disciplinary optimization of a typical transport aircraft wing as an example.
Soft-output decoding algorithms in iterative decoding of turbo codes
NASA Technical Reports Server (NTRS)
Benedetto, S.; Montorsi, G.; Divsalar, D.; Pollara, F.
1996-01-01
In this article, we present two versions of a simplified maximum a posteriori decoding algorithm. The algorithms work in a sliding window form, like the Viterbi algorithm, and can thus be used to decode continuously transmitted sequences obtained by parallel concatenated codes, without requiring code trellis termination. A heuristic explanation is also given of how to embed the maximum a posteriori algorithms into the iterative decoding of parallel concatenated codes (turbo codes). The performances of the two algorithms are compared on the basis of a powerful rate 1/3 parallel concatenated code. Basic circuits to implement the simplified a posteriori decoding algorithm using lookup tables, and two further approximations (linear and threshold), with a very small penalty, to eliminate the need for lookup tables are proposed.
An iterative method for systems of nonlinear hyperbolic equations
NASA Technical Reports Server (NTRS)
Scroggs, Jeffrey S.
1989-01-01
An iterative algorithm for the efficient solution of systems of nonlinear hyperbolic equations is presented. Parallelism is evident at several levels. In the formation of the iteration, the equations are decoupled, thereby providing large grain parallelism. Parallelism may also be exploited within the solves for each equation. Convergence of the interation is established via a bounding function argument. Experimental results in two-dimensions are presented.
Comparison of multihardware parallel implementations for a phase unwrapping algorithm
NASA Astrophysics Data System (ADS)
Hernandez-Lopez, Francisco Javier; Rivera, Mariano; Salazar-Garibay, Adan; Legarda-Sáenz, Ricardo
2018-04-01
Phase unwrapping is an important problem in the areas of optical metrology, synthetic aperture radar (SAR) image analysis, and magnetic resonance imaging (MRI) analysis. These images are becoming larger in size and, particularly, the availability and need for processing of SAR and MRI data have increased significantly with the acquisition of remote sensing data and the popularization of magnetic resonators in clinical diagnosis. Therefore, it is important to develop faster and accurate phase unwrapping algorithms. We propose a parallel multigrid algorithm of a phase unwrapping method named accumulation of residual maps, which builds on a serial algorithm that consists of the minimization of a cost function; minimization achieved by means of a serial Gauss-Seidel kind algorithm. Our algorithm also optimizes the original cost function, but unlike the original work, our algorithm is a parallel Jacobi class with alternated minimizations. This strategy is known as the chessboard type, where red pixels can be updated in parallel at same iteration since they are independent. Similarly, black pixels can be updated in parallel in an alternating iteration. We present parallel implementations of our algorithm for different parallel multicore architecture such as CPU-multicore, Xeon Phi coprocessor, and Nvidia graphics processing unit. In all the cases, we obtain a superior performance of our parallel algorithm when compared with the original serial version. In addition, we present a detailed comparative performance of the developed parallel versions.
Learning control system design based on 2-D theory - An application to parallel link manipulator
NASA Technical Reports Server (NTRS)
Geng, Z.; Carroll, R. L.; Lee, J. D.; Haynes, L. H.
1990-01-01
An approach to iterative learning control system design based on two-dimensional system theory is presented. A two-dimensional model for the iterative learning control system which reveals the connections between learning control systems and two-dimensional system theory is established. A learning control algorithm is proposed, and the convergence of learning using this algorithm is guaranteed by two-dimensional stability. The learning algorithm is applied successfully to the trajectory tracking control problem for a parallel link robot manipulator. The excellent performance of this learning algorithm is demonstrated by the computer simulation results.
Lee, Jae H.; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T.; Seo, Youngho
2014-01-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting. PMID:27081299
Lee, Jae H; Yao, Yushu; Shrestha, Uttam; Gullberg, Grant T; Seo, Youngho
2014-11-01
The primary goal of this project is to implement the iterative statistical image reconstruction algorithm, in this case maximum likelihood expectation maximum (MLEM) used for dynamic cardiac single photon emission computed tomography, on Spark/GraphX. This involves porting the algorithm to run on large-scale parallel computing systems. Spark is an easy-to- program software platform that can handle large amounts of data in parallel. GraphX is a graph analytic system running on top of Spark to handle graph and sparse linear algebra operations in parallel. The main advantage of implementing MLEM algorithm in Spark/GraphX is that it allows users to parallelize such computation without any expertise in parallel computing or prior knowledge in computer science. In this paper we demonstrate a successful implementation of MLEM in Spark/GraphX and present the performance gains with the goal to eventually make it useable in clinical setting.
Matching pursuit parallel decomposition of seismic data
NASA Astrophysics Data System (ADS)
Li, Chuanhui; Zhang, Fanchang
2017-07-01
In order to improve the computation speed of matching pursuit decomposition of seismic data, a matching pursuit parallel algorithm is designed in this paper. We pick a fixed number of envelope peaks from the current signal in every iteration according to the number of compute nodes and assign them to the compute nodes on average to search the optimal Morlet wavelets in parallel. With the help of parallel computer systems and Message Passing Interface, the parallel algorithm gives full play to the advantages of parallel computing to significantly improve the computation speed of the matching pursuit decomposition and also has good expandability. Besides, searching only one optimal Morlet wavelet by every compute node in every iteration is the most efficient implementation.
NASA Astrophysics Data System (ADS)
Wichert, Viktoria; Arkenberg, Mario; Hauschildt, Peter H.
2016-10-01
Highly resolved state-of-the-art 3D atmosphere simulations will remain computationally extremely expensive for years to come. In addition to the need for more computing power, rethinking coding practices is necessary. We take a dual approach by introducing especially adapted, parallel numerical methods and correspondingly parallelizing critical code passages. In the following, we present our respective work on PHOENIX/3D. With new parallel numerical algorithms, there is a big opportunity for improvement when iteratively solving the system of equations emerging from the operator splitting of the radiative transfer equation J = ΛS. The narrow-banded approximate Λ-operator Λ* , which is used in PHOENIX/3D, occurs in each iteration step. By implementing a numerical algorithm which takes advantage of its characteristic traits, the parallel code's efficiency is further increased and a speed-up in computational time can be achieved.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
Desai, Ajit; Khalil, Mohammad; Pettit, Chris; ...
2017-09-21
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Desai, Ajit; Khalil, Mohammad; Pettit, Chris
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolutionmore » in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.« less
Solution of a tridiagonal system of equations on the finite element machine
NASA Technical Reports Server (NTRS)
Bostic, S. W.
1984-01-01
Two parallel algorithms for the solution of tridiagonal systems of equations were implemented on the Finite Element Machine. The Accelerated Parallel Gauss method, an iterative method, and the Buneman algorithm, a direct method, are discussed and execution statistics are presented.
NASA Astrophysics Data System (ADS)
Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.
2017-12-01
This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data.
Liang, Faming; Kim, Jinsu; Song, Qifan
2016-01-01
Markov chain Monte Carlo (MCMC) methods have proven to be a very powerful tool for analyzing data of complex structures. However, their computer-intensive nature, which typically require a large number of iterations and a complete scan of the full dataset for each iteration, precludes their use for big data analysis. In this paper, we propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis; that is to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset in iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be generally more efficient as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems. This is illustrated in the paper by the tempering BMH algorithm, which can be viewed as a combination of parallel tempering and the BMH algorithm. BMH can also be used for model selection and optimization by combining with reversible jump MCMC and simulated annealing, respectively.
A Bootstrap Metropolis–Hastings Algorithm for Bayesian Analysis of Big Data
Kim, Jinsu; Song, Qifan
2016-01-01
Markov chain Monte Carlo (MCMC) methods have proven to be a very powerful tool for analyzing data of complex structures. However, their computer-intensive nature, which typically require a large number of iterations and a complete scan of the full dataset for each iteration, precludes their use for big data analysis. In this paper, we propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis; that is to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset in iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be generally more efficient as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems. This is illustrated in the paper by the tempering BMH algorithm, which can be viewed as a combination of parallel tempering and the BMH algorithm. BMH can also be used for model selection and optimization by combining with reversible jump MCMC and simulated annealing, respectively. PMID:29033469
Communications oriented programming of parallel iterative solutions of sparse linear systems
NASA Technical Reports Server (NTRS)
Patrick, M. L.; Pratt, T. W.
1986-01-01
Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.
Organizing Compression of Hyperspectral Imagery to Allow Efficient Parallel Decompression
NASA Technical Reports Server (NTRS)
Klimesh, Matthew A.; Kiely, Aaron B.
2014-01-01
family of schemes has been devised for organizing the output of an algorithm for predictive data compression of hyperspectral imagery so as to allow efficient parallelization in both the compressor and decompressor. In these schemes, the compressor performs a number of iterations, during each of which a portion of the data is compressed via parallel threads operating on independent portions of the data. The general idea is that for each iteration it is predetermined how much compressed data will be produced from each thread.
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithmmore » is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.« less
Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
2014-01-01
It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
Parallelized implicit propagators for the finite-difference Schrödinger equation
NASA Astrophysics Data System (ADS)
Parker, Jonathan; Taylor, K. T.
1995-08-01
We describe the application of block Gauss-Seidel and block Jacobi iterative methods to the design of implicit propagators for finite-difference models of the time-dependent Schrödinger equation. The block-wise iterative methods discussed here are mixed direct-iterative methods for solving simultaneous equations, in the sense that direct methods (e.g. LU decomposition) are used to invert certain block sub-matrices, and iterative methods are used to complete the solution. We describe parallel variants of the basic algorithm that are well suited to the medium- to coarse-grained parallelism of work-station clusters, and MIMD supercomputers, and we show that under a wide range of conditions, fine-grained parallelism of the computation can be achieved. Numerical tests are conducted on a typical one-electron atom Hamiltonian. The methods converge robustly to machine precision (15 significant figures), in some cases in as few as 6 or 7 iterations. The rate of convergence is nearly independent of the finite-difference grid-point separations.
A parallel variable metric optimization algorithm
NASA Technical Reports Server (NTRS)
Straeter, T. A.
1973-01-01
An algorithm, designed to exploit the parallel computing or vector streaming (pipeline) capabilities of computers is presented. When p is the degree of parallelism, then one cycle of the parallel variable metric algorithm is defined as follows: first, the function and its gradient are computed in parallel at p different values of the independent variable; then the metric is modified by p rank-one corrections; and finally, a single univariant minimization is carried out in the Newton-like direction. Several properties of this algorithm are established. The convergence of the iterates to the solution is proved for a quadratic functional on a real separable Hilbert space. For a finite-dimensional space the convergence is in one cycle when p equals the dimension of the space. Results of numerical experiments indicate that the new algorithm will exploit parallel or pipeline computing capabilities to effect faster convergence than serial techniques.
Parallel processing in finite element structural analysis
NASA Technical Reports Server (NTRS)
Noor, Ahmed K.
1987-01-01
A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Sankar, Lakshmi N.; Hixon, Duane
1991-01-01
Efficient iterative solution methods are being developed for the numerical solution of two- and three-dimensional compressible Navier-Stokes equations. Iterative time marching methods have several advantages over classical multi-step explicit time marching schemes, and non-iterative implicit time marching schemes. Iterative schemes have better stability characteristics than non-iterative explicit and implicit schemes. Thus, the extra work required by iterative schemes can also be designed to perform efficiently on current and future generation scalable, missively parallel machines. An obvious candidate for iteratively solving the system of coupled nonlinear algebraic equations arising in CFD applications is the Newton method. Newton's method was implemented in existing finite difference and finite volume methods. Depending on the complexity of the problem, the number of Newton iterations needed per step to solve the discretized system of equations can, however, vary dramatically from a few to several hundred. Another popular approach based on the classical conjugate gradient method, known as the GMRES (Generalized Minimum Residual) algorithm is investigated. The GMRES algorithm was used in the past by a number of researchers for solving steady viscous and inviscid flow problems with considerable success. Here, the suitability of this algorithm is investigated for solving the system of nonlinear equations that arise in unsteady Navier-Stokes solvers at each time step. Unlike the Newton method which attempts to drive the error in the solution at each and every node down to zero, the GMRES algorithm only seeks to minimize the L2 norm of the error. In the GMRES algorithm the changes in the flow properties from one time step to the next are assumed to be the sum of a set of orthogonal vectors. By choosing the number of vectors to a reasonably small value N (between 5 and 20) the work required for advancing the solution from one time step to the next may be kept to (N+1) times that of a noniterative scheme. Many of the operations required by the GMRES algorithm such as matrix-vector multiplies, matrix additions and subtractions can all be vectorized and parallelized efficiently.
Unified Lambert Tool for Massively Parallel Applications in Space Situational Awareness
NASA Astrophysics Data System (ADS)
Woollands, Robyn M.; Read, Julie; Hernandez, Kevin; Probe, Austin; Junkins, John L.
2018-03-01
This paper introduces a parallel-compiled tool that combines several of our recently developed methods for solving the perturbed Lambert problem using modified Chebyshev-Picard iteration. This tool (unified Lambert tool) consists of four individual algorithms, each of which is unique and better suited for solving a particular type of orbit transfer. The first is a Keplerian Lambert solver, which is used to provide a good initial guess (warm start) for solving the perturbed problem. It is also used to determine the appropriate algorithm to call for solving the perturbed problem. The arc length or true anomaly angle spanned by the transfer trajectory is the parameter that governs the automated selection of the appropriate perturbed algorithm, and is based on the respective algorithm convergence characteristics. The second algorithm solves the perturbed Lambert problem using the modified Chebyshev-Picard iteration two-point boundary value solver. This algorithm does not require a Newton-like shooting method and is the most efficient of the perturbed solvers presented herein, however the domain of convergence is limited to about a third of an orbit and is dependent on eccentricity. The third algorithm extends the domain of convergence of the modified Chebyshev-Picard iteration two-point boundary value solver to about 90% of an orbit, through regularization with the Kustaanheimo-Stiefel transformation. This is the second most efficient of the perturbed set of algorithms. The fourth algorithm uses the method of particular solutions and the modified Chebyshev-Picard iteration initial value solver for solving multiple revolution perturbed transfers. This method does require "shooting" but differs from Newton-like shooting methods in that it does not require propagation of a state transition matrix. The unified Lambert tool makes use of the General Mission Analysis Tool and we use it to compute thousands of perturbed Lambert trajectories in parallel on the Space Situational Awareness computer cluster at the LASR Lab, Texas A&M University. We demonstrate the power of our tool by solving a highly parallel example problem, that is the generation of extremal field maps for optimal spacecraft rendezvous (and eventual orbit debris removal). In addition we demonstrate the need for including perturbative effects in simulations for satellite tracking or data association. The unified Lambert tool is ideal for but not limited to space situational awareness applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moryakov, A. V., E-mail: sailor@orc.ru
2016-12-15
An algorithm for solving the linear Cauchy problem for large systems of ordinary differential equations is presented. The algorithm for systems of first-order differential equations is implemented in the EDELWEISS code with the possibility of parallel computations on supercomputers employing the MPI (Message Passing Interface) standard for the data exchange between parallel processes. The solution is represented by a series of orthogonal polynomials on the interval [0, 1]. The algorithm is characterized by simplicity and the possibility to solve nonlinear problems with a correction of the operator in accordance with the solution obtained in the previous iterative process.
Parallel AFSA algorithm accelerating based on MIC architecture
NASA Astrophysics Data System (ADS)
Zhou, Junhao; Xiao, Hong; Huang, Yifan; Li, Yongzhao; Xu, Yuanrui
2017-05-01
Analysis AFSA past for solving the traveling salesman problem, the algorithm efficiency is often a big problem, and the algorithm processing method, it does not fully responsive to the characteristics of the traveling salesman problem to deal with, and therefore proposes a parallel join improved AFSA process. The simulation with the current TSP known optimal solutions were analyzed, the results showed that the AFSA iterations improved less, on the MIC cards doubled operating efficiency, efficiency significantly.
Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil
2012-01-01
The Gauss-Seidel method is a standard iterative numerical method widely used to solve a system of equations and, in general, is more efficient comparing to other iterative methods, such as the Jacobi method. However, standard implementation of the Gauss-Seidel method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., in current iteration) as soon as they are available. Here we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of CPUs. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite difference Poisson-Boltzmann equation solver to model electrostatics in molecular biology. This development makes the iterative procedure on obtaining the electrostatic potential distribution in the parallelized DelPhi several folds faster than that in the serial code. Further we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. PMID:22674480
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thomquist, Heidi K.; Fixel, Deborah A.; Fett, David Brian
The Xyce Parallel Electronic Simulator simulates electronic circuit behavior in DC, AC, HB, MPDE and transient mode using standard analog (DAE) and/or device (PDE) device models including several age and radiation aware devices. It supports a variety of computing platforms (both serial and parallel) computers. Lastly, it uses a variety of modern solution algorithms dynamic parallel load-balancing and iterative solvers.
HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kannan, Ramakrishnan; Sukumar, Sreenivas R.; Ballard, Grey M.
NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems formore » $$\\WW$$ and $$\\HH$$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementation, our algorithm is also flexible: It performs well for both dense and sparse matrices, and allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors $$\\WW$$ and $$\\HH$$ within the alternating iterations.« less
NASA Technical Reports Server (NTRS)
Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)
1990-01-01
Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
Parallelization of a blind deconvolution algorithm
NASA Astrophysics Data System (ADS)
Matson, Charles L.; Borelli, Kathy J.
2006-09-01
Often it is of interest to deblur imagery in order to obtain higher-resolution images. Deblurring requires knowledge of the blurring function - information that is often not available separately from the blurred imagery. Blind deconvolution algorithms overcome this problem by jointly estimating both the high-resolution image and the blurring function from the blurred imagery. Because blind deconvolution algorithms are iterative in nature, they can take minutes to days to deblur an image depending how many frames of data are used for the deblurring and the platforms on which the algorithms are executed. Here we present our progress in parallelizing a blind deconvolution algorithm to increase its execution speed. This progress includes sub-frame parallelization and a code structure that is not specialized to a specific computer hardware architecture.
FPGA implementation of low complexity LDPC iterative decoder
NASA Astrophysics Data System (ADS)
Verma, Shivani; Sharma, Sanjay
2016-07-01
Low-density parity-check (LDPC) codes, proposed by Gallager, emerged as a class of codes which can yield very good performance on the additive white Gaussian noise channel as well as on the binary symmetric channel. LDPC codes have gained lots of importance due to their capacity achieving property and excellent performance in the noisy channel. Belief propagation (BP) algorithm and its approximations, most notably min-sum, are popular iterative decoding algorithms used for LDPC and turbo codes. The trade-off between the hardware complexity and the decoding throughput is a critical factor in the implementation of the practical decoder. This article presents introduction to LDPC codes and its various decoding algorithms followed by realisation of LDPC decoder by using simplified message passing algorithm and partially parallel decoder architecture. Simplified message passing algorithm has been proposed for trade-off between low decoding complexity and decoder performance. It greatly reduces the routing and check node complexity of the decoder. Partially parallel decoder architecture possesses high speed and reduced complexity. The improved design of the decoder possesses a maximum symbol throughput of 92.95 Mbps and a maximum of 18 decoding iterations. The article presents implementation of 9216 bits, rate-1/2, (3, 6) LDPC decoder on Xilinx XC3D3400A device from Spartan-3A DSP family.
Methodes iteratives paralleles: Applications en neutronique et en mecanique des fluides
NASA Astrophysics Data System (ADS)
Qaddouri, Abdessamad
Dans cette these, le calcul parallele est applique successivement a la neutronique et a la mecanique des fluides. Dans chacune de ces deux applications, des methodes iteratives sont utilisees pour resoudre le systeme d'equations algebriques resultant de la discretisation des equations du probleme physique. Dans le probleme de neutronique, le calcul des matrices des probabilites de collision (PC) ainsi qu'un schema iteratif multigroupe utilisant une methode inverse de puissance sont parallelises. Dans le probleme de mecanique des fluides, un code d'elements finis utilisant un algorithme iteratif du type GMRES preconditionne est parallelise. Cette these est presentee sous forme de six articles suivis d'une conclusion. Les cinq premiers articles traitent des applications en neutronique, articles qui representent l'evolution de notre travail dans ce domaine. Cette evolution passe par un calcul parallele des matrices des PC et un algorithme multigroupe parallele teste sur un probleme unidimensionnel (article 1), puis par deux algorithmes paralleles l'un mutiregion l'autre multigroupe, testes sur des problemes bidimensionnels (articles 2--3). Ces deux premieres etapes sont suivies par l'application de deux techniques d'acceleration, le rebalancement neutronique et la minimisation du residu aux deux algorithmes paralleles (article 4). Finalement, on a mis en oeuvre l'algorithme multigroupe et le calcul parallele des matrices des PC sur un code de production DRAGON ou les tests sont plus realistes et peuvent etre tridimensionnels (article 5). Le sixieme article (article 6), consacre a l'application a la mecanique des fluides, traite la parallelisation d'un code d'elements finis FES ou le partitionneur de graphe METIS et la librairie PSPARSLIB sont utilises.
Fast Time and Space Parallel Algorithms for Solution of Parabolic Partial Differential Equations
NASA Technical Reports Server (NTRS)
Fijany, Amir
1993-01-01
In this paper, fast time- and Space -Parallel agorithms for solution of linear parabolic PDEs are developed. It is shown that the seemingly strictly serial iterations of the time-stepping procedure for solution of the problem can be completed decoupled.
Computational Issues in Damping Identification for Large Scale Problems
NASA Technical Reports Server (NTRS)
Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.
1997-01-01
Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can he seen for problems of 100+ degrees-of-freedom.
Hielscher, Andreas H; Bartel, Sebastian
2004-02-01
Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.
NASA Astrophysics Data System (ADS)
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; Ng, Esmond G.; Maris, Pieter; Vary, James P.
2018-01-01
We describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. The use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. We also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.
New Parallel Algorithms for Structural Analysis and Design of Aerospace Structures
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1998-01-01
Subspace and Lanczos iterations have been developed, well documented, and widely accepted as efficient methods for obtaining p-lowest eigen-pair solutions of large-scale, practical engineering problems. The focus of this paper is to incorporate recent developments in vectorized sparse technologies in conjunction with Subspace and Lanczos iterative algorithms for computational enhancements. Numerical performance, in terms of accuracy and efficiency of the proposed sparse strategies for Subspace and Lanczos algorithm, is demonstrated by solving for the lowest frequencies and mode shapes of structural problems on the IBM-R6000/590 and SunSparc 20 workstations.
Iterative Importance Sampling Algorithms for Parameter Estimation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grout, Ray W; Morzfeld, Matthias; Day, Marcus S.
In parameter estimation problems one computes a posterior distribution over uncertain parameters defined jointly by a prior distribution, a model, and noisy data. Markov chain Monte Carlo (MCMC) is often used for the numerical solution of such problems. An alternative to MCMC is importance sampling, which can exhibit near perfect scaling with the number of cores on high performance computing systems because samples are drawn independently. However, finding a suitable proposal distribution is a challenging task. Several sampling algorithms have been proposed over the past years that take an iterative approach to constructing a proposal distribution. We investigate the applicabilitymore » of such algorithms by applying them to two realistic and challenging test problems, one in subsurface flow, and one in combustion modeling. More specifically, we implement importance sampling algorithms that iterate over the mean and covariance matrix of Gaussian or multivariate t-proposal distributions. Our implementation leverages massively parallel computers, and we present strategies to initialize the iterations using 'coarse' MCMC runs or Gaussian mixture models.« less
NASA Technical Reports Server (NTRS)
Tilton, James C.
1988-01-01
Image segmentation can be a key step in data compression and image analysis. However, the segmentation results produced by most previous approaches to region growing are suspect because they depend on the order in which portions of the image are processed. An iterative parallel segmentation algorithm avoids this problem by performing globally best merges first. Such a segmentation approach, and two implementations of the approach on NASA's Massively Parallel Processor (MPP) are described. Application of the segmentation approach to data compression and image analysis is then described, and results of such application are given for a LANDSAT Thematic Mapper image.
Marine Controlled-Source Electromagnetic 2D Inversion for synthetic models.
NASA Astrophysics Data System (ADS)
Liu, Y.; Li, Y.
2016-12-01
We present a 2D inverse algorithm for frequency domain marine controlled-source electromagnetic (CSEM) data, which is based on the regularized Gauss-Newton approach. As a forward solver, our parallel adaptive finite element forward modeling program is employed. It is a self-adaptive, goal-oriented grid refinement algorithm in which a finite element analysis is performed on a sequence of refined meshes. The mesh refinement process is guided by a dual error estimate weighting to bias refinement towards elements that affect the solution at the EM receiver locations. With the use of the direct solver (MUMPS), we can effectively compute the electromagnetic fields for multi-sources and parametric sensitivities. We also implement the parallel data domain decomposition approach of Key and Ovall (2011), with the goal of being able to compute accurate responses in parallel for complicated models and a full suite of data parameters typical of offshore CSEM surveys. All minimizations are carried out by using the Gauss-Newton algorithm and model perturbations at each iteration step are obtained by using the Inexact Conjugate Gradient iteration method. Synthetic test inversions are presented.
Summer Proceedings 2016: The Center for Computing Research at Sandia National Laboratories
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carleton, James Brian; Parks, Michael L.
Solving sparse linear systems from the discretization of elliptic partial differential equations (PDEs) is an important building block in many engineering applications. Sparse direct solvers can solve general linear systems, but are usually slower and use much more memory than effective iterative solvers. To overcome these two disadvantages, a hierarchical solver (LoRaSp) based on H2-matrices was introduced in [22]. Here, we have developed a parallel version of the algorithm in LoRaSp to solve large sparse matrices on distributed memory machines. On a single processor, the factorization time of our parallel solver scales almost linearly with the problem size for three-dimensionalmore » problems, as opposed to the quadratic scalability of many existing sparse direct solvers. Moreover, our solver leads to almost constant numbers of iterations, when used as a preconditioner for Poisson problems. On more than one processor, our algorithm has significant speedups compared to sequential runs. With this parallel algorithm, we are able to solve large problems much faster than many existing packages as demonstrated by the numerical experiments.« less
Using parallel banded linear system solvers in generalized eigenvalue problems
NASA Technical Reports Server (NTRS)
Zhang, Hong; Moss, William F.
1993-01-01
Subspace iteration is a reliable and cost effective method for solving positive definite banded symmetric generalized eigenproblems, especially in the case of large scale problems. This paper discusses an algorithm that makes use of two parallel banded solvers in subspace iteration. A shift is introduced to decompose the banded linear systems into relatively independent subsystems and to accelerate the iterations. With this shift, an eigenproblem is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained for parallel implementations. An optimal shift is a shift that balances total computation and communication costs. Under certain conditions, we show how to estimate an optimal shift analytically using the decay rate for the inverse of a banded matrix, and how to improve this estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented.
Block iterative restoration of astronomical images with the massively parallel processor
NASA Technical Reports Server (NTRS)
Heap, Sara R.; Lindler, Don J.
1987-01-01
A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
Iterative image reconstruction for PROPELLER-MRI using the nonuniform fast fourier transform.
Tamhane, Ashish A; Anastasio, Mark A; Gui, Minzhi; Arfanakis, Konstantinos
2010-07-01
To investigate an iterative image reconstruction algorithm using the nonuniform fast Fourier transform (NUFFT) for PROPELLER (Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction) MRI. Numerical simulations, as well as experiments on a phantom and a healthy human subject were used to evaluate the performance of the iterative image reconstruction algorithm for PROPELLER, and compare it with that of conventional gridding. The trade-off between spatial resolution, signal to noise ratio, and image artifacts, was investigated for different values of the regularization parameter. The performance of the iterative image reconstruction algorithm in the presence of motion was also evaluated. It was demonstrated that, for a certain range of values of the regularization parameter, iterative reconstruction produced images with significantly increased signal to noise ratio, reduced artifacts, for similar spatial resolution, compared with gridding. Furthermore, the ability to reduce the effects of motion in PROPELLER-MRI was maintained when using the iterative reconstruction approach. An iterative image reconstruction technique based on the NUFFT was investigated for PROPELLER MRI. For a certain range of values of the regularization parameter, the new reconstruction technique may provide PROPELLER images with improved image quality compared with conventional gridding. (c) 2010 Wiley-Liss, Inc.
Iterative Image Reconstruction for PROPELLER-MRI using the NonUniform Fast Fourier Transform
Tamhane, Ashish A.; Anastasio, Mark A.; Gui, Minzhi; Arfanakis, Konstantinos
2013-01-01
Purpose To investigate an iterative image reconstruction algorithm using the non-uniform fast Fourier transform (NUFFT) for PROPELLER (Periodically Rotated Overlapping parallEL Lines with Enhanced Reconstruction) MRI. Materials and Methods Numerical simulations, as well as experiments on a phantom and a healthy human subject were used to evaluate the performance of the iterative image reconstruction algorithm for PROPELLER, and compare it to that of conventional gridding. The trade-off between spatial resolution, signal to noise ratio, and image artifacts, was investigated for different values of the regularization parameter. The performance of the iterative image reconstruction algorithm in the presence of motion was also evaluated. Results It was demonstrated that, for a certain range of values of the regularization parameter, iterative reconstruction produced images with significantly increased SNR, reduced artifacts, for similar spatial resolution, compared to gridding. Furthermore, the ability to reduce the effects of motion in PROPELLER-MRI was maintained when using the iterative reconstruction approach. Conclusion An iterative image reconstruction technique based on the NUFFT was investigated for PROPELLER MRI. For a certain range of values of the regularization parameter the new reconstruction technique may provide PROPELLER images with improved image quality compared to conventional gridding. PMID:20578028
Scalable splitting algorithms for big-data interferometric imaging in the SKA era
NASA Astrophysics Data System (ADS)
Onose, Alexandru; Carrillo, Rafael E.; Repetti, Audrey; McEwen, Jason D.; Thiran, Jean-Philippe; Pesquet, Jean-Christophe; Wiaux, Yves
2016-11-01
In the context of next-generation radio telescopes, like the Square Kilometre Array (SKA), the efficient processing of large-scale data sets is extremely important. Convex optimization tasks under the compressive sensing framework have recently emerged and provide both enhanced image reconstruction quality and scalability to increasingly larger data sets. We focus herein mainly on scalability and propose two new convex optimization algorithmic structures able to solve the convex optimization tasks arising in radio-interferometric imaging. They rely on proximal splitting and forward-backward iterations and can be seen, by analogy, with the CLEAN major-minor cycle, as running sophisticated CLEAN-like iterations in parallel in multiple data, prior, and image spaces. Both methods support any convex regularization function, in particular, the well-studied ℓ1 priors promoting image sparsity in an adequate domain. Tailored for big-data, they employ parallel and distributed computations to achieve scalability, in terms of memory and computational requirements. One of them also exploits randomization, over data blocks at each iteration, offering further flexibility. We present simulation results showing the feasibility of the proposed methods as well as their advantages compared to state-of-the-art algorithmic solvers. Our MATLAB code is available online on GitHub.
Extending substructure based iterative solvers to multiple load and repeated analyses
NASA Technical Reports Server (NTRS)
Farhat, Charbel
1993-01-01
Direct solvers currently dominate commercial finite element structural software, but do not scale well in the fine granularity regime targeted by emerging parallel processors. Substructure based iterative solvers--often called also domain decomposition algorithms--lend themselves better to parallel processing, but must overcome several obstacles before earning their place in general purpose structural analysis programs. One such obstacle is the solution of systems with many or repeated right hand sides. Such systems arise, for example, in multiple load static analyses and in implicit linear dynamics computations. Direct solvers are well-suited for these problems because after the system matrix has been factored, the multiple or repeated solutions can be obtained through relatively inexpensive forward and backward substitutions. On the other hand, iterative solvers in general are ill-suited for these problems because they often must restart from scratch for every different right hand side. In this paper, we present a methodology for extending the range of applications of domain decomposition methods to problems with multiple or repeated right hand sides. Basically, we formulate the overall problem as a series of minimization problems over K-orthogonal and supplementary subspaces, and tailor the preconditioned conjugate gradient algorithm to solve them efficiently. The resulting solution method is scalable, whereas direct factorization schemes and forward and backward substitution algorithms are not. We illustrate the proposed methodology with the solution of static and dynamic structural problems, and highlight its potential to outperform forward and backward substitutions on parallel computers. As an example, we show that for a linear structural dynamics problem with 11640 degrees of freedom, every time-step beyond time-step 15 is solved in a single iteration and consumes 1.0 second on a 32 processor iPSC-860 system; for the same problem and the same parallel processor, a pair of forward/backward substitutions at each step consumes 15.0 seconds.
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao; ...
2017-09-14
In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shao, Meiyue; Aktulga, H. Metin; Yang, Chao
In this paper, we describe a number of recently developed techniques for improving the performance of large-scale nuclear configuration interaction calculations on high performance parallel computers. We show the benefit of using a preconditioned block iterative method to replace the Lanczos algorithm that has traditionally been used to perform this type of computation. The rapid convergence of the block iterative method is achieved by a proper choice of starting guesses of the eigenvectors and the construction of an effective preconditioner. These acceleration techniques take advantage of special structure of the nuclear configuration interaction problem which we discuss in detail. Themore » use of a block method also allows us to improve the concurrency of the computation, and take advantage of the memory hierarchy of modern microprocessors to increase the arithmetic intensity of the computation relative to data movement. Finally, we also discuss the implementation details that are critical to achieving high performance on massively parallel multi-core supercomputers, and demonstrate that the new block iterative solver is two to three times faster than the Lanczos based algorithm for problems of moderate sizes on a Cray XC30 system.« less
NASA Technical Reports Server (NTRS)
Strong, James P.
1987-01-01
A local area matching algorithm was developed on the Massively Parallel Processor (MPP). It is an iterative technique that first matches coarse or low resolution areas and at each iteration performs matches of higher resolution. Results so far show that when good matches are possible in the two images, the MPP algorithm matches corresponding areas as well as a human observer. To aid in developing this algorithm, a control or shell program was developed for the MPP that allows interactive experimentation with various parameters and procedures to be used in the matching process. (This would not be possible without the high speed of the MPP). With the system, optimal techniques can be developed for different types of matching problems.
Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation
NASA Technical Reports Server (NTRS)
Cai, Xiao-Chuan; Gropp, William D.; Keyes, David E.; Melvin, Robin G.; Young, David P.
1996-01-01
We study parallel two-level overlapping Schwarz algorithms for solving nonlinear finite element problems, in particular, for the full potential equation of aerodynamics discretized in two dimensions with bilinear elements. The overall algorithm, Newton-Krylov-Schwarz (NKS), employs an inexact finite-difference Newton method and a Krylov space iterative method, with a two-level overlapping Schwarz method as a preconditioner. We demonstrate that NKS, combined with a density upwinding continuation strategy for problems with weak shocks, is robust and, economical for this class of mixed elliptic-hyperbolic nonlinear partial differential equations, with proper specification of several parameters. We study upwinding parameters, inner convergence tolerance, coarse grid density, subdomain overlap, and the level of fill-in in the incomplete factorization, and report their effect on numerical convergence rate, overall execution time, and parallel efficiency on a distributed-memory parallel computer.
Fast iterative censoring CFAR algorithm for ship detection from SAR images
NASA Astrophysics Data System (ADS)
Gu, Dandan; Yue, Hui; Zhang, Yuan; Gao, Pengcheng
2017-11-01
Ship detection is one of the essential techniques for ship recognition from synthetic aperture radar (SAR) images. This paper presents a fast iterative detection procedure to eliminate the influence of target returns on the estimation of local sea clutter distributions for constant false alarm rate (CFAR) detectors. A fast block detector is first employed to extract potential target sub-images; and then, an iterative censoring CFAR algorithm is used to detect ship candidates from each target blocks adaptively and efficiently, where parallel detection is available, and statistical parameters of G0 distribution fitting local sea clutter well can be quickly estimated based on an integral image operator. Experimental results of TerraSAR-X images demonstrate the effectiveness of the proposed technique.
Efficient parallel resolution of the simplified transport equations in mixed-dual formulation
NASA Astrophysics Data System (ADS)
Barrault, M.; Lathuilière, B.; Ramet, P.; Roman, J.
2011-03-01
A reactivity computation consists of computing the highest eigenvalue of a generalized eigenvalue problem, for which an inverse power algorithm is commonly used. Very fine modelizations are difficult to treat for our sequential solver, based on the simplified transport equations, in terms of memory consumption and computational time. A first implementation of a Lagrangian based domain decomposition method brings to a poor parallel efficiency because of an increase in the power iterations [1]. In order to obtain a high parallel efficiency, we improve the parallelization scheme by changing the location of the loop over the subdomains in the overall algorithm and by benefiting from the characteristics of the Raviart-Thomas finite element. The new parallel algorithm still allows us to locally adapt the numerical scheme (mesh, finite element order). However, it can be significantly optimized for the matching grid case. The good behavior of the new parallelization scheme is demonstrated for the matching grid case on several hundreds of nodes for computations based on a pin-by-pin discretization.
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning for balancing computational work in pushing particlesmore » and in grid related work, scalable and accurate discretization algorithms for non-linear Coulomb collisions, and communication-avoiding subcycling technology for pushing particles on both CPUs and GPUs are also utilized to dramatically improve the scalability and time-to-solution, hence enabling the difficult kinetic ITER edge simulation on a present-day leadership class computer.« less
NASA Astrophysics Data System (ADS)
Wu, Ping; Liu, Kai; Zhang, Qian; Xue, Zhenwen; Li, Yongbao; Ning, Nannan; Yang, Xin; Li, Xingde; Tian, Jie
2012-12-01
Liver cancer is one of the most common malignant tumors worldwide. In order to enable the noninvasive detection of small liver tumors in mice, we present a parallel iterative shrinkage (PIS) algorithm for dual-modality tomography. It takes advantage of microcomputed tomography and multiview bioluminescence imaging, providing anatomical structure and bioluminescence intensity information to reconstruct the size and location of tumors. By incorporating prior knowledge of signal sparsity, we associate some mathematical strategies including specific smooth convex approximation, an iterative shrinkage operator, and affine subspace with the PIS method, which guarantees the accuracy, efficiency, and reliability for three-dimensional reconstruction. Then an in vivo experiment on the bead-implanted mouse has been performed to validate the feasibility of this method. The findings indicate that a tiny lesion less than 3 mm in diameter can be localized with a position bias no more than 1 mm the computational efficiency is one to three orders of magnitude faster than the existing algorithms; this approach is robust to the different regularization parameters and the lp norms. Finally, we have applied this algorithm to another in vivo experiment on an HCCLM3 orthotopic xenograft mouse model, which suggests the PIS method holds the promise for practical applications of whole-body cancer detection.
Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael
2012-06-01
We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.
Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael
2012-01-01
We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
Xu, Zheng; Wang, Sheng; Li, Yeqing; Zhu, Feiyun; Huang, Junzhou
2018-02-08
The most recent history of parallel Magnetic Resonance Imaging (pMRI) has in large part been devoted to finding ways to reduce acquisition time. While joint total variation (JTV) regularized model has been demonstrated as a powerful tool in increasing sampling speed for pMRI, however, the major bottleneck is the inefficiency of the optimization method. While all present state-of-the-art optimizations for the JTV model could only reach a sublinear convergence rate, in this paper, we squeeze the performance by proposing a linear-convergent optimization method for the JTV model. The proposed method is based on the Iterative Reweighted Least Squares algorithm. Due to the complexity of the tangled JTV objective, we design a novel preconditioner to further accelerate the proposed method. Extensive experiments demonstrate the superior performance of the proposed algorithm for pMRI regarding both accuracy and efficiency compared with state-of-the-art methods.
High-speed parallel implementation of a modified PBR algorithm on DSP-based EH topology
NASA Astrophysics Data System (ADS)
Rajan, K.; Patnaik, L. M.; Ramakrishna, J.
1997-08-01
Algebraic Reconstruction Technique (ART) is an age-old method used for solving the problem of three-dimensional (3-D) reconstruction from projections in electron microscopy and radiology. In medical applications, direct 3-D reconstruction is at the forefront of investigation. The simultaneous iterative reconstruction technique (SIRT) is an ART-type algorithm with the potential of generating in a few iterations tomographic images of a quality comparable to that of convolution backprojection (CBP) methods. Pixel-based reconstruction (PBR) is similar to SIRT reconstruction, and it has been shown that PBR algorithms give better quality pictures compared to those produced by SIRT algorithms. In this work, we propose a few modifications to the PBR algorithms. The modified algorithms are shown to give better quality pictures compared to PBR algorithms. The PBR algorithm and the modified PBR algorithms are highly compute intensive, Not many attempts have been made to reconstruct objects in the true 3-D sense because of the high computational overhead. In this study, we have developed parallel two-dimensional (2-D) and 3-D reconstruction algorithms based on modified PBR. We attempt to solve the two problems encountered by the PBR and modified PBR algorithms, i.e., the long computational time and the large memory requirements, by parallelizing the algorithm on a multiprocessor system. We investigate the possible task and data partitioning schemes by exploiting the potential parallelism in the PBR algorithm subject to minimizing the memory requirement. We have implemented an extended hypercube (EH) architecture for the high-speed execution of the 3-D reconstruction algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) and dual-port random access memories (DPR) as channels between the PEs. We discuss and compare the performances of the PBR algorithm on an IBM 6000 RISC workstation, on a Silicon Graphics Indigo 2 workstation, and on an EH system. The results show that an EH(3,1) using DSP chips as PEs executes the modified PBR algorithm about 100 times faster than an LBM 6000 RISC workstation. We have executed the algorithms on a 4-node IBM SP2 parallel computer. The results show that execution time of the algorithm on an EH(3,1) is better than that of a 4-node IBM SP2 system. The speed-up of an EH(3,1) system with eight PEs and one network controller is approximately 7.85.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Buluc, Aydn; Pothen, Alex
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
Azad, Ariful; Buluc, Aydn; Pothen, Alex
2016-03-24
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Terascale Optimal PDE Simulations (TOPS) Center
DOE Office of Scientific and Technical Information (OSTI.GOV)
Professor Olof B. Widlund
2007-07-09
Our work has focused on the development and analysis of domain decomposition algorithms for a variety of problems arising in continuum mechanics modeling. In particular, we have extended and analyzed FETI-DP and BDDC algorithms; these iterative solvers were first introduced and studied by Charbel Farhat and his collaborators, see [11, 45, 12], and by Clark Dohrmann of SANDIA, Albuquerque, see [43, 2, 1], respectively. These two closely related families of methods are of particular interest since they are used more extensively than other iterative substructuring methods to solve very large and difficult problems. Thus, the FETI algorithms are part ofmore » the SALINAS system developed by the SANDIA National Laboratories for very large scale computations, and as already noted, BDDC was first developed by a SANDIA scientist, Dr. Clark Dohrmann. The FETI algorithms are also making inroads in commercial engineering software systems. We also note that the analysis of these algorithms poses very real mathematical challenges. The success in developing this theory has, in several instances, led to significant improvements in the performance of these algorithms. A very desirable feature of these iterative substructuring and other domain decomposition algorithms is that they respect the memory hierarchy of modern parallel and distributed computing systems, which is essential for approaching peak floating point performance. The development of improved methods, together with more powerful computer systems, is making it possible to carry out simulations in three dimensions, with quite high resolution, relatively easily. This work is supported by high quality software systems, such as Argonne's PETSc library, which facilitates code development as well as the access to a variety of parallel and distributed computer systems. The success in finding scalable and robust domain decomposition algorithms for very large number of processors and very large finite element problems is, e.g., illustrated in [24, 25, 26]. This work is based on [29, 31]. Our work over these five and half years has, in our opinion, helped advance the knowledge of domain decomposition methods significantly. We see these methods as providing valuable alternatives to other iterative methods, in particular, those based on multi-grid. In our opinion, our accomplishments also match the goals of the TOPS project quite closely.« less
LDRD final report on massively-parallel linear programming : the parPCx system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parekh, Ojas; Phillips, Cynthia Ann; Boman, Erik Gunnar
2005-02-01
This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runsmore » on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Interger and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.« less
Parallel Preconditioning for CFD Problems on the CM-5
NASA Technical Reports Server (NTRS)
Simon, Horst D.; Kremenetsky, Mark D.; Richardson, John; Lasinski, T. A. (Technical Monitor)
1994-01-01
Up to today, preconditioning methods on massively parallel systems have faced a major difficulty. The most successful preconditioning methods in terms of accelerating the convergence of the iterative solver such as incomplete LU factorizations are notoriously difficult to implement on parallel machines for two reasons: (1) the actual computation of the preconditioner is not very floating-point intensive, but requires a large amount of unstructured communication, and (2) the application of the preconditioning matrix in the iteration phase (i.e. triangular solves) are difficult to parallelize because of the recursive nature of the computation. Here we present a new approach to preconditioning for very large, sparse, unsymmetric, linear systems, which avoids both difficulties. We explicitly compute an approximate inverse to our original matrix. This new preconditioning matrix can be applied most efficiently for iterative methods on massively parallel machines, since the preconditioning phase involves only a matrix-vector multiplication, with possibly a dense matrix. Furthermore the actual computation of the preconditioning matrix has natural parallelism. For a problem of size n, the preconditioning matrix can be computed by solving n independent small least squares problems. The algorithm and its implementation on the Connection Machine CM-5 are discussed in detail and supported by extensive timings obtained from real problem data.
NASA Astrophysics Data System (ADS)
Nguyen, An Hung; Guillemette, Thomas; Lambert, Andrew J.; Pickering, Mark R.; Garratt, Matthew A.
2017-09-01
Image registration is a fundamental image processing technique. It is used to spatially align two or more images that have been captured at different times, from different sensors, or from different viewpoints. There have been many algorithms proposed for this task. The most common of these being the well-known Lucas-Kanade (LK) and Horn-Schunck approaches. However, the main limitation of these approaches is the computational complexity required to implement the large number of iterations necessary for successful alignment of the images. Previously, a multi-pass image interpolation algorithm (MP-I2A) was developed to considerably reduce the number of iterations required for successful registration compared with the LK algorithm. This paper develops a kernel-warping algorithm (KWA), a modified version of the MP-I2A, which requires fewer iterations to successfully register two images and less memory space for the field-programmable gate array (FPGA) implementation than the MP-I2A. These reductions increase feasibility of the implementation of the proposed algorithm on FPGAs with very limited memory space and other hardware resources. A two-FPGA system rather than single FPGA system is successfully developed to implement the KWA in order to compensate insufficiency of hardware resources supported by one FPGA, and increase parallel processing ability and scalability of the system.
NASA Astrophysics Data System (ADS)
Fu, Lin; Hu, Xiangyu Y.; Adams, Nikolaus A.
2017-12-01
We propose efficient single-step formulations for reinitialization and extending algorithms, which are critical components of level-set based interface-tracking methods. The level-set field is reinitialized with a single-step (non iterative) "forward tracing" algorithm. A minimum set of cells is defined that describes the interface, and reinitialization employs only data from these cells. Fluid states are extrapolated or extended across the interface by a single-step "backward tracing" algorithm. Both algorithms, which are motivated by analogy to ray-tracing, avoid multiple block-boundary data exchanges that are inevitable for iterative reinitialization and extending approaches within a parallel-computing environment. The single-step algorithms are combined with a multi-resolution conservative sharp-interface method and validated by a wide range of benchmark test cases. We demonstrate that the proposed reinitialization method achieves second-order accuracy in conserving the volume of each phase. The interface location is invariant to reapplication of the single-step reinitialization. Generally, we observe smaller absolute errors than for standard iterative reinitialization on the same grid. The computational efficiency is higher than for the standard and typical high-order iterative reinitialization methods. We observe a 2- to 6-times efficiency improvement over the standard method for serial execution. The proposed single-step extending algorithm, which is commonly employed for assigning data to ghost cells with ghost-fluid or conservative interface interaction methods, shows about 10-times efficiency improvement over the standard method while maintaining same accuracy. Despite their simplicity, the proposed algorithms offer an efficient and robust alternative to iterative reinitialization and extending methods for level-set based multi-phase simulations.
A high data rate universal lattice decoder on FPGA
NASA Astrophysics Data System (ADS)
Ma, Jing; Huang, Xinming; Kura, Swapna
2005-06-01
This paper presents the architecture design of a high data rate universal lattice decoder for MIMO channels on FPGA platform. A phost strategy based lattice decoding algorithm is modified in this paper to reduce the complexity of the closest lattice point search. The data dependency of the improved algorithm is examined and a parallel and pipeline architecture is developed with the iterative decoding function on FPGA and the division intensive channel matrix preprocessing on DSP. Simulation results demonstrate that the improved lattice decoding algorithm provides better bit error rate and less iteration number compared with the original algorithm. The system prototype of the decoder shows that it supports data rate up to 7Mbit/s on a Virtex2-1000 FPGA, which is about 8 times faster than the original algorithm on FPGA platform and two-orders of magnitude better than its implementation on a DSP platform.
Scalable load balancing for massively parallel distributed Monte Carlo particle transport
DOE Office of Scientific and Technical Information (OSTI.GOV)
O'Brien, M. J.; Brantley, P. S.; Joy, K. I.
2013-07-01
In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time 0(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrencemore » Livermore National Laboratory. (authors)« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Qiaofeng; Sawatzky, Alex; Anastasio, Mark A., E-mail: anastasio@wustl.edu
Purpose: The development of iterative image reconstruction algorithms for cone-beam computed tomography (CBCT) remains an active and important research area. Even with hardware acceleration, the overwhelming majority of the available 3D iterative algorithms that implement nonsmooth regularizers remain computationally burdensome and have not been translated for routine use in time-sensitive applications such as image-guided radiation therapy (IGRT). In this work, two variants of the fast iterative shrinkage thresholding algorithm (FISTA) are proposed and investigated for accelerated iterative image reconstruction in CBCT. Methods: Algorithm acceleration was achieved by replacing the original gradient-descent step in the FISTAs by a subproblem that ismore » solved by use of the ordered subset simultaneous algebraic reconstruction technique (OS-SART). Due to the preconditioning matrix adopted in the OS-SART method, two new weighted proximal problems were introduced and corresponding fast gradient projection-type algorithms were developed for solving them. We also provided efficient numerical implementations of the proposed algorithms that exploit the massive data parallelism of multiple graphics processing units. Results: The improved rates of convergence of the proposed algorithms were quantified in computer-simulation studies and by use of clinical projection data corresponding to an IGRT study. The accelerated FISTAs were shown to possess dramatically improved convergence properties as compared to the standard FISTAs. For example, the number of iterations to achieve a specified reconstruction error could be reduced by an order of magnitude. Volumetric images reconstructed from clinical data were produced in under 4 min. Conclusions: The FISTA achieves a quadratic convergence rate and can therefore potentially reduce the number of iterations required to produce an image of a specified image quality as compared to first-order methods. We have proposed and investigated accelerated FISTAs for use with two nonsmooth penalty functions that will lead to further reductions in image reconstruction times while preserving image quality. Moreover, with the help of a mixed sparsity-regularization, better preservation of soft-tissue structures can be potentially obtained. The algorithms were systematically evaluated by use of computer-simulated and clinical data sets.« less
Xu, Qiaofeng; Yang, Deshan; Tan, Jun; Sawatzky, Alex; Anastasio, Mark A
2016-04-01
The development of iterative image reconstruction algorithms for cone-beam computed tomography (CBCT) remains an active and important research area. Even with hardware acceleration, the overwhelming majority of the available 3D iterative algorithms that implement nonsmooth regularizers remain computationally burdensome and have not been translated for routine use in time-sensitive applications such as image-guided radiation therapy (IGRT). In this work, two variants of the fast iterative shrinkage thresholding algorithm (FISTA) are proposed and investigated for accelerated iterative image reconstruction in CBCT. Algorithm acceleration was achieved by replacing the original gradient-descent step in the FISTAs by a subproblem that is solved by use of the ordered subset simultaneous algebraic reconstruction technique (OS-SART). Due to the preconditioning matrix adopted in the OS-SART method, two new weighted proximal problems were introduced and corresponding fast gradient projection-type algorithms were developed for solving them. We also provided efficient numerical implementations of the proposed algorithms that exploit the massive data parallelism of multiple graphics processing units. The improved rates of convergence of the proposed algorithms were quantified in computer-simulation studies and by use of clinical projection data corresponding to an IGRT study. The accelerated FISTAs were shown to possess dramatically improved convergence properties as compared to the standard FISTAs. For example, the number of iterations to achieve a specified reconstruction error could be reduced by an order of magnitude. Volumetric images reconstructed from clinical data were produced in under 4 min. The FISTA achieves a quadratic convergence rate and can therefore potentially reduce the number of iterations required to produce an image of a specified image quality as compared to first-order methods. We have proposed and investigated accelerated FISTAs for use with two nonsmooth penalty functions that will lead to further reductions in image reconstruction times while preserving image quality. Moreover, with the help of a mixed sparsity-regularization, better preservation of soft-tissue structures can be potentially obtained. The algorithms were systematically evaluated by use of computer-simulated and clinical data sets.
Xu, Qiaofeng; Yang, Deshan; Tan, Jun; Sawatzky, Alex; Anastasio, Mark A.
2016-01-01
Purpose: The development of iterative image reconstruction algorithms for cone-beam computed tomography (CBCT) remains an active and important research area. Even with hardware acceleration, the overwhelming majority of the available 3D iterative algorithms that implement nonsmooth regularizers remain computationally burdensome and have not been translated for routine use in time-sensitive applications such as image-guided radiation therapy (IGRT). In this work, two variants of the fast iterative shrinkage thresholding algorithm (FISTA) are proposed and investigated for accelerated iterative image reconstruction in CBCT. Methods: Algorithm acceleration was achieved by replacing the original gradient-descent step in the FISTAs by a subproblem that is solved by use of the ordered subset simultaneous algebraic reconstruction technique (OS-SART). Due to the preconditioning matrix adopted in the OS-SART method, two new weighted proximal problems were introduced and corresponding fast gradient projection-type algorithms were developed for solving them. We also provided efficient numerical implementations of the proposed algorithms that exploit the massive data parallelism of multiple graphics processing units. Results: The improved rates of convergence of the proposed algorithms were quantified in computer-simulation studies and by use of clinical projection data corresponding to an IGRT study. The accelerated FISTAs were shown to possess dramatically improved convergence properties as compared to the standard FISTAs. For example, the number of iterations to achieve a specified reconstruction error could be reduced by an order of magnitude. Volumetric images reconstructed from clinical data were produced in under 4 min. Conclusions: The FISTA achieves a quadratic convergence rate and can therefore potentially reduce the number of iterations required to produce an image of a specified image quality as compared to first-order methods. We have proposed and investigated accelerated FISTAs for use with two nonsmooth penalty functions that will lead to further reductions in image reconstruction times while preserving image quality. Moreover, with the help of a mixed sparsity-regularization, better preservation of soft-tissue structures can be potentially obtained. The algorithms were systematically evaluated by use of computer-simulated and clinical data sets. PMID:27036582
NASA Astrophysics Data System (ADS)
Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.
2017-07-01
Multi-thread programming using OpenMP on the shared-memory architecture with hyperthreading technology allows the resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, its speedup depends on the ability of the processor to execute threads in limited quantities, especially the sequential algorithm which contains a nested loop. The number of the outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique that had been found previously only be applied by the high-level programmer. This paper generates a parallelization procedure for low-level programmer in dealing with 2-level nested loop problems with the maximum number of threads that can be executed by a processor is smaller than the number of the outer loop iterations. Data preprocessing which is related to the number of the outer loop and the inner loop iterations, the computational time required to execute each iteration and the maximum number of threads that can be executed by a processor are used as a strategy to determine which parallel region that will produce optimal speedup.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES*
Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M.; Whitaker, Ross T.
2012-01-01
This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton–Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512–2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers. PMID:22641200
Unweighted least squares phase unwrapping by means of multigrid techniques
NASA Astrophysics Data System (ADS)
Pritt, Mark D.
1995-11-01
We present a multigrid algorithm for unweighted least squares phase unwrapping. This algorithm applies Gauss-Seidel relaxation schemes to solve the Poisson equation on smaller, coarser grids and transfers the intermediate results to the finer grids. This approach forms the basis of our multigrid algorithm for weighted least squares phase unwrapping, which is described in a separate paper. The key idea of our multigrid approach is to maintain the partial derivatives of the phase data in separate arrays and to correct these derivatives at the boundaries of the coarser grids. This maintains the boundary conditions necessary for rapid convergence to the correct solution. Although the multigrid algorithm is an iterative algorithm, we demonstrate that it is nearly as fast as the direct Fourier-based method. We also describe how to parallelize the algorithm for execution on a distributed-memory parallel processor computer or a network-cluster of workstations.
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2014-10-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Initial results of the code parallelization will be reported. Work is supported by the U.S. DOE SBIR program.
NASA Astrophysics Data System (ADS)
Bosch, Carl; Degirmenci, Soysal; Barlow, Jason; Mesika, Assaf; Politte, David G.; O'Sullivan, Joseph A.
2016-05-01
X-ray computed tomography reconstruction for medical, security and industrial applications has evolved through 40 years of experience with rotating gantry scanners using analytic reconstruction techniques such as filtered back projection (FBP). In parallel, research into statistical iterative reconstruction algorithms has evolved to apply to sparse view scanners in nuclear medicine, low data rate scanners in Positron Emission Tomography (PET) [5, 7, 10] and more recently to reduce exposure to ionizing radiation in conventional X-ray CT scanners. Multiple approaches to statistical iterative reconstruction have been developed based primarily on variations of expectation maximization (EM) algorithms. The primary benefit of EM algorithms is the guarantee of convergence that is maintained when iterative corrections are made within the limits of convergent algorithms. The primary disadvantage, however is that strict adherence to correction limits of convergent algorithms extends the number of iterations and ultimate timeline to complete a 3D volumetric reconstruction. Researchers have studied methods to accelerate convergence through more aggressive corrections [1], ordered subsets [1, 3, 4, 9] and spatially variant image updates. In this paper we describe the development of an AM reconstruction algorithm with accelerated convergence for use in a real-time explosive detection application for aviation security. By judiciously applying multiple acceleration techniques and advanced GPU processing architectures, we are able to perform 3D reconstruction of scanned passenger baggage at a rate of 75 slices per second. Analysis of the results on stream of commerce passenger bags demonstrates accelerated convergence by factors of 8 to 15, when comparing images from accelerated and strictly convergent algorithms.
Segmentation of remotely sensed data using parallel region growing
NASA Technical Reports Server (NTRS)
Tilton, J. C.; Cox, S. C.
1983-01-01
The improved spatial resolution of the new earth resources satellites will increase the need for effective utilization of spatial information in machine processing of remotely sensed data. One promising technique is scene segmentation by region growing. Region growing can use spatial information in two ways: only spatially adjacent regions merge together, and merging criteria can be based on region-wide spatial features. A simple region growing approach is described in which the similarity criterion is based on region mean and variance (a simple spatial feature). An effective way to implement region growing for remote sensing is as an iterative parallel process on a large parallel processor. A straightforward parallel pixel-based implementation of the algorithm is explored and its efficiency is compared with sequential pixel-based, sequential region-based, and parallel region-based implementations. Experimental results from on aircraft scanner data set are presented, as is a discussioon of proposed improvements to the segmentation algorithm.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Mattmann, C. A.; Waliser, D. E.; Kim, J.; Loikith, P.; Lee, H.; McGibbney, L. J.; Whitehall, K. D.
2014-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDF's in minutes to hours; and (5) Move RCMES into a near real time decision-making platform. We will report on: the architecture and design of SciSpark, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
Benzi, Michele; Evans, Thomas M.; Hamilton, Steven P.; ...
2017-03-05
Here, we consider hybrid deterministic-stochastic iterative algorithms for the solution of large, sparse linear systems. Starting from a convergent splitting of the coefficient matrix, we analyze various types of Monte Carlo acceleration schemes applied to the original preconditioned Richardson (stationary) iteration. We expect that these methods will have considerable potential for resiliency to faults when implemented on massively parallel machines. We also establish sufficient conditions for the convergence of the hybrid schemes, and we investigate different types of preconditioners including sparse approximate inverses. Numerical experiments on linear systems arising from the discretization of partial differential equations are presented.
A distributed-memory approximation algorithm for maximum weight perfect bipartite matching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Buluc, Aydin; Li, Xiaoye S.
We design and implement an efficient parallel approximation algorithm for the problem of maximum weight perfect matching in bipartite graphs, i.e. the problem of finding a set of non-adjacent edges that covers all vertices and has maximum weight. This problem differs from the maximum weight matching problem, for which scalable approximation algorithms are known. It is primarily motivated by finding good pivots in scalable sparse direct solvers before factorization where sequential implementations of maximum weight perfect matching algorithms, such as those available in MC64, are widely used due to the lack of scalable alternatives. To overcome this limitation, we proposemore » a fully parallel distributed memory algorithm that first generates a perfect matching and then searches for weightaugmenting cycles of length four in parallel and iteratively augments the matching with a vertex disjoint set of such cycles. For most practical problems the weights of the perfect matchings generated by our algorithm are very close to the optimum. An efficient implementation of the algorithm scales up to 256 nodes (17,408 cores) on a Cray XC40 supercomputer and can solve instances that are too large to be handled by a single node using the sequential algorithm.« less
Distributed Optimal Power Flow of AC/DC Interconnected Power Grid Using Synchronous ADMM
NASA Astrophysics Data System (ADS)
Liang, Zijun; Lin, Shunjiang; Liu, Mingbo
2017-05-01
Distributed optimal power flow (OPF) is of great importance and challenge to AC/DC interconnected power grid with different dispatching centres, considering the security and privacy of information transmission. In this paper, a fully distributed algorithm for OPF problem of AC/DC interconnected power grid called synchronous ADMM is proposed, and it requires no form of central controller. The algorithm is based on the fundamental alternating direction multiplier method (ADMM), by using the average value of boundary variables of adjacent regions obtained from current iteration as the reference values of both regions for next iteration, which realizes the parallel computation among different regions. The algorithm is tested with the IEEE 11-bus AC/DC interconnected power grid, and by comparing the results with centralized algorithm, we find it nearly no differences, and its correctness and effectiveness can be validated.
Dynamic phasing of multichannel cw laser radiation by means of a stochastic gradient algorithm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Volkov, V A; Volkov, M V; Garanin, S G
2013-09-30
The phasing of a multichannel laser beam by means of an iterative stochastic parallel gradient (SPG) algorithm has been numerically and experimentally investigated. The operation of the SPG algorithm is simulated, the acceptable range of amplitudes of probe phase shifts is found, and the algorithm parameters at which the desired Strehl number can be obtained with a minimum number of iterations are determined. An experimental bench with phase modulators based on lithium niobate, which are controlled by a multichannel electronic unit with a real-time microcontroller, has been designed. Phasing of 16 cw laser beams at a system response bandwidth ofmore » 3.7 kHz and phase thermal distortions in a frequency band of about 10 Hz is experimentally demonstrated. The experimental data are in complete agreement with the calculation results. (control of laser radiation parameters)« less
Vectorized and multitasked solution of the few-group neutron diffusion equations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zee, S.K.; Turinsky, P.J.; Shayer, Z.
1989-03-01
A numerical algorithm with parallelism was used to solve the two-group, multidimensional neutron diffusion equations on computers characterized by shared memory, vector pipeline, and multi-CPU architecture features. Specifically, solutions were obtained on the Cray X/MP-48, the IBM-3090 with vector facilities, and the FPS-164. The material-centered mesh finite difference method approximation and outer-inner iteration method were employed. Parallelism was introduced in the inner iterations using the cyclic line successive overrelaxation iterative method and solving in parallel across lines. The outer iterations were completed using the Chebyshev semi-iterative method that allows parallelism to be introduced in both space and energy groups. Formore » the three-dimensional model, power, soluble boron, and transient fission product feedbacks were included. Concentrating on the pressurized water reactor (PWR), the thermal-hydraulic calculation of moderator density assumed single-phase flow and a closed flow channel, allowing parallelism to be introduced in the solution across the radial plane. Using a pinwise detail, quarter-core model of a typical PWR in cycle 1, for the two-dimensional model without feedback the measured million floating point operations per second (MFLOPS)/vector speedups were 83/11.7. 18/2.2, and 2.4/5.6 on the Cray, IBM, and FPS without multitasking, respectively. Lower performance was observed with a coarser mesh, i.e., shorter vector length, due to vector pipeline start-up. For an 18 x 18 x 30 (x-y-z) three-dimensional model with feedback of the same core, MFLOPS/vector speedups of --61/6.7 and an execution time of 0.8 CPU seconds on the Cray without multitasking were measured. Finally, using two CPUs and the vector pipelines of the Cray, a multitasking efficiency of 81% was noted for the three-dimensional model.« less
Multilevel acceleration of scattering-source iterations with application to electron transport
Drumm, Clif; Fan, Wesley
2017-08-18
Acceleration/preconditioning strategies available in the SCEPTRE radiation transport code are described. A flexible transport synthetic acceleration (TSA) algorithm that uses a low-order discrete-ordinates (S N) or spherical-harmonics (P N) solve to accelerate convergence of a high-order S N source-iteration (SI) solve is described. Convergence of the low-order solves can be further accelerated by applying off-the-shelf incomplete-factorization or algebraic-multigrid methods. Also available is an algorithm that uses a generalized minimum residual (GMRES) iterative method rather than SI for convergence, using a parallel sweep-based solver to build up a Krylov subspace. TSA has been applied as a preconditioner to accelerate the convergencemore » of the GMRES iterations. The methods are applied to several problems involving electron transport and problems with artificial cross sections with large scattering ratios. These methods were compared and evaluated by considering material discontinuities and scattering anisotropy. Observed accelerations obtained are highly problem dependent, but speedup factors around 10 have been observed in typical applications.« less
Sequential and parallel image restoration: neural network implementations.
Figueiredo, M T; Leitao, J N
1994-01-01
Sequential and parallel image restoration algorithms and their implementations on neural networks are proposed. For images degraded by linear blur and contaminated by additive white Gaussian noise, maximum a posteriori (MAP) estimation and regularization theory lead to the same high dimension convex optimization problem. The commonly adopted strategy (in using neural networks for image restoration) is to map the objective function of the optimization problem into the energy of a predefined network, taking advantage of its energy minimization properties. Departing from this approach, we propose neural implementations of iterative minimization algorithms which are first proved to converge. The developed schemes are based on modified Hopfield (1985) networks of graded elements, with both sequential and parallel updating schedules. An algorithm supported on a fully standard Hopfield network (binary elements and zero autoconnections) is also considered. Robustness with respect to finite numerical precision is studied, and examples with real images are presented.
Density-matrix-based algorithm for solving eigenvalue problems
NASA Astrophysics Data System (ADS)
Polizzi, Eric
2009-03-01
A fast and stable numerical algorithm for solving the symmetric eigenvalue problem is presented. The technique deviates fundamentally from the traditional Krylov subspace iteration based techniques (Arnoldi and Lanczos algorithms) or other Davidson-Jacobi techniques and takes its inspiration from the contour integration and density-matrix representation in quantum mechanics. It will be shown that this algorithm—named FEAST—exhibits high efficiency, robustness, accuracy, and scalability on parallel architectures. Examples from electronic structure calculations of carbon nanotubes are presented, and numerical performances and capabilities are discussed.
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion
NASA Technical Reports Server (NTRS)
Bailey, David H.; Ferguson, Helaman R. P.
1988-01-01
Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Spotz, William F.
PyTrilinos is a set of Python interfaces to compiled Trilinos packages. This collection supports serial and parallel dense linear algebra, serial and parallel sparse linear algebra, direct and iterative linear solution techniques, algebraic and multilevel preconditioners, nonlinear solvers and continuation algorithms, eigensolvers and partitioning algorithms. Also included are a variety of related utility functions and classes, including distributed I/O, coloring algorithms and matrix generation. PyTrilinos vector objects are compatible with the popular NumPy Python package. As a Python front end to compiled libraries, PyTrilinos takes advantage of the flexibility and ease of use of Python, and the efficiency of themore » underlying C++, C and Fortran numerical kernels. This paper covers recent, previously unpublished advances in the PyTrilinos package.« less
Preconditioned conjugate gradient methods for the compressible Navier-Stokes equations
NASA Technical Reports Server (NTRS)
Venkatakrishnan, V.
1990-01-01
The compressible Navier-Stokes equations are solved for a variety of two-dimensional inviscid and viscous problems by preconditioned conjugate gradient-like algorithms. Roe's flux difference splitting technique is used to discretize the inviscid fluxes. The viscous terms are discretized by using central differences. An algebraic turbulence model is also incorporated. The system of linear equations which arises out of the linearization of a fully implicit scheme is solved iteratively by the well known methods of GMRES (Generalized Minimum Residual technique) and Chebyschev iteration. Incomplete LU factorization and block diagonal factorization are used as preconditioners. The resulting algorithm is competitive with the best current schemes, but has wide applications in parallel computing and unstructured mesh computations.
Efficient Iterative Methods Applied to the Solution of Transonic Flows
NASA Astrophysics Data System (ADS)
Wissink, Andrew M.; Lyrintzis, Anastasios S.; Chronopoulos, Anthony T.
1996-02-01
We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton's method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested; a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GMRES method. The preconditioner is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton-GMRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel thinking machine CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems.
Fault Tolerant Parallel Implementations of Iterative Algorithms for Optimal Control Problems
1988-01-21
p/.V)] steps, but did not discuss any specific parallel implementation. Gajski [51 improved upon this result by performing the SIMD computation in...N = p2. our approach reduces to that of [51, except that Gajski presents the coefficient computation and partial solution phases as a single...8217>. the SIMD algo- rithm presented by Gajski [5] can be most efficiently mapped to a unidirec- tional ring network with broadcasting capability. Based
2014-05-01
exact one is solved later — as- signed as step 5 of Algorithm 2 — because at each iteration , the ADMM updates the variables in the Gauss - Seidel ...Jacobi ADMM (see Algo- rithm 5 below). Unlike the Gauss - Seidel ADMM, the Jacobi ADMM updates all the 70 blocks in parallel at every iteration : xk+1i...that extending ADMM straightforwardly from the classic Gauss - Seidel setting to the Jacobi setting, from two blocks to multiple blocks, will preserve
Ergül, Özgür
2011-11-01
Fast and accurate solutions of large-scale electromagnetics problems involving homogeneous dielectric objects are considered. Problems are formulated with the electric and magnetic current combined-field integral equation and discretized with the Rao-Wilton-Glisson functions. Solutions are performed iteratively by using the multilevel fast multipole algorithm (MLFMA). For the solution of large-scale problems discretized with millions of unknowns, MLFMA is parallelized on distributed-memory architectures using a rigorous technique, namely, the hierarchical partitioning strategy. Efficiency and accuracy of the developed implementation are demonstrated on very large problems involving as many as 100 million unknowns.
Solution of partial differential equations on vector and parallel computers
NASA Technical Reports Server (NTRS)
Ortega, J. M.; Voigt, R. G.
1985-01-01
The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
On iterative processes in the Krylov-Sonneveld subspaces
NASA Astrophysics Data System (ADS)
Ilin, Valery P.
2016-10-01
The iterative Induced Dimension Reduction (IDR) methods are considered for solving large systems of linear algebraic equations (SLAEs) with nonsingular nonsymmetric matrices. These approaches are investigated by many authors and are charachterized sometimes as the alternative to the classical processes of Krylov type. The key moments of the IDR algorithms consist in the construction of the embedded Sonneveld subspaces, which have the decreasing dimensions and use the orthogonalization to some fixed subspace. Other independent approaches for research and optimization of the iterations are based on the augmented and modified Krylov subspaces by using the aggregation and deflation procedures with present various low rank approximations of the original matrices. The goal of this paper is to show, that IDR method in Sonneveld subspaces present an original interpretation of the modified algorithms in the Krylov subspaces. In particular, such description is given for the multi-preconditioned semi-conjugate direction methods which are actual for the parallel algebraic domain decomposition approaches.
Parallel computation of multigroup reactivity coefficient using iterative method
NASA Astrophysics Data System (ADS)
Susmikanti, Mike; Dewayatna, Winter
2013-09-01
One of the research activities to support the commercial radioisotope production program is a safety research target irradiation FPM (Fission Product Molybdenum). FPM targets form a tube made of stainless steel in which the nuclear degrees of superimposed high-enriched uranium. FPM irradiation tube is intended to obtain fission. The fission material widely used in the form of kits in the world of nuclear medicine. Irradiation FPM tube reactor core would interfere with performance. One of the disorders comes from changes in flux or reactivity. It is necessary to study a method for calculating safety terrace ongoing configuration changes during the life of the reactor, making the code faster became an absolute necessity. Neutron safety margin for the research reactor can be reused without modification to the calculation of the reactivity of the reactor, so that is an advantage of using perturbation method. The criticality and flux in multigroup diffusion model was calculate at various irradiation positions in some uranium content. This model has a complex computation. Several parallel algorithms with iterative method have been developed for the sparse and big matrix solution. The Black-Red Gauss Seidel Iteration and the power iteration parallel method can be used to solve multigroup diffusion equation system and calculated the criticality and reactivity coeficient. This research was developed code for reactivity calculation which used one of safety analysis with parallel processing. It can be done more quickly and efficiently by utilizing the parallel processing in the multicore computer. This code was applied for the safety limits calculation of irradiated targets FPM with increment Uranium.
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
Parallel approach for bioinspired algorithms
NASA Astrophysics Data System (ADS)
Zaporozhets, Dmitry; Zaruba, Daria; Kulieva, Nina
2018-05-01
In the paper, a probabilistic parallel approach based on the population heuristic, such as a genetic algorithm, is suggested. The authors proposed using a multithreading approach at the micro level at which new alternative solutions are generated. On each iteration, several threads that independently used the same population to generate new solutions can be started. After the work of all threads, a selection operator combines obtained results in the new population. To confirm the effectiveness of the suggested approach, the authors have developed software on the basis of which experimental computations can be carried out. The authors have considered a classic optimization problem – finding a Hamiltonian cycle in a graph. Experiments show that due to the parallel approach at the micro level, increment of running speed can be obtained on graphs with 250 and more vertices.
Capabilities of Fully Parallelized MHD Stability Code MARS
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2016-10-01
Results of full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. Parallel version of MARS, named PMARS, has been recently developed at FAR-TECH. Parallelized MARS is an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, implemented in MARS. Parallelization of the code included parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse vector iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the MARS algorithm using parallel libraries and procedures. Parallelized MARS is capable of calculating eigenmodes with significantly increased spatial resolution: up to 5,000 adapted radial grid points with up to 500 poloidal harmonics. Such resolution is sufficient for simulation of kink, tearing and peeling-ballooning instabilities with physically relevant parameters. Work is supported by the U.S. DOE SBIR program.
Fully Parallel MHD Stability Analysis Tool
NASA Astrophysics Data System (ADS)
Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang
2015-11-01
Progress on full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and it is widely used by fusion community. Parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulation of MHD instabilities with low, intermediate and high toroidal mode numbers within both fluid and kinetic plasma models, already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse iterations algorithm, implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is made by repeating steps of the present MARS algorithm using parallel libraries and procedures. Results of MARS parallelization and of the development of a new fix boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.
Region of interest processing for iterative reconstruction in x-ray computed tomography
NASA Astrophysics Data System (ADS)
Kopp, Felix K.; Nasirudin, Radin A.; Mei, Kai; Fehringer, Andreas; Pfeiffer, Franz; Rummeny, Ernst J.; Noël, Peter B.
2015-03-01
The recent advancements in the graphics card technology raised the performance of parallel computing and contributed to the introduction of iterative reconstruction methods for x-ray computed tomography in clinical CT scanners. Iterative maximum likelihood (ML) based reconstruction methods are known to reduce image noise and to improve the diagnostic quality of low-dose CT. However, iterative reconstruction of a region of interest (ROI), especially ML based, is challenging. But for some clinical procedures, like cardiac CT, only a ROI is needed for diagnostics. A high-resolution reconstruction of the full field of view (FOV) consumes unnecessary computation effort that results in a slower reconstruction than clinically acceptable. In this work, we present an extension and evaluation of an existing ROI processing algorithm. Especially improvements for the equalization between regions inside and outside of a ROI are proposed. The evaluation was done on data collected from a clinical CT scanner. The performance of the different algorithms is qualitatively and quantitatively assessed. Our solution to the ROI problem provides an increase in signal-to-noise ratio and leads to visually less noise in the final reconstruction. The reconstruction speed of our technique was observed to be comparable with other previous proposed techniques. The development of ROI processing algorithms in combination with iterative reconstruction will provide higher diagnostic quality in the near future.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoemmen, Mark
2010-11-01
Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
Simple Common Plane contact algorithm for explicit FE/FD methods
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vorobiev, O
2006-12-18
Common-plane (CP) algorithm is widely used in Discrete Element Method (DEM) to model contact forces between interacting particles or blocks. A new simple contact algorithm is proposed to model contacts in FE/FD methods which is similar to the CP algorithm. The CP is defined as a plane separating interacting faces of FE/FD mesh instead of blocks or particles used in the original CP method. The new method does not require iterations even for very stiff contacts. It is very robust and easy to implement both in 2D and 3D parallel codes.
Projection matrix acquisition for cone-beam computed tomography iterative reconstruction
NASA Astrophysics Data System (ADS)
Yang, Fuqiang; Zhang, Dinghua; Huang, Kuidong; Shi, Wenlong; Zhang, Caixin; Gao, Zongzhao
2017-02-01
Projection matrix is an essential and time-consuming part in computed tomography (CT) iterative reconstruction. In this article a novel calculation algorithm of three-dimensional (3D) projection matrix is proposed to quickly acquire the matrix for cone-beam CT (CBCT). The CT data needed to be reconstructed is considered as consisting of the three orthogonal sets of equally spaced and parallel planes, rather than the individual voxels. After getting the intersections the rays with the surfaces of the voxels, the coordinate points and vertex is compared to obtain the index value that the ray traversed. Without considering ray-slope to voxel, it just need comparing the position of two points. Finally, the computer simulation is used to verify the effectiveness of the algorithm.
F-8C adaptive control law refinement and software development
NASA Technical Reports Server (NTRS)
Hartmann, G. L.; Stein, G.
1981-01-01
An explicit adaptive control algorithm based on maximum likelihood estimation of parameters was designed. To avoid iterative calculations, the algorithm uses parallel channels of Kalman filters operating at fixed locations in parameter space. This algorithm was implemented in NASA/DFRC's Remotely Augmented Vehicle (RAV) facility. Real-time sensor outputs (rate gyro, accelerometer, surface position) are telemetered to a ground computer which sends new gain values to an on-board system. Ground test data and flight records were used to establish design values of noise statistics and to verify the ground-based adaptive software.
NASA Astrophysics Data System (ADS)
Pei, Zongrui; Eisenbach, Markus
2017-06-01
Dislocations are among the most important defects in determining the mechanical properties of both conventional alloys and high-entropy alloys. The Peierls-Nabarro model supplies an efficient pathway to their geometries and mobility. The difficulty in solving the integro-differential Peierls-Nabarro equation is how to effectively avoid the local minima in the energy landscape of a dislocation core. Among the other methods to optimize the dislocation core structures, we choose the algorithm of Particle Swarm Optimization, an algorithm that simulates the social behaviors of organisms. By employing more particles (bigger swarm) and more iterative steps (allowing them to explore for longer time), the local minima can be effectively avoided. But this would require more computational cost. The advantage of this algorithm is that it is readily parallelized in modern high computing architecture. We demonstrate the performance of our parallelized algorithm scales linearly with the number of employed cores.
Efficient iterative methods applied to the solution of transonic flows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wissink, A.M.; Lyrintzis, A.S.; Chronopoulos, A.T.
1996-02-01
We investigate the use of an inexact Newton`s method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have traditionally been employed for implicit solutions of this nonlinear equation. Instead, we apply Newton`s method using an exact analytical determination of the Jacobian with preconditioned conjugate gradient-like iterative solvers for solution of the linear systems in each Newton iteration. Two iterative solvers are tested; a block s-step version of the classical Orthomin(k) algorithm called orthogonal s-step Orthomin (OSOmin) and the well-known GIVIRES method. The preconditionermore » is a vectorizable and parallelizable version of incomplete LU (ILU) factorization. Efficiency of the Newton-Iterative method on vector and parallel computer architectures is the main issue addressed. In vectorized tests on a single processor of the Cray C-90, the performance of Newton-OSOmin is superior to Newton-GMRES and a more traditional monotone AF/ADI method (MAF) for a variety of transonic Mach numbers and mesh sizes. Newton- GIVIRES is superior to MAF for some cases. The parallel performance of the Newton method is also found to be very good on multiple processors of the Cray C-90 and on the massively parallel thinking machine CM-5, where very fast execution rates (up to 9 Gflops) are found for large problems. 38 refs., 14 figs., 7 tabs.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, Xujun; Li, Jiyuan; Jiang, Xikai
An efficient parallel Stokes’s solver is developed towards the complete inclusion of hydrodynamic interactions of Brownian particles in any geometry. A Langevin description of the particle dynamics is adopted, where the long-range interactions are included using a Green’s function formalism. We present a scalable parallel computational approach, where the general geometry Stokeslet is calculated following a matrix-free algorithm using the General geometry Ewald-like method. Our approach employs a highly-efficient iterative finite element Stokes’ solver for the accurate treatment of long-range hydrodynamic interactions within arbitrary confined geometries. A combination of mid-point time integration of the Brownian stochastic differential equation, the parallelmore » Stokes’ solver, and a Chebyshev polynomial approximation for the fluctuation-dissipation theorem result in an O(N) parallel algorithm. We also illustrate the new algorithm in the context of the dynamics of confined polymer solutions in equilibrium and non-equilibrium conditions. Our method is extended to treat suspended finite size particles of arbitrary shape in any geometry using an Immersed Boundary approach.« less
NASA Astrophysics Data System (ADS)
Zapata, M. A. Uh; Van Bang, D. Pham; Nguyen, K. D.
2016-05-01
This paper presents a parallel algorithm for the finite-volume discretisation of the Poisson equation on three-dimensional arbitrary geometries. The proposed method is formulated by using a 2D horizontal block domain decomposition and interprocessor data communication techniques with message passing interface. The horizontal unstructured-grid cells are reordered according to the neighbouring relations and decomposed into blocks using a load-balanced distribution to give all processors an equal amount of elements. In this algorithm, two parallel successive over-relaxation methods are presented: a multi-colour ordering technique for unstructured grids based on distributed memory and a block method using reordering index following similar ideas of the partitioning for structured grids. In all cases, the parallel algorithms are implemented with a combination of an acceleration iterative solver. This solver is based on a parabolic-diffusion equation introduced to obtain faster solutions of the linear systems arising from the discretisation. Numerical results are given to evaluate the performances of the methods showing speedups better than linear.
Zhao, Xujun; Li, Jiyuan; Jiang, Xikai; ...
2017-06-29
An efficient parallel Stokes’s solver is developed towards the complete inclusion of hydrodynamic interactions of Brownian particles in any geometry. A Langevin description of the particle dynamics is adopted, where the long-range interactions are included using a Green’s function formalism. We present a scalable parallel computational approach, where the general geometry Stokeslet is calculated following a matrix-free algorithm using the General geometry Ewald-like method. Our approach employs a highly-efficient iterative finite element Stokes’ solver for the accurate treatment of long-range hydrodynamic interactions within arbitrary confined geometries. A combination of mid-point time integration of the Brownian stochastic differential equation, the parallelmore » Stokes’ solver, and a Chebyshev polynomial approximation for the fluctuation-dissipation theorem result in an O(N) parallel algorithm. We also illustrate the new algorithm in the context of the dynamics of confined polymer solutions in equilibrium and non-equilibrium conditions. Our method is extended to treat suspended finite size particles of arbitrary shape in any geometry using an Immersed Boundary approach.« less
NASA Technical Reports Server (NTRS)
Chew, W. C.; Song, J. M.; Lu, C. C.; Weedon, W. H.
1995-01-01
In the first phase of our work, we have concentrated on laying the foundation to develop fast algorithms, including the use of recursive structure like the recursive aggregate interaction matrix algorithm (RAIMA), the nested equivalence principle algorithm (NEPAL), the ray-propagation fast multipole algorithm (RPFMA), and the multi-level fast multipole algorithm (MLFMA). We have also investigated the use of curvilinear patches to build a basic method of moments code where these acceleration techniques can be used later. In the second phase, which is mainly reported on here, we have concentrated on implementing three-dimensional NEPAL on a massively parallel machine, the Connection Machine CM-5, and have been able to obtain some 3D scattering results. In order to understand the parallelization of codes on the Connection Machine, we have also studied the parallelization of 3D finite-difference time-domain (FDTD) code with PML material absorbing boundary condition (ABC). We found that simple algorithms like the FDTD with material ABC can be parallelized very well allowing us to solve within a minute a problem of over a million nodes. In addition, we have studied the use of the fast multipole method and the ray-propagation fast multipole algorithm to expedite matrix-vector multiplication in a conjugate-gradient solution to integral equations of scattering. We find that these methods are faster than LU decomposition for one incident angle, but are slower than LU decomposition when many incident angles are needed as in the monostatic RCS calculations.
Regularization and computational methods for precise solution of perturbed orbit transfer problems
NASA Astrophysics Data System (ADS)
Woollands, Robyn Michele
The author has developed a suite of algorithms for solving the perturbed Lambert's problem in celestial mechanics. These algorithms have been implemented as a parallel computation tool that has broad applicability. This tool is composed of four component algorithms and each provides unique benefits for solving a particular type of orbit transfer problem. The first one utilizes a Keplerian solver (a-iteration) for solving the unperturbed Lambert's problem. This algorithm not only provides a "warm start" for solving the perturbed problem but is also used to identify which of several perturbed solvers is best suited for the job. The second algorithm solves the perturbed Lambert's problem using a variant of the modified Chebyshev-Picard iteration initial value solver that solves two-point boundary value problems. This method converges over about one third of an orbit and does not require a Newton-type shooting method and thus no state transition matrix needs to be computed. The third algorithm makes use of regularization of the differential equations through the Kustaanheimo-Stiefel transformation and extends the domain of convergence over which the modified Chebyshev-Picard iteration two-point boundary value solver will converge, from about one third of an orbit to almost a full orbit. This algorithm also does not require a Newton-type shooting method. The fourth algorithm uses the method of particular solutions and the modified Chebyshev-Picard iteration initial value solver to solve the perturbed two-impulse Lambert problem over multiple revolutions. The method of particular solutions is a shooting method but differs from the Newton-type shooting methods in that it does not require integration of the state transition matrix. The mathematical developments that underlie these four algorithms are derived in the chapters of this dissertation. For each of the algorithms, some orbit transfer test cases are included to provide insight on accuracy and efficiency of these individual algorithms. Following this discussion, the combined parallel algorithm, known as the unified Lambert tool, is presented and an explanation is given as to how it automatically selects which of the three perturbed solvers to compute the perturbed solution for a particular orbit transfer. The unified Lambert tool may be used to determine a single orbit transfer or for generating of an extremal field map. A case study is presented for a mission that is required to rendezvous with two pieces of orbit debris (spent rocket boosters). The unified Lambert tool software developed in this dissertation is already being utilized by several industrial partners and we are confident that it will play a significant role in practical applications, including solution of Lambert problems that arise in the current applications focused on enhanced space situational awareness.
Development of parallel algorithms for electrical power management in space applications
NASA Technical Reports Server (NTRS)
Berry, Frederick C.
1989-01-01
The application of parallel techniques for electrical power system analysis is discussed. The Newton-Raphson method of load flow analysis was used along with the decomposition-coordination technique to perform load flow analysis. The decomposition-coordination technique enables tasks to be performed in parallel by partitioning the electrical power system into independent local problems. Each independent local problem represents a portion of the total electrical power system on which a loan flow analysis can be performed. The load flow analysis is performed on these partitioned elements by using the Newton-Raphson load flow method. These independent local problems will produce results for voltage and power which can then be passed to the coordinator portion of the solution procedure. The coordinator problem uses the results of the local problems to determine if any correction is needed on the local problems. The coordinator problem is also solved by an iterative method much like the local problem. The iterative method for the coordination problem will also be the Newton-Raphson method. Therefore, each iteration at the coordination level will result in new values for the local problems. The local problems will have to be solved again along with the coordinator problem until some convergence conditions are met.
Quantum supercharger library: hyper-parallelism of the Hartree-Fock method.
Fernandes, Kyle D; Renison, C Alicia; Naidoo, Kevin J
2015-07-05
We present here a set of algorithms that completely rewrites the Hartree-Fock (HF) computations common to many legacy electronic structure packages (such as GAMESS-US, GAMESS-UK, and NWChem) into a massively parallel compute scheme that takes advantage of hardware accelerators such as Graphical Processing Units (GPUs). The HF compute algorithm is core to a library of routines that we name the Quantum Supercharger Library (QSL). We briefly evaluate the QSL's performance and report that it accelerates a HF 6-31G Self-Consistent Field (SCF) computation by up to 20 times for medium sized molecules (such as a buckyball) when compared with mature Central Processing Unit algorithms available in the legacy codes in regular use by researchers. It achieves this acceleration by massive parallelization of the one- and two-electron integrals and optimization of the SCF and Direct Inversion in the Iterative Subspace routines through the use of GPU linear algebra libraries. © 2015 Wiley Periodicals, Inc. © 2015 Wiley Periodicals, Inc.
NASA Technical Reports Server (NTRS)
Wigton, Larry
1996-01-01
Improving the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR is reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models is written. The primary focus of this work was devoted to improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
An improved parallel fuzzy connected image segmentation method based on CUDA.
Wang, Liansheng; Li, Dong; Huang, Shaohui
2016-05-12
Fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time will be greatly expensive. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Ying et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed by adding a correction step on the edge points. The improved algorithm can greatly enhance the calculation accuracy. In the improved method, an iterative manner is applied. In the first iteration, the affinity computation strategy is changed and a look up table is employed for memory reduction. In the second iteration, the error voxels because of asynchronism are updated again. Three different CT sequences of hepatic vascular with different sizes were used in the experiments with three different seeds. NVIDIA Tesla C2075 is used to evaluate our improved method over these three data sets. Experimental results show that the improved algorithm can achieve a faster segmentation compared to the CPU version and higher accuracy than CUDA-kFOE. The calculation results were consistent with the CPU version, which demonstrates that it corrects the edge point calculation error of the original CUDA-kFOE. The proposed method has a comparable time cost and has less errors compared to the original CUDA-kFOE as demonstrated in the experimental results. In the future, we will focus on automatic acquisition method and automatic processing.
Parallel Finite Element Domain Decomposition for Structural/Acoustic Analysis
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.; Tungkahotara, Siroj; Watson, Willie R.; Rajan, Subramaniam D.
2005-01-01
A domain decomposition (DD) formulation for solving sparse linear systems of equations resulting from finite element analysis is presented. The formulation incorporates mixed direct and iterative equation solving strategics and other novel algorithmic ideas that are optimized to take advantage of sparsity and exploit modern computer architecture, such as memory and parallel computing. The most time consuming part of the formulation is identified and the critical roles of direct sparse and iterative solvers within the framework of the formulation are discussed. Experiments on several computer platforms using several complex test matrices are conducted using software based on the formulation. Small-scale structural examples are used to validate thc steps in the formulation and large-scale (l,000,000+ unknowns) duct acoustic examples are used to evaluate the ORIGIN 2000 processors, and a duster of 6 PCs (running under the Windows environment). Statistics show that the formulation is efficient in both sequential and parallel computing environmental and that the formulation is significantly faster and consumes less memory than that based on one of the best available commercialized parallel sparse solvers.
State Transition Matrix for Perturbed Orbital Motion Using Modified Chebyshev Picard Iteration
NASA Astrophysics Data System (ADS)
Read, Julie L.; Younes, Ahmad Bani; Macomber, Brent; Turner, James; Junkins, John L.
2015-06-01
The Modified Chebyshev Picard Iteration (MCPI) method has recently proven to be highly efficient for a given accuracy compared to several commonly adopted numerical integration methods, as a means to solve for perturbed orbital motion. This method utilizes Picard iteration, which generates a sequence of path approximations, and Chebyshev Polynomials, which are orthogonal and also enable both efficient and accurate function approximation. The nodes consistent with discrete Chebyshev orthogonality are generated using cosine sampling; this strategy also reduces the Runge effect and as a consequence of orthogonality, there is no matrix inversion required to find the basis function coefficients. The MCPI algorithms considered herein are parallel-structured so that they are immediately well-suited for massively parallel implementation with additional speedup. MCPI has a wide range of applications beyond ephemeris propagation, including the propagation of the State Transition Matrix (STM) for perturbed two-body motion. A solution is achieved for a spherical harmonic series representation of earth gravity (EGM2008), although the methodology is suitable for application to any gravity model. Included in this representation the normalized, Associated Legendre Functions are given and verified numerically. Modifications of the classical algorithm techniques, such as rewriting the STM equations in a second-order cascade formulation, gives rise to additional speedup. Timing results for the baseline formulation and this second-order formulation are given.
NASA Technical Reports Server (NTRS)
Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke
1989-01-01
Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, generally, the same calculation are repeated every time step. However, we cannot apply to Do-all or the Do-across techniques for parallel processing of the simulation since there exist data dependencies from the end of an iteration to the beginning of the next iteration and furthermore data-input and data-output are required every sampling time period. Therefore, parallelism inside the calculation required for a single time step, or a large basic block which consists of arithmetic assignment statements, must be used. In the proposed method, near fine grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and assigned to processors by using optimal static scheduling at compile time in order to reduce large run time overhead caused by the use of near fine grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR) which has been developed to extract advantageous features of static scheduling algorithms to the maximum extent.
NASA Astrophysics Data System (ADS)
Furuichi, Mikito; Nishiura, Daisuke
2017-10-01
We developed dynamic load-balancing algorithms for Particle Simulation Methods (PSM) involving short-range interactions, such as Smoothed Particle Hydrodynamics (SPH), Moving Particle Semi-implicit method (MPS), and Discrete Element method (DEM). These are needed to handle billions of particles modeled in large distributed-memory computer systems. Our method utilizes flexible orthogonal domain decomposition, allowing the sub-domain boundaries in the column to be different for each row. The imbalances in the execution time between parallel logical processes are treated as a nonlinear residual. Load-balancing is achieved by minimizing the residual within the framework of an iterative nonlinear solver, combined with a multigrid technique in the local smoother. Our iterative method is suitable for adjusting the sub-domain frequently by monitoring the performance of each computational process because it is computationally cheaper in terms of communication and memory costs than non-iterative methods. Numerical tests demonstrated the ability of our approach to handle workload imbalances arising from a non-uniform particle distribution, differences in particle types, or heterogeneous computer architecture which was difficult with previously proposed methods. We analyzed the parallel efficiency and scalability of our method using Earth simulator and K-computer supercomputer systems.
A finite element solver for 3-D compressible viscous flows
NASA Technical Reports Server (NTRS)
Reddy, K. C.; Reddy, J. N.; Nayani, S.
1990-01-01
Computation of the flow field inside a space shuttle main engine (SSME) requires the application of state of the art computational fluid dynamic (CFD) technology. Several computer codes are under development to solve 3-D flow through the hot gas manifold. Some algorithms were designed to solve the unsteady compressible Navier-Stokes equations, either by implicit or explicit factorization methods, using several hundred or thousands of time steps to reach a steady state solution. A new iterative algorithm is being developed for the solution of the implicit finite element equations without assembling global matrices. It is an efficient iteration scheme based on a modified nonlinear Gauss-Seidel iteration with symmetric sweeps. The algorithm is analyzed for a model equation and is shown to be unconditionally stable. Results from a series of test problems are presented. The finite element code was tested for couette flow, which is flow under a pressure gradient between two parallel plates in relative motion. Another problem that was solved is viscous laminar flow over a flat plate. The general 3-D finite element code was used to compute the flow in an axisymmetric turnaround duct at low Mach numbers.
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Heber, Gerd; Biswas, Rupak
2000-01-01
The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
Efficient parallel implicit methods for rotary-wing aerodynamics calculations
NASA Astrophysics Data System (ADS)
Wissink, Andrew M.
Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good parallel performance on the IBM SP2, with OS-GCR giving slightly better performance than GMRES on large numbers of processors. For steady and quasi-steady calculations, the convergence rate is accelerated but the overall solution time remains about the same as the standard hybrid LU-SGS scheme. For unsteady calculations, however, the Newton method maintains a higher degree of time-accuracy which allows tbe use of larger timesteps and results in CPU savings of 20-35%.
NASA Astrophysics Data System (ADS)
Kobayashi, Kiyoshi; Suzuki, Tohru S.
2018-03-01
A new algorithm for the automatic estimation of an equivalent circuit and the subsequent parameter optimization is developed by combining the data-mining concept and complex least-squares method. In this algorithm, the program generates an initial equivalent-circuit model based on the sampling data and then attempts to optimize the parameters. The basic hypothesis is that the measured impedance spectrum can be reproduced by the sum of the partial-impedance spectra presented by the resistor, inductor, resistor connected in parallel to a capacitor, and resistor connected in parallel to an inductor. The adequacy of the model is determined by using a simple artificial-intelligence function, which is applied to the output function of the Levenberg-Marquardt module. From the iteration of model modifications, the program finds an adequate equivalent-circuit model without any user input to the equivalent-circuit model.
Parallel architectures for iterative methods on adaptive, block structured grids
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1983-01-01
A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one to one mapping of grids to systolic style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be a regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.
Iterative methods for 3D implicit finite-difference migration using the complex Padé approximation
NASA Astrophysics Data System (ADS)
Costa, Carlos A. N.; Campos, Itamara S.; Costa, Jessé C.; Neto, Francisco A.; Schleicher, Jörg; Novais, Amélia
2013-08-01
Conventional implementations of 3D finite-difference (FD) migration use splitting techniques to accelerate performance and save computational cost. However, such techniques are plagued with numerical anisotropy that jeopardises the correct positioning of dipping reflectors in the directions not used for the operator splitting. We implement 3D downward continuation FD migration without splitting using a complex Padé approximation. In this way, the numerical anisotropy is eliminated at the expense of a computationally more intensive solution of a large-band linear system. We compare the performance of the iterative stabilized biconjugate gradient (BICGSTAB) and that of the multifrontal massively parallel direct solver (MUMPS). It turns out that the use of the complex Padé approximation not only stabilizes the solution, but also acts as an effective preconditioner for the BICGSTAB algorithm, reducing the number of iterations as compared to the implementation using the real Padé expansion. As a consequence, the iterative BICGSTAB method is more efficient than the direct MUMPS method when solving a single term in the Padé expansion. The results of both algorithms, here evaluated by computing the migration impulse response in the SEG/EAGE salt model, are of comparable quality.
Pei, Zongrui; Max-Planck-Inst. fur Eisenforschung, Duseldorf; Eisenbach, Markus
2017-02-06
Dislocations are among the most important defects in determining the mechanical properties of both conventional alloys and high-entropy alloys. The Peierls-Nabarro model supplies an efficient pathway to their geometries and mobility. The difficulty in solving the integro-differential Peierls-Nabarro equation is how to effectively avoid the local minima in the energy landscape of a dislocation core. Among the other methods to optimize the dislocation core structures, we choose the algorithm of Particle Swarm Optimization, an algorithm that simulates the social behaviors of organisms. By employing more particles (bigger swarm) and more iterative steps (allowing them to explore for longer time), themore » local minima can be effectively avoided. But this would require more computational cost. The advantage of this algorithm is that it is readily parallelized in modern high computing architecture. We demonstrate the performance of our parallelized algorithm scales linearly with the number of employed cores.« less
Optimizing ion channel models using a parallel genetic algorithm on graphical processors.
Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon
2012-01-01
We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80 node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (Ordinary Differential Equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data intensive applications requiring iterative solutions of ODEs. Copyright © 2012 Elsevier B.V. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chacon, Luis; del-Castillo-Negrete, Diego; Hauck, Cory D.
2014-09-01
We propose a Lagrangian numerical algorithm for a time-dependent, anisotropic temperature transport equation in magnetized plasmas in the large guide field regime. The approach is based on an analytical integral formal solution of the parallel (i.e., along the magnetic field) transport equation with sources, and it is able to accommodate both local and non-local parallel heat flux closures. The numerical implementation is based on an operator-split formulation, with two straightforward steps: a perpendicular transport step (including sources), and a Lagrangian (field-line integral) parallel transport step. Algorithmically, the first step is amenable to the use of modern iterative methods, while themore » second step has a fixed cost per degree of freedom (and is therefore scalable). Accuracy-wise, the approach is free from the numerical pollution introduced by the discrete parallel transport term when the perpendicular to parallel transport coefficient ratio X ⊥ /X ∥ becomes arbitrarily small, and is shown to capture the correct limiting solution when ε = X⊥L 2 ∥/X1L 2 ⊥ → 0 (with L∥∙ L⊥ , the parallel and perpendicular diffusion length scales, respectively). Therefore, the approach is asymptotic-preserving. We demonstrate the capabilities of the scheme with several numerical experiments with varying magnetic field complexity in two dimensions, including the case of transport across a magnetic island.« less
Flight data processing with the F-8 adaptive algorithm
NASA Technical Reports Server (NTRS)
Hartmann, G.; Stein, G.; Petersen, K.
1977-01-01
An explicit adaptive control algorithm based on maximum likelihood estimation of parameters has been designed for NASA's DFBW F-8 aircraft. To avoid iterative calculations, the algorithm uses parallel channels of Kalman filters operating at fixed locations in parameter space. This algorithm has been implemented in NASA/DFRC's Remotely Augmented Vehicle (RAV) facility. Real-time sensor outputs (rate gyro, accelerometer and surface position) are telemetered to a ground computer which sends new gain values to an on-board system. Ground test data and flight records were used to establish design values of noise statistics and to verify the ground-based adaptive software. The software and its performance evaluation based on flight data are described
Incoherent beam combining based on the momentum SPGD algorithm
NASA Astrophysics Data System (ADS)
Yang, Guoqing; Liu, Lisheng; Jiang, Zhenhua; Guo, Jin; Wang, Tingfeng
2018-05-01
Incoherent beam combining (ICBC) technology is one of the most promising ways to achieve high-energy, near-diffraction laser output. In this paper, the momentum method is proposed as a modification of the stochastic parallel gradient descent (SPGD) algorithm. The momentum method can improve the speed of convergence of the combining system efficiently. The analytical method is employed to interpret the principle of the momentum method. Furthermore, the proposed algorithm is testified through simulations as well as experiments. The results of the simulations and the experiments show that the proposed algorithm not only accelerates the speed of the iteration, but also keeps the stability of the combining process. Therefore the feasibility of the proposed algorithm in the beam combining system is testified.
On some Aitken-like acceleration of the Schwarz method
NASA Astrophysics Data System (ADS)
Garbey, M.; Tromeur-Dervout, D.
2002-12-01
In this paper we present a family of domain decomposition based on Aitken-like acceleration of the Schwarz method seen as an iterative procedure with a linear rate of convergence. We first present the so-called Aitken-Schwarz procedure for linear differential operators. The solver can be a direct solver when applied to the Helmholtz problem with five-point finite difference scheme on regular grids. We then introduce the Steffensen-Schwarz variant which is an iterative domain decomposition solver that can be applied to linear and nonlinear problems. We show that these solvers have reasonable numerical efficiency compared to classical fast solvers for the Poisson problem or multigrids for more general linear and nonlinear elliptic problems. However, the salient feature of our method is that our algorithm has high tolerance to slow network in the context of distributed parallel computing and is attractive, generally speaking, to use with computer architecture for which performance is limited by the memory bandwidth rather than the flop performance of the CPU. This is nowadays the case for most parallel. computer using the RISC processor architecture. We will illustrate this highly desirable property of our algorithm with large-scale computing experiments.
Wei, Qinglai; Liu, Derong; Lin, Qiao
In this paper, a novel local value iteration adaptive dynamic programming (ADP) algorithm is developed to solve infinite horizon optimal control problems for discrete-time nonlinear systems. The focuses of this paper are to study admissibility properties and the termination criteria of discrete-time local value iteration ADP algorithms. In the discrete-time local value iteration ADP algorithm, the iterative value functions and the iterative control laws are both updated in a given subset of the state space in each iteration, instead of the whole state space. For the first time, admissibility properties of iterative control laws are analyzed for the local value iteration ADP algorithm. New termination criteria are established, which terminate the iterative local ADP algorithm with an admissible approximate optimal control law. Finally, simulation results are given to illustrate the performance of the developed algorithm.In this paper, a novel local value iteration adaptive dynamic programming (ADP) algorithm is developed to solve infinite horizon optimal control problems for discrete-time nonlinear systems. The focuses of this paper are to study admissibility properties and the termination criteria of discrete-time local value iteration ADP algorithms. In the discrete-time local value iteration ADP algorithm, the iterative value functions and the iterative control laws are both updated in a given subset of the state space in each iteration, instead of the whole state space. For the first time, admissibility properties of iterative control laws are analyzed for the local value iteration ADP algorithm. New termination criteria are established, which terminate the iterative local ADP algorithm with an admissible approximate optimal control law. Finally, simulation results are given to illustrate the performance of the developed algorithm.
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.
Slażyński, Leszek; Bohte, Sander
2012-01-01
The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second.
A novel highly parallel algorithm for linearly unmixing hyperspectral images
NASA Astrophysics Data System (ADS)
Guerra, Raúl; López, Sebastián.; Callico, Gustavo M.; López, Jose F.; Sarmiento, Roberto
2014-10-01
Endmember extraction and abundances calculation represent critical steps within the process of linearly unmixing a given hyperspectral image because of two main reasons. The first one is due to the need of computing a set of accurate endmembers in order to further obtain confident abundance maps. The second one refers to the huge amount of operations involved in these time-consuming processes. This work proposes an algorithm to estimate the endmembers of a hyperspectral image under analysis and its abundances at the same time. The main advantage of this algorithm is its high parallelization degree and the mathematical simplicity of the operations implemented. This algorithm estimates the endmembers as virtual pixels. In particular, the proposed algorithm performs the descent gradient method to iteratively refine the endmembers and the abundances, reducing the mean square error, according with the linear unmixing model. Some mathematical restrictions must be added so the method converges in a unique and realistic solution. According with the algorithm nature, these restrictions can be easily implemented. The results obtained with synthetic images demonstrate the well behavior of the algorithm proposed. Moreover, the results obtained with the well-known Cuprite dataset also corroborate the benefits of our proposal.
A dual-processor multi-frequency implementation of the FINDS algorithm
NASA Technical Reports Server (NTRS)
Godiwala, Pankaj M.; Caglayan, Alper K.
1987-01-01
This report presents a parallel processing implementation of the FINDS (Fault Inferring Nonlinear Detection System) algorithm on a dual processor configured target flight computer. First, a filter initialization scheme is presented which allows the no-fail filter (NFF) states to be initialized using the first iteration of the flight data. A modified failure isolation strategy, compatible with the new failure detection strategy reported earlier, is discussed and the performance of the new FDI algorithm is analyzed using flight recorded data from the NASA ATOPS B-737 aircraft in a Microwave Landing System (MLS) environment. The results show that low level MLS, IMU, and IAS sensor failures are detected and isolated instantaneously, while accelerometer and rate gyro failures continue to take comparatively longer to detect and isolate. The parallel implementation is accomplished by partitioning the FINDS algorithm into two parts: one based on the translational dynamics and the other based on the rotational kinematics. Finally, a multi-rate implementation of the algorithm is presented yielding significantly low execution times with acceptable estimation and FDI performance.
SPIRiT: Iterative Self-consistent Parallel Imaging Reconstruction from Arbitrary k-Space
Lustig, Michael; Pauly, John M.
2010-01-01
A new approach to autocalibrating, coil-by-coil parallel imaging reconstruction is presented. It is a generalized reconstruction framework based on self consistency. The reconstruction problem is formulated as an optimization that yields the most consistent solution with the calibration and acquisition data. The approach is general and can accurately reconstruct images from arbitrary k-space sampling patterns. The formulation can flexibly incorporate additional image priors such as off-resonance correction and regularization terms that appear in compressed sensing. Several iterative strategies to solve the posed reconstruction problem in both image and k-space domain are presented. These are based on a projection over convex sets (POCS) and a conjugate gradient (CG) algorithms. Phantom and in-vivo studies demonstrate efficient reconstructions from undersampled Cartesian and spiral trajectories. Reconstructions that include off-resonance correction and nonlinear ℓ1-wavelet regularization are also demonstrated. PMID:20665790
NASA Astrophysics Data System (ADS)
Arteaga, Santiago Egido
1998-12-01
The steady-state Navier-Stokes equations are of considerable interest because they are used to model numerous common physical phenomena. The applications encountered in practice often involve small viscosities and complicated domain geometries, and they result in challenging problems in spite of the vast attention that has been dedicated to them. In this thesis we examine methods for computing the numerical solution of the primitive variable formulation of the incompressible equations on distributed memory parallel computers. We use the Galerkin method to discretize the differential equations, although most results are stated so that they apply also to stabilized methods. We also reformulate some classical results in a single framework and discuss some issues frequently dismissed in the literature, such as the implementation of pressure space basis and non- homogeneous boundary values. We consider three nonlinear methods: Newton's method, Oseen's (or Picard) iteration, and sequences of Stokes problems. All these iterative nonlinear methods require solving a linear system at every step. Newton's method has quadratic convergence while that of the others is only linear; however, we obtain theoretical bounds showing that Oseen's iteration is more robust, and we confirm it experimentally. In addition, although Oseen's iteration usually requires more iterations than Newton's method, the linear systems it generates tend to be simpler and its overall costs (in CPU time) are lower. The Stokes problems result in linear systems which are easier to solve, but its convergence is much slower, so that it is competitive only for large viscosities. Inexact versions of these methods are studied, and we explain why the best timings are obtained using relatively modest error tolerances in solving the corresponding linear systems. We also present a new damping optimization strategy based on the quadratic nature of the Navier-Stokes equations, which improves the robustness of all the linearization strategies considered and whose computational cost is negligible. The algebraic properties of these systems depend on both the discretization and nonlinear method used. We study in detail the positive definiteness and skewsymmetry of the advection submatrices (essentially, convection-diffusion problems). We propose a discretization based on a new trilinear form for Newton's method. We solve the linear systems using three Krylov subspace methods, GMRES, QMR and TFQMR, and compare the advantages of each. Our emphasis is on parallel algorithms, and so we consider preconditioners suitable for parallel computers such as line variants of the Jacobi and Gauss- Seidel methods, alternating direction implicit methods, and Chebyshev and least squares polynomial preconditioners. These work well for moderate viscosities (moderate Reynolds number). For small viscosities we show that effective parallel solution of the advection subproblem is a critical factor to improve performance. Implementation details on a CM-5 are presented.
Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip
2014-02-28
In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations.
Lagardère, Louis; Lipparini, Filippo; Polack, Étienne; Stamm, Benjamin; Cancès, Éric; Schnieders, Michael; Ren, Pengyu; Maday, Yvon; Piquemal, Jean-Philip
2015-01-01
In this paper, we present a scalable and efficient implementation of point dipole-based polarizable force fields for molecular dynamics (MD) simulations with periodic boundary conditions (PBC). The Smooth Particle-Mesh Ewald technique is combined with two optimal iterative strategies, namely, a preconditioned conjugate gradient solver and a Jacobi solver in conjunction with the Direct Inversion in the Iterative Subspace for convergence acceleration, to solve the polarization equations. We show that both solvers exhibit very good parallel performances and overall very competitive timings in an energy-force computation needed to perform a MD step. Various tests on large systems are provided in the context of the polarizable AMOEBA force field as implemented in the newly developed Tinker-HP package which is the first implementation for a polarizable model making large scale experiments for massively parallel PBC point dipole models possible. We show that using a large number of cores offers a significant acceleration of the overall process involving the iterative methods within the context of spme and a noticeable improvement of the memory management giving access to very large systems (hundreds of thousands of atoms) as the algorithm naturally distributes the data on different cores. Coupled with advanced MD techniques, gains ranging from 2 to 3 orders of magnitude in time are now possible compared to non-optimized, sequential implementations giving new directions for polarizable molecular dynamics in periodic boundary conditions using massively parallel implementations. PMID:26512230
NASA Astrophysics Data System (ADS)
Zerr, Robert Joseph
2011-12-01
The integral transport matrix method (ITMM) has been used as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells and between the cells and boundary surfaces. The main goals of this work were to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and performance of the developed methods for increasing number of processes. This project compares the effectiveness of the ITMM with the SI scheme parallelized with the Koch-Baker-Alcouffe (KBA) method. The primary parallel solution method involves a decomposition of the domain into smaller spatial sub-domains, each with their own transport matrices, and coupled together via interface boundary angular fluxes. Each sub-domain has its own set of ITMM operators and represents an independent transport problem. Multiple iterative parallel solution methods have investigated, including parallel block Jacobi (PBJ), parallel red/black Gauss-Seidel (PGS), and parallel GMRES (PGMRES). The fastest observed parallel solution method, PGS, was used in a weak scaling comparison with the PARTISN code. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method without acceleration/preconditioning is not competitive for any problem parameters considered. The best comparisons occur for problems that are difficult for SI DSA, namely highly scattering and optically thick. SI DSA execution time curves are generally steeper than the PGS ones. However, until further testing is performed it cannot be concluded that SI DSA does not outperform the ITMM with PGS even on several thousand or tens of thousands of processors. The PGS method does outperform SI DSA for the periodic heterogeneous layers (PHL) configuration problems. Although this demonstrates a relative strength/weakness between the two methods, the practicality of these problems is much less, further limiting instances where it would be beneficial to select ITMM over SI DSA. The results strongly indicate a need for a robust, stable, and efficient acceleration method (or preconditioner for PGMRES). The spatial multigrid (SMG) method is currently incomplete in that it does not work for all cases considered and does not effectively improve the convergence rate for all values of scattering ratio c or cell dimension h. Nevertheless, it does display the desired trend for highly scattering, optically thin problems. That is, it tends to lower the rate of growth of number of iterations with increasing number of processes, P, while not increasing the number of additional operations per iteration to the extent that the total execution time of the rapidly converging accelerated iterations exceeds that of the slower unaccelerated iterations. A predictive parallel performance model has been developed for the PBJ method. Timing tests were performed such that trend lines could be fitted to the data for the different components and used to estimate the execution times. Applied to the weak scaling results, the model notably underestimates construction time, but combined with a slight overestimation in iterative solution time, the model predicts total execution time very well for large P. It also does a decent job with the strong scaling results, closely predicting the construction time and time per iteration, especially as P increases. Although not shown to be competitive up to 1,024 processing elements with the current state of the art, the parallelized ITMM exhibits promising scaling trends. Ultimately, compared to the KBA method, the parallelized ITMM may be found to be a very attractive option for transport calculations spatially decomposed over several tens of thousands of processes. Acceleration/preconditioning of the parallelized ITMM once developed will improve the convergence rate and improve its competitiveness. (Abstract shortened by UMI.)
A transient FETI methodology for large-scale parallel implicit computations in structural mechanics
NASA Technical Reports Server (NTRS)
Farhat, Charbel; Crivelli, Luis; Roux, Francois-Xavier
1992-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet -- and perhaps will never -- be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared and local memory parallel processors, and that is both computationally and communication-wise efficient. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design
NASA Technical Reports Server (NTRS)
Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris
2000-01-01
Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and Os, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency issues in the GA, it is possible to have idle processors. However, as long as the load at each processing node is similar, the processors are kept busy nearly all of the time. In applying GAs to circuit design, a suitable genetic representation 'is that of a circuit-construction program. We discuss one such circuit-construction programming language and show how evolution can generate useful analog circuit designs. This language has the desirable property that virtually all sets of combinations of primitives result in valid circuit graphs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. Using a parallel genetic algorithm and circuit simulation software, we present experimental results as applied to three analog filter and two amplifier design tasks. For example, a figure shows an 85 dB amplifier design evolved by our system, and another figure shows the performance of that circuit (gain and frequency response). In all tasks, our system is able to generate circuits that achieve the target specifications.
Parallel multigrid smoothing: polynomial versus Gauss-Seidel
NASA Astrophysics Data System (ADS)
Adams, Mark; Brezina, Marian; Hu, Jonathan; Tuminaro, Ray
2003-07-01
Gauss-Seidel is often the smoother of choice within multigrid applications. In the context of unstructured meshes, however, maintaining good parallel efficiency is difficult with multiplicative iterative methods such as Gauss-Seidel. This leads us to consider alternative smoothers. We discuss the computational advantages of polynomial smoothers within parallel multigrid algorithms for positive definite symmetric systems. Two particular polynomials are considered: Chebyshev and a multilevel specific polynomial. The advantages of polynomial smoothing over traditional smoothers such as Gauss-Seidel are illustrated on several applications: Poisson's equation, thin-body elasticity, and eddy current approximations to Maxwell's equations. While parallelizing the Gauss-Seidel method typically involves a compromise between a scalable convergence rate and maintaining high flop rates, polynomial smoothers achieve parallel scalable multigrid convergence rates without sacrificing flop rates. We show that, although parallel computers are the main motivation, polynomial smoothers are often surprisingly competitive with Gauss-Seidel smoothers on serial machines.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Sankar, Lakshmi; Hixon, Duane
1993-01-01
The work done under this project was documented in detail as the Ph. D. dissertation of Dr. Duane Hixon. The objectives of the research project were evaluation of the generalized minimum residual method (GMRES) as a tool for accelerating 2-D and 3-D unsteady flows and evaluation of the suitability of the GMRES algorithm for unsteady flows, computed on parallel computer architectures.
Muckley, Matthew J; Noll, Douglas C; Fessler, Jeffrey A
2015-02-01
Sparsity-promoting regularization is useful for combining compressed sensing assumptions with parallel MRI for reducing scan time while preserving image quality. Variable splitting algorithms are the current state-of-the-art algorithms for SENSE-type MR image reconstruction with sparsity-promoting regularization. These methods are very general and have been observed to work with almost any regularizer; however, the tuning of associated convergence parameters is a commonly-cited hindrance in their adoption. Conversely, majorize-minimize algorithms based on a single Lipschitz constant have been observed to be slow in shift-variant applications such as SENSE-type MR image reconstruction since the associated Lipschitz constants are loose bounds for the shift-variant behavior. This paper bridges the gap between the Lipschitz constant and the shift-variant aspects of SENSE-type MR imaging by introducing majorizing matrices in the range of the regularizer matrix. The proposed majorize-minimize methods (called BARISTA) converge faster than state-of-the-art variable splitting algorithms when combined with momentum acceleration and adaptive momentum restarting. Furthermore, the tuning parameters associated with the proposed methods are unitless convergence tolerances that are easier to choose than the constraint penalty parameters required by variable splitting algorithms.
Noll, Douglas C.; Fessler, Jeffrey A.
2014-01-01
Sparsity-promoting regularization is useful for combining compressed sensing assumptions with parallel MRI for reducing scan time while preserving image quality. Variable splitting algorithms are the current state-of-the-art algorithms for SENSE-type MR image reconstruction with sparsity-promoting regularization. These methods are very general and have been observed to work with almost any regularizer; however, the tuning of associated convergence parameters is a commonly-cited hindrance in their adoption. Conversely, majorize-minimize algorithms based on a single Lipschitz constant have been observed to be slow in shift-variant applications such as SENSE-type MR image reconstruction since the associated Lipschitz constants are loose bounds for the shift-variant behavior. This paper bridges the gap between the Lipschitz constant and the shift-variant aspects of SENSE-type MR imaging by introducing majorizing matrices in the range of the regularizer matrix. The proposed majorize-minimize methods (called BARISTA) converge faster than state-of-the-art variable splitting algorithms when combined with momentum acceleration and adaptive momentum restarting. Furthermore, the tuning parameters associated with the proposed methods are unitless convergence tolerances that are easier to choose than the constraint penalty parameters required by variable splitting algorithms. PMID:25330484
NASA Astrophysics Data System (ADS)
Macomber, B.; Woollands, R. M.; Probe, A.; Younes, A.; Bai, X.; Junkins, J.
2013-09-01
Modified Chebyshev Picard Iteration (MCPI) is an iterative numerical method for approximating solutions of linear or non-linear Ordinary Differential Equations (ODEs) to obtain time histories of system state trajectories. Unlike other step-by-step differential equation solvers, the Runge-Kutta family of numerical integrators for example, MCPI approximates long arcs of the state trajectory with an iterative path approximation approach, and is ideally suited to parallel computation. Orthogonal Chebyshev Polynomials are used as basis functions during each path iteration; the integrations of the Picard iteration are then done analytically. Due to the orthogonality of the Chebyshev basis functions, the least square approximations are computed without matrix inversion; the coefficients are computed robustly from discrete inner products. As a consequence of discrete sampling and weighting adopted for the inner product definition, Runge phenomena errors are minimized near the ends of the approximation intervals. The MCPI algorithm utilizes a vector-matrix framework for computational efficiency. Additionally, all Chebyshev coefficients and integrand function evaluations are independent, meaning they can be simultaneously computed in parallel for further decreased computational cost. Over an order of magnitude speedup from traditional methods is achieved in serial processing, and an additional order of magnitude is achievable in parallel architectures. This paper presents a new MCPI library, a modular toolset designed to allow MCPI to be easily applied to a wide variety of ODE systems. Library users will not have to concern themselves with the underlying mathematics behind the MCPI method. Inputs are the boundary conditions of the dynamical system, the integrand function governing system behavior, and the desired time interval of integration, and the output is a time history of the system states over the interval of interest. Examples from the field of astrodynamics are presented to compare the output from the MCPI library to current state-of-practice numerical integration methods. It is shown that MCPI is capable of out-performing the state-of-practice in terms of computational cost and accuracy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Azevedo, Eduardo; Abbott, Stephen; Koskela, Tuomas
The XGC fusion gyrokinetic code combines state-of-the-art, portable computational and algorithmic technologies to enable complicated multiscale simulations of turbulence and transport dynamics in ITER edge plasma on the largest US open-science computer, the CRAY XK7 Titan, at its maximal heterogeneous capability, which have not been possible before due to a factor of over 10 shortage in the time-to-solution for less than 5 days of wall-clock time for one physics case. Frontier techniques such as nested OpenMP parallelism, adaptive parallel I/O, staging I/O and data reduction using dynamic and asynchronous applications interactions, dynamic repartitioning.
A parallel computing engine for a class of time critical processes.
Nabhan, T M; Zomaya, A Y
1997-01-01
This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
Scaling Optimization of the SIESTA MHD Code
NASA Astrophysics Data System (ADS)
Seal, Sudip; Hirshman, Steven; Perumalla, Kalyan
2013-10-01
SIESTA is a parallel three-dimensional plasma equilibrium code capable of resolving magnetic islands at high spatial resolutions for toroidal plasmas. Originally designed to exploit small-scale parallelism, SIESTA has now been scaled to execute efficiently over several thousands of processors P. This scaling improvement was accomplished with minimal intrusion to the execution flow of the original version. First, the efficiency of the iterative solutions was improved by integrating the parallel tridiagonal block solver code BCYCLIC. Krylov-space generation in GMRES was then accelerated using a customized parallel matrix-vector multiplication algorithm. Novel parallel Hessian generation algorithms were integrated and memory access latencies were dramatically reduced through loop nest optimizations and data layout rearrangement. These optimizations sped up equilibria calculations by factors of 30-50. It is possible to compute solutions with granularity N/P near unity on extremely fine radial meshes (N > 1024 points). Grid separation in SIESTA, which manifests itself primarily in the resonant components of the pressure far from rational surfaces, is strongly suppressed by finer meshes. Large problem sizes of up to 300 K simultaneous non-linear coupled equations have been solved on the NERSC supercomputers. Work supported by U.S. DOE under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.
A highly parallel multigrid-like method for the solution of the Euler equations
NASA Technical Reports Server (NTRS)
Tuminaro, Ray S.
1989-01-01
We consider a highly parallel multigrid-like method for the solution of the two dimensional steady Euler equations. The new method, introduced as filtering multigrid, is similar to a standard multigrid scheme in that convergence on the finest grid is accelerated by iterations on coarser grids. In the filtering method, however, additional fine grid subproblems are processed concurrently with coarse grid computations to further accelerate convergence. These additional problems are obtained by splitting the residual into a smooth and an oscillatory component. The smooth component is then used to form a coarse grid problem (similar to standard multigrid) while the oscillatory component is used for a fine grid subproblem. The primary advantage in the filtering approach is that fewer iterations are required and that most of the additional work per iteration can be performed in parallel with the standard coarse grid computations. We generalize the filtering algorithm to a version suitable for nonlinear problems. We emphasize that this generalization is conceptually straight-forward and relatively easy to implement. In particular, no explicit linearization (e.g., formation of Jacobians) needs to be performed (similar to the FAS multigrid approach). We illustrate the nonlinear version by applying it to the Euler equations, and presenting numerical results. Finally, a performance evaluation is made based on execution time models and convergence information obtained from numerical experiments.
Reconstruction of multiple-pinhole micro-SPECT data using origin ensembles.
Lyon, Morgan C; Sitek, Arkadiusz; Metzler, Scott D; Moore, Stephen C
2016-10-01
The authors are currently developing a dual-resolution multiple-pinhole microSPECT imaging system based on three large NaI(Tl) gamma cameras. Two multiple-pinhole tungsten collimator tubes will be used sequentially for whole-body "scout" imaging of a mouse, followed by high-resolution (hi-res) imaging of an organ of interest, such as the heart or brain. Ideally, the whole-body image will be reconstructed in real time such that data need only be acquired until the area of interest can be visualized well-enough to determine positioning for the hi-res scan. The authors investigated the utility of the origin ensemble (OE) algorithm for online and offline reconstructions of the scout data. This algorithm operates directly in image space, and can provide estimates of image uncertainty, along with reconstructed images. Techniques for accelerating the OE reconstruction were also introduced and evaluated. System matrices were calculated for our 39-pinhole scout collimator design. SPECT projections were simulated for a range of count levels using the MOBY digital mouse phantom. Simulated data were used for a comparison of OE and maximum-likelihood expectation maximization (MLEM) reconstructions. The OE algorithm convergence was evaluated by calculating the total-image entropy and by measuring the counts in a volume-of-interest (VOI) containing the heart. Total-image entropy was also calculated for simulated MOBY data reconstructed using OE with various levels of parallelization. For VOI measurements in the heart, liver, bladder, and soft-tissue, MLEM and OE reconstructed images agreed within 6%. Image entropy converged after ∼2000 iterations of OE, while the counts in the heart converged earlier at ∼200 iterations of OE. An accelerated version of OE completed 1000 iterations in <9 min for a 6.8M count data set, with some loss of image entropy performance, whereas the same dataset required ∼79 min to complete 1000 iterations of conventional OE. A combination of the two methods showed decreased reconstruction time and no loss of performance when compared to conventional OE alone. OE-reconstructed images were found to be quantitatively and qualitatively similar to MLEM, yet OE also provided estimates of image uncertainty. Some acceleration of the reconstruction can be gained through the use of parallel computing. The OE algorithm is useful for reconstructing multiple-pinhole SPECT data and can be easily modified for real-time reconstruction.
Coherent diffraction imaging by moving a lens.
Shen, Cheng; Tan, Jiubin; Wei, Ce; Liu, Zhengjun
2016-07-25
A moveable lens is used for determining amplitude and phase on the object plane. The extended fractional Fourier transform is introduced to address the single lens imaging. We put forward a fast algorithm for the transform by convolution. Combined with parallel iterative phase retrieval algorithm, it is applied to reconstruct the complex amplitude of the object. Compared with inline holography, the implementation of our method is simple and easy. Without the oversampling operation, the computational load is less. Also the proposed method has a superiority of accuracy over the direct focusing measurement for the imaging of small size objects.
Tomographic image reconstruction using the cell broadband engine (CBE) general purpose hardware
NASA Astrophysics Data System (ADS)
Knaup, Michael; Steckmann, Sven; Bockenbach, Olivier; Kachelrieß, Marc
2007-02-01
Tomographic image reconstruction, such as the reconstruction of CT projection values, of tomosynthesis data, PET or SPECT events, is computational very demanding. In filtered backprojection as well as in iterative reconstruction schemes, the most time-consuming steps are forward- and backprojection which are often limited by the memory bandwidth. Recently, a novel general purpose architecture optimized for distributed computing became available: the Cell Broadband Engine (CBE). Its eight synergistic processing elements (SPEs) currently allow for a theoretical performance of 192 GFlops (3 GHz, 8 units, 4 floats per vector, 2 instructions, multiply and add, per clock). To maximize image reconstruction speed we modified our parallel-beam and perspective backprojection algorithms which are highly optimized for standard PCs, and optimized the code for the CBE processor. 1-3 In addition, we implemented an optimized perspective forwardprojection on the CBE which allows us to perform statistical image reconstructions like the ordered subset convex (OSC) algorithm. 4 Performance was measured using simulated data with 512 projections per rotation and 5122 detector elements. The data were backprojected into an image of 512 3 voxels using our PC-based approaches and the new CBE- based algorithms. Both the PC and the CBE timings were scaled to a 3 GHz clock frequency. On the CBE, we obtain total reconstruction times of 4.04 s for the parallel backprojection, 13.6 s for the perspective backprojection and 192 s for a complete OSC reconstruction, consisting of one initial Feldkamp reconstruction, followed by 4 OSC iterations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J.T.
1993-10-01
This report contain papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernal of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architechtures; Compiling technique based on dataflow analysis for funtional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor;more » A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architechture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.« less
Virtual fringe projection system with nonparallel illumination based on iteration
NASA Astrophysics Data System (ADS)
Zhou, Duo; Wang, Zhangying; Gao, Nan; Zhang, Zonghua; Jiang, Xiangqian
2017-06-01
Fringe projection profilometry has been widely applied in many fields. To set up an ideal measuring system, a virtual fringe projection technique has been studied to assist in the design of hardware configurations. However, existing virtual fringe projection systems use parallel illumination and have a fixed optical framework. This paper presents a virtual fringe projection system with nonparallel illumination. Using an iterative method to calculate intersection points between rays and reference planes or object surfaces, the proposed system can simulate projected fringe patterns and captured images. A new explicit calibration method has been presented to validate the precision of the system. Simulated results indicate that the proposed iterative method outperforms previous systems. Our virtual system can be applied to error analysis, algorithm optimization, and help operators to find ideal system parameter settings for actual measurements.
Variable aperture-based ptychographical iterative engine method
NASA Astrophysics Data System (ADS)
Sun, Aihui; Kong, Yan; Meng, Xin; He, Xiaoliang; Du, Ruijun; Jiang, Zhilong; Liu, Fei; Xue, Liang; Wang, Shouyu; Liu, Cheng
2018-02-01
A variable aperture-based ptychographical iterative engine (vaPIE) is demonstrated both numerically and experimentally to reconstruct the sample phase and amplitude rapidly. By adjusting the size of a tiny aperture under the illumination of a parallel light beam to change the illumination on the sample step by step and recording the corresponding diffraction patterns sequentially, both the sample phase and amplitude can be faithfully reconstructed with a modified ptychographical iterative engine (PIE) algorithm. Since many fewer diffraction patterns are required than in common PIE and the shape, the size, and the position of the aperture need not to be known exactly, this proposed vaPIE method remarkably reduces the data acquisition time and makes PIE less dependent on the mechanical accuracy of the translation stage; therefore, the proposed technique can be potentially applied for various scientific researches.
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.
1989-01-01
Computer vision systems employ a sequence of vision algorithms in which the output of an algorithm is the input of the next algorithm in the sequence. Algorithms that constitute such systems exhibit vastly different computational characteristics, and therefore, require different data decomposition techniques and efficient load balancing techniques for parallel implementation. However, since the input data for a task is produced as the output data of the previous task, this information can be exploited to perform knowledge based data decomposition and load balancing. Presented here are algorithms for a motion estimation system. The motion estimation is based on the point correspondence between the involved images which are a sequence of stereo image pairs. Researchers propose algorithms to obtain point correspondences by matching feature points among stereo image pairs at any two consecutive time instants. Furthermore, the proposed algorithms employ non-iterative procedures, which results in saving considerable amounts of computation time. The system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from consecutive time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters.
Mitra, Ayan; Politte, David G; Whiting, Bruce R; Williamson, Jeffrey F; O'Sullivan, Joseph A
2017-01-01
Model-based image reconstruction (MBIR) techniques have the potential to generate high quality images from noisy measurements and a small number of projections which can reduce the x-ray dose in patients. These MBIR techniques rely on projection and backprojection to refine an image estimate. One of the widely used projectors for these modern MBIR based technique is called branchless distance driven (DD) projection and backprojection. While this method produces superior quality images, the computational cost of iterative updates keeps it from being ubiquitous in clinical applications. In this paper, we provide several new parallelization ideas for concurrent execution of the DD projectors in multi-GPU systems using CUDA programming tools. We have introduced some novel schemes for dividing the projection data and image voxels over multiple GPUs to avoid runtime overhead and inter-device synchronization issues. We have also reduced the complexity of overlap calculation of the algorithm by eliminating the common projection plane and directly projecting the detector boundaries onto image voxel boundaries. To reduce the time required for calculating the overlap between the detector edges and image voxel boundaries, we have proposed a pre-accumulation technique to accumulate image intensities in perpendicular 2D image slabs (from a 3D image) before projection and after backprojection to ensure our DD kernels run faster in parallel GPU threads. For the implementation of our iterative MBIR technique we use a parallel multi-GPU version of the alternating minimization (AM) algorithm with penalized likelihood update. The time performance using our proposed reconstruction method with Siemens Sensation 16 patient scan data shows an average of 24 times speedup using a single TITAN X GPU and 74 times speedup using 3 TITAN X GPUs in parallel for combined projection and backprojection.
Myria: Scalable Analytics as a Service
NASA Astrophysics Data System (ADS)
Howe, B.; Halperin, D.; Whitaker, A.
2014-12-01
At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from irrelevant effort technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Watson, Willie R. (Technical Monitor)
2005-01-01
The overall objectives of this research work are to formulate and validate efficient parallel algorithms, and to efficiently design/implement computer software for solving large-scale acoustic problems, arised from the unified frameworks of the finite element procedures. The adopted parallel Finite Element (FE) Domain Decomposition (DD) procedures should fully take advantages of multiple processing capabilities offered by most modern high performance computing platforms for efficient parallel computation. To achieve this objective. the formulation needs to integrate efficient sparse (and dense) assembly techniques, hybrid (or mixed) direct and iterative equation solvers, proper pre-conditioned strategies, unrolling strategies, and effective processors' communicating schemes. Finally, the numerical performance of the developed parallel finite element procedures will be evaluated by solving series of structural, and acoustic (symmetrical and un-symmetrical) problems (in different computing platforms). Comparisons with existing "commercialized" and/or "public domain" software are also included, whenever possible.
NASA Technical Reports Server (NTRS)
Nguyen, D. T.; Al-Nasra, M.; Zhang, Y.; Baddourah, M. A.; Agarwal, T. K.; Storaasli, O. O.; Carmona, E. A.
1991-01-01
Several parallel-vector computational improvements to the unconstrained optimization procedure are described which speed up the structural analysis-synthesis process. A fast parallel-vector Choleski-based equation solver, pvsolve, is incorporated into the well-known SAP-4 general-purpose finite-element code. The new code, denoted PV-SAP, is tested for static structural analysis. Initial results on a four processor CRAY 2 show that using pvsolve reduces the equation solution time by a factor of 14-16 over the original SAP-4 code. In addition, parallel-vector procedures for the Golden Block Search technique and the BFGS method are developed and tested for nonlinear unconstrained optimization. A parallel version of an iterative solver and the pvsolve direct solver are incorporated into the BFGS method. Preliminary results on nonlinear unconstrained optimization test problems, using pvsolve in the analysis, show excellent parallel-vector performance indicating that these parallel-vector algorithms can be used in a new generation of finite-element based structural design/analysis-synthesis codes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chatterjee, Kausik, E-mail: kausik.chatterjee@aggiemail.usu.edu; Center for Atmospheric and Space Sciences, Utah State University, Logan, UT 84322; Roadcap, John R., E-mail: john.roadcap@us.af.mil
The objective of this paper is the exposition of a recently-developed, novel Green's function Monte Carlo (GFMC) algorithm for the solution of nonlinear partial differential equations and its application to the modeling of the plasma sheath region around a cylindrical conducting object, carrying a potential and moving at low speeds through an otherwise neutral medium. The plasma sheath is modeled in equilibrium through the GFMC solution of the nonlinear Poisson–Boltzmann (NPB) equation. The traditional Monte Carlo based approaches for the solution of nonlinear equations are iterative in nature, involving branching stochastic processes which are used to calculate linear functionals ofmore » the solution of nonlinear integral equations. Over the last several years, one of the authors of this paper, K. Chatterjee has been developing a philosophically-different approach, where the linearization of the equation of interest is not required and hence there is no need for iteration and the simulation of branching processes. Instead, an approximate expression for the Green's function is obtained using perturbation theory, which is used to formulate the random walk equations within the problem sub-domains where the random walker makes its walks. However, as a trade-off, the dimensions of these sub-domains have to be restricted by the limitations imposed by perturbation theory. The greatest advantage of this approach is the ease and simplicity of parallelization stemming from the lack of the need for iteration, as a result of which the parallelization procedure is identical to the parallelization procedure for the GFMC solution of a linear problem. The application area of interest is in the modeling of the communication breakdown problem during a space vehicle's re-entry into the atmosphere. However, additional application areas are being explored in the modeling of electromagnetic propagation through the atmosphere/ionosphere in UHF/GPS applications.« less
NASA Astrophysics Data System (ADS)
Chatterjee, Kausik; Roadcap, John R.; Singh, Surendra
2014-11-01
The objective of this paper is the exposition of a recently-developed, novel Green's function Monte Carlo (GFMC) algorithm for the solution of nonlinear partial differential equations and its application to the modeling of the plasma sheath region around a cylindrical conducting object, carrying a potential and moving at low speeds through an otherwise neutral medium. The plasma sheath is modeled in equilibrium through the GFMC solution of the nonlinear Poisson-Boltzmann (NPB) equation. The traditional Monte Carlo based approaches for the solution of nonlinear equations are iterative in nature, involving branching stochastic processes which are used to calculate linear functionals of the solution of nonlinear integral equations. Over the last several years, one of the authors of this paper, K. Chatterjee has been developing a philosophically-different approach, where the linearization of the equation of interest is not required and hence there is no need for iteration and the simulation of branching processes. Instead, an approximate expression for the Green's function is obtained using perturbation theory, which is used to formulate the random walk equations within the problem sub-domains where the random walker makes its walks. However, as a trade-off, the dimensions of these sub-domains have to be restricted by the limitations imposed by perturbation theory. The greatest advantage of this approach is the ease and simplicity of parallelization stemming from the lack of the need for iteration, as a result of which the parallelization procedure is identical to the parallelization procedure for the GFMC solution of a linear problem. The application area of interest is in the modeling of the communication breakdown problem during a space vehicle's re-entry into the atmosphere. However, additional application areas are being explored in the modeling of electromagnetic propagation through the atmosphere/ionosphere in UHF/GPS applications.
Phase Reconstruction from FROG Using Genetic Algorithms[Frequency-Resolved Optical Gating
DOE Office of Scientific and Technical Information (OSTI.GOV)
Omenetto, F.G.; Nicholson, J.W.; Funk, D.J.
1999-04-12
The authors describe a new technique for obtaining the phase and electric field from FROG measurements using genetic algorithms. Frequency-Resolved Optical Gating (FROG) has gained prominence as a technique for characterizing ultrashort pulses. FROG consists of a spectrally resolved autocorrelation of the pulse to be measured. Typically a combination of iterative algorithms is used, applying constraints from experimental data, and alternating between the time and frequency domain, in order to retrieve an optical pulse. The authors have developed a new approach to retrieving the intensity and phase from FROG data using a genetic algorithm (GA). A GA is a generalmore » parallel search technique that operates on a population of potential solutions simultaneously. Operators in a genetic algorithm, such as crossover, selection, and mutation are based on ideas taken from evolution.« less
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods
Smith, David S.; Gore, John C.; Yankeelov, Thomas E.; Welch, E. Brian
2012-01-01
Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 40962 or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 10242 and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images. PMID:22481908
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods.
Smith, David S; Gore, John C; Yankeelov, Thomas E; Welch, E Brian
2012-01-01
Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 4096(2) or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 1024(2) and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images.
A high-performance spatial database based approach for pathology imaging algorithm evaluation
Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.
2013-01-01
Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared nothing parallel database architecture, which distributes data homogenously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provide a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
Parareal in time 3D numerical solver for the LWR Benchmark neutron diffusion transient model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baudron, Anne-Marie, E-mail: anne-marie.baudron@cea.fr; CEA-DRN/DMT/SERMA, CEN-Saclay, 91191 Gif sur Yvette Cedex; Lautard, Jean-Jacques, E-mail: jean-jacques.lautard@cea.fr
2014-12-15
In this paper we present a time-parallel algorithm for the 3D neutrons calculation of a transient model in a nuclear reactor core. The neutrons calculation consists in numerically solving the time dependent diffusion approximation equation, which is a simplified transport equation. The numerical resolution is done with finite elements method based on a tetrahedral meshing of the computational domain, representing the reactor core, and time discretization is achieved using a θ-scheme. The transient model presents moving control rods during the time of the reaction. Therefore, cross-sections (piecewise constants) are taken into account by interpolations with respect to the velocity ofmore » the control rods. The parallelism across the time is achieved by an adequate use of the parareal in time algorithm to the handled problem. This parallel method is a predictor corrector scheme that iteratively combines the use of two kinds of numerical propagators, one coarse and one fine. Our method is made efficient by means of a coarse solver defined with large time step and fixed position control rods model, while the fine propagator is assumed to be a high order numerical approximation of the full model. The parallel implementation of our method provides a good scalability of the algorithm. Numerical results show the efficiency of the parareal method on large light water reactor transient model corresponding to the Langenbuch–Maurer–Werner benchmark.« less
Variable aperture-based ptychographical iterative engine method.
Sun, Aihui; Kong, Yan; Meng, Xin; He, Xiaoliang; Du, Ruijun; Jiang, Zhilong; Liu, Fei; Xue, Liang; Wang, Shouyu; Liu, Cheng
2018-02-01
A variable aperture-based ptychographical iterative engine (vaPIE) is demonstrated both numerically and experimentally to reconstruct the sample phase and amplitude rapidly. By adjusting the size of a tiny aperture under the illumination of a parallel light beam to change the illumination on the sample step by step and recording the corresponding diffraction patterns sequentially, both the sample phase and amplitude can be faithfully reconstructed with a modified ptychographical iterative engine (PIE) algorithm. Since many fewer diffraction patterns are required than in common PIE and the shape, the size, and the position of the aperture need not to be known exactly, this proposed vaPIE method remarkably reduces the data acquisition time and makes PIE less dependent on the mechanical accuracy of the translation stage; therefore, the proposed technique can be potentially applied for various scientific researches. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
2D Seismic Imaging of Elastic Parameters by Frequency Domain Full Waveform Inversion
NASA Astrophysics Data System (ADS)
Brossier, R.; Virieux, J.; Operto, S.
2008-12-01
Thanks to recent advances in parallel computing, full waveform inversion is today a tractable seismic imaging method to reconstruct physical parameters of the earth interior at different scales ranging from the near- surface to the deep crust. We present a massively parallel 2D frequency-domain full-waveform algorithm for imaging visco-elastic media from multi-component seismic data. The forward problem (i.e. the resolution of the frequency-domain 2D PSV elastodynamics equations) is based on low-order Discontinuous Galerkin (DG) method (P0 and/or P1 interpolations). Thanks to triangular unstructured meshes, the DG method allows accurate modeling of both body waves and surface waves in case of complex topography for a discretization of 10 to 15 cells per shear wavelength. The frequency-domain DG system is solved efficiently for multiple sources with the parallel direct solver MUMPS. The local inversion procedure (i.e. minimization of residuals between observed and computed data) is based on the adjoint-state method which allows to efficiently compute the gradient of the objective function. Applying the inversion hierarchically from the low frequencies to the higher ones defines a multiresolution imaging strategy which helps convergence towards the global minimum. In place of expensive Newton algorithm, the combined use of the diagonal terms of the approximate Hessian matrix and optimization algorithms based on quasi-Newton methods (Conjugate Gradient, LBFGS, ...) allows to improve the convergence of the iterative inversion. The distribution of forward problem solutions over processors driven by a mesh partitioning performed by METIS allows to apply most of the inversion in parallel. We shall present the main features of the parallel modeling/inversion algorithm, assess its scalability and illustrate its performances with realistic synthetic case studies.
NASA Astrophysics Data System (ADS)
Sourbier, F.; Operto, S.; Virieux, J.
2006-12-01
We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in fortran90 and use MPI for parallelism. The algorithm was applied to real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern-Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency-domain, solving the wave equation requires resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computer to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided in 3 main steps: a symbolic analysis step that performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and an estimation of the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accomodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, 2 simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute in parallel the gradient of the cost function. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communation. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non redondant shot and receiver position. The same strategy that the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 and 15 Hz were inverted. Tweny iterations per frequency were computed leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km corresponding to a finite-difference grid of 4201 x 1001 grid with a 25-m grid interval. The number of shot was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on 6 32-bits bi-processor nodes with 4 Gbytes of RAM memory per node when only the LU factorization is performed in parallel. Preliminary estimations of the time required to perform the inversion with the fully-parallelized code is 6 and 4 days using 20 and 50 processors respectively.
Value Iteration Adaptive Dynamic Programming for Optimal Control of Discrete-Time Nonlinear Systems.
Wei, Qinglai; Liu, Derong; Lin, Hanquan
2016-03-01
In this paper, a value iteration adaptive dynamic programming (ADP) algorithm is developed to solve infinite horizon undiscounted optimal control problems for discrete-time nonlinear systems. The present value iteration ADP algorithm permits an arbitrary positive semi-definite function to initialize the algorithm. A novel convergence analysis is developed to guarantee that the iterative value function converges to the optimal performance index function. Initialized by different initial functions, it is proven that the iterative value function will be monotonically nonincreasing, monotonically nondecreasing, or nonmonotonic and will converge to the optimum. In this paper, for the first time, the admissibility properties of the iterative control laws are developed for value iteration algorithms. It is emphasized that new termination criteria are established to guarantee the effectiveness of the iterative control laws. Neural networks are used to approximate the iterative value function and compute the iterative control law, respectively, for facilitating the implementation of the iterative ADP algorithm. Finally, two simulation examples are given to illustrate the performance of the present method.
Solving radiative transfer with line overlaps using Gauss-Seidel algorithms
NASA Astrophysics Data System (ADS)
Daniel, F.; Cernicharo, J.
2008-09-01
Context: The improvement in observational facilities requires refining the modelling of the geometrical structures of astrophysical objects. Nevertheless, for complex problems such as line overlap in molecules showing hyperfine structure, a detailed analysis still requires a large amount of computing time and thus, misinterpretation cannot be dismissed due to an undersampling of the whole space of parameters. Aims: We extend the discussion of the implementation of the Gauss-Seidel algorithm in spherical geometry and include the case of hyperfine line overlap. Methods: We first review the basics of the short characteristics method that is used to solve the radiative transfer equations. Details are given on the determination of the Lambda operator in spherical geometry. The Gauss-Seidel algorithm is then described and, by analogy to the plan-parallel case, we see how to introduce it in spherical geometry. Doing so requires some approximations in order to keep the algorithm competitive. Finally, line overlap effects are included. Results: The convergence speed of the algorithm is compared to the usual Jacobi iterative schemes. The gain in the number of iterations is typically factors of 2 and 4 for the two implementations made of the Gauss-Seidel algorithm. This is obtained despite the introduction of approximations in the algorithm. A comparison of results obtained with and without line overlaps for N2H^+, HCN, and HNC shows that the J=3-2 line intensities are significantly underestimated in models where line overlap is neglected.
GPU computing with Kaczmarz’s and other iterative algorithms for linear systems
Elble, Joseph M.; Sahinidis, Nikolaos V.; Vouzis, Panagiotis
2009-01-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are preformed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. PMID:20526446
Non-Cartesian Parallel Imaging Reconstruction
Wright, Katherine L.; Hamilton, Jesse I.; Griswold, Mark A.; Gulani, Vikas; Seiberlich, Nicole
2014-01-01
Non-Cartesian parallel imaging has played an important role in reducing data acquisition time in MRI. The use of non-Cartesian trajectories can enable more efficient coverage of k-space, which can be leveraged to reduce scan times. These trajectories can be undersampled to achieve even faster scan times, but the resulting images may contain aliasing artifacts. Just as Cartesian parallel imaging can be employed to reconstruct images from undersampled Cartesian data, non-Cartesian parallel imaging methods can mitigate aliasing artifacts by using additional spatial encoding information in the form of the non-homogeneous sensitivities of multi-coil phased arrays. This review will begin with an overview of non-Cartesian k-space trajectories and their sampling properties, followed by an in-depth discussion of several selected non-Cartesian parallel imaging algorithms. Three representative non-Cartesian parallel imaging methods will be described, including Conjugate Gradient SENSE (CG SENSE), non-Cartesian GRAPPA, and Iterative Self-Consistent Parallel Imaging Reconstruction (SPIRiT). After a discussion of these three techniques, several potential promising clinical applications of non-Cartesian parallel imaging will be covered. PMID:24408499
Real-time stereo matching using orthogonal reliability-based dynamic programming.
Gong, Minglun; Yang, Yee-Hong
2007-03-01
A novel algorithm is presented in this paper for estimating reliable stereo matches in real time. Based on the dynamic programming-based technique we previously proposed, the new algorithm can generate semi-dense disparity maps using as few as two dynamic programming passes. The iterative best path tracing process used in traditional dynamic programming is replaced by a local minimum searching process, making the algorithm suitable for parallel execution. Most computations are implemented on programmable graphics hardware, which improves the processing speed and makes real-time estimation possible. The experiments on the four new Middlebury stereo datasets show that, on an ATI Radeon X800 card, the presented algorithm can produce reliable matches for 60% approximately 80% of pixels at the rate of 10 approximately 20 frames per second. If needed, the algorithm can be configured for generating full density disparity maps.
Parallelization of the preconditioned IDR solver for modern multicore computer systems
NASA Astrophysics Data System (ADS)
Bessonov, O. A.; Fedoseyev, A. I.
2012-10-01
This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
NASA Astrophysics Data System (ADS)
Karimi, Hamed; Rosenberg, Gili; Katzgraber, Helmut G.
2017-10-01
We present and apply a general-purpose, multistart algorithm for improving the performance of low-energy samplers used for solving optimization problems. The algorithm iteratively fixes the value of a large portion of the variables to values that have a high probability of being optimal. The resulting problems are smaller and less connected, and samplers tend to give better low-energy samples for these problems. The algorithm is trivially parallelizable since each start in the multistart algorithm is independent, and could be applied to any heuristic solver that can be run multiple times to give a sample. We present results for several classes of hard problems solved using simulated annealing, path-integral quantum Monte Carlo, parallel tempering with isoenergetic cluster moves, and a quantum annealer, and show that the success metrics and the scaling are improved substantially. When combined with this algorithm, the quantum annealer's scaling was substantially improved for native Chimera graph problems. In addition, with this algorithm the scaling of the time to solution of the quantum annealer is comparable to the Hamze-de Freitas-Selby algorithm on the weak-strong cluster problems introduced by Boixo et al. Parallel tempering with isoenergetic cluster moves was able to consistently solve three-dimensional spin glass problems with 8000 variables when combined with our method, whereas without our method it could not solve any.
A Universal Tare Load Prediction Algorithm for Strain-Gage Balance Calibration Data Analysis
NASA Technical Reports Server (NTRS)
Ulbrich, N.
2011-01-01
An algorithm is discussed that may be used to estimate tare loads of wind tunnel strain-gage balance calibration data. The algorithm was originally developed by R. Galway of IAR/NRC Canada and has been described in the literature for the iterative analysis technique. Basic ideas of Galway's algorithm, however, are universally applicable and work for both the iterative and the non-iterative analysis technique. A recent modification of Galway's algorithm is presented that improves the convergence behavior of the tare load prediction process if it is used in combination with the non-iterative analysis technique. The modified algorithm allows an analyst to use an alternate method for the calculation of intermediate non-linear tare load estimates whenever Galway's original approach does not lead to a convergence of the tare load iterations. It is also shown in detail how Galway's algorithm may be applied to the non-iterative analysis technique. Hand load data from the calibration of a six-component force balance is used to illustrate the application of the original and modified tare load prediction method. During the analysis of the data both the iterative and the non-iterative analysis technique were applied. Overall, predicted tare loads for combinations of the two tare load prediction methods and the two balance data analysis techniques showed excellent agreement as long as the tare load iterations converged. The modified algorithm, however, appears to have an advantage over the original algorithm when absolute voltage measurements of gage outputs are processed using the non-iterative analysis technique. In these situations only the modified algorithm converged because it uses an exact solution of the intermediate non-linear tare load estimate for the tare load iteration.
Limits on the Efficiency of Event-Based Algorithms for Monte Carlo Neutron Transport
DOE Office of Scientific and Technical Information (OSTI.GOV)
Romano, Paul K.; Siegel, Andrew R.
The traditional form of parallelism in Monte Carlo particle transport simulations, wherein each individual particle history is considered a unit of work, does not lend itself well to data-level parallelism. Event-based algorithms, which were originally used for simulations on vector processors, may offer a path toward better utilizing data-level parallelism in modern computer architectures. In this study, a simple model is developed for estimating the efficiency of the event-based particle transport algorithm under two sets of assumptions. Data collected from simulations of four reactor problems using OpenMC was then used in conjunction with the models to calculate the speedup duemore » to vectorization as a function of the size of the particle bank and the vector width. When each event type is assumed to have constant execution time, the achievable speedup is directly related to the particle bank size. We observed that the bank size generally needs to be at least 20 times greater than vector size to achieve vector efficiency greater than 90%. Lastly, when the execution times for events are allowed to vary, the vector speedup is also limited by differences in execution time for events being carried out in a single event-iteration.« less
Limits on the Efficiency of Event-Based Algorithms for Monte Carlo Neutron Transport
Romano, Paul K.; Siegel, Andrew R.
2017-07-01
The traditional form of parallelism in Monte Carlo particle transport simulations, wherein each individual particle history is considered a unit of work, does not lend itself well to data-level parallelism. Event-based algorithms, which were originally used for simulations on vector processors, may offer a path toward better utilizing data-level parallelism in modern computer architectures. In this study, a simple model is developed for estimating the efficiency of the event-based particle transport algorithm under two sets of assumptions. Data collected from simulations of four reactor problems using OpenMC was then used in conjunction with the models to calculate the speedup duemore » to vectorization as a function of the size of the particle bank and the vector width. When each event type is assumed to have constant execution time, the achievable speedup is directly related to the particle bank size. We observed that the bank size generally needs to be at least 20 times greater than vector size to achieve vector efficiency greater than 90%. Lastly, when the execution times for events are allowed to vary, the vector speedup is also limited by differences in execution time for events being carried out in a single event-iteration.« less
GPU-based relative fuzzy connectedness image segmentation.
Zhuge, Ying; Ciesielski, Krzysztof C; Udupa, Jayaram K; Miller, Robert W
2013-01-01
Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. The most common FC segmentations, optimizing an [script-l](∞)-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
GPU-based relative fuzzy connectedness image segmentation
Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-01
Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094
Optical systolic array processor using residue arithmetic
NASA Technical Reports Server (NTRS)
Jackson, J.; Casasent, D.
1983-01-01
The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
Exploiting Data Sparsity in Parallel Matrix Powers Computations
2013-05-03
2013 Report Documentation Page Form ApprovedOMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour...matrices of the form A = D+USV H, where D is sparse and USV H has low rank but may be dense. Matrices of this form arise in many practical applications...methods numerical partial di erential equation solvers, and preconditioned iterative methods. If A has this form , our algorithm enables a communication
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chow, Edmond
Solving sparse problems is at the core of many DOE computational science applications. We focus on the challenge of developing sparse algorithms that can fully exploit the parallelism in extreme-scale computing systems, in particular systems with massive numbers of cores per node. Our approach is to express a sparse matrix factorization as a large number of bilinear constraint equations, and then solving these equations via an asynchronous iterative method. The unknowns in these equations are the matrix entries of the factorization that is desired.
Objective performance assessment of five computed tomography iterative reconstruction algorithms.
Omotayo, Azeez; Elbakri, Idris
2016-11-22
Iterative algorithms are gaining clinical acceptance in CT. We performed objective phantom-based image quality evaluation of five commercial iterative reconstruction algorithms available on four different multi-detector CT (MDCT) scanners at different dose levels as well as the conventional filtered back-projection (FBP) reconstruction. Using the Catphan500 phantom, we evaluated image noise, contrast-to-noise ratio (CNR), modulation transfer function (MTF) and noise-power spectrum (NPS). The algorithms were evaluated over a CTDIvol range of 0.75-18.7 mGy on four major MDCT scanners: GE DiscoveryCT750HD (algorithms: ASIR™ and VEO™); Siemens Somatom Definition AS+ (algorithm: SAFIRE™); Toshiba Aquilion64 (algorithm: AIDR3D™); and Philips Ingenuity iCT256 (algorithm: iDose4™). Images were reconstructed using FBP and the respective iterative algorithms on the four scanners. Use of iterative algorithms decreased image noise and increased CNR, relative to FBP. In the dose range of 1.3-1.5 mGy, noise reduction using iterative algorithms was in the range of 11%-51% on GE DiscoveryCT750HD, 10%-52% on Siemens Somatom Definition AS+, 49%-62% on Toshiba Aquilion64, and 13%-44% on Philips Ingenuity iCT256. The corresponding CNR increase was in the range 11%-105% on GE, 11%-106% on Siemens, 85%-145% on Toshiba and 13%-77% on Philips respectively. Most algorithms did not affect the MTF, except for VEO™ which produced an increase in the limiting resolution of up to 30%. A shift in the peak of the NPS curve towards lower frequencies and a decrease in NPS amplitude were obtained with all iterative algorithms. VEO™ required long reconstruction times, while all other algorithms produced reconstructions in real time. Compared to FBP, iterative algorithms reduced image noise and increased CNR. The iterative algorithms available on different scanners achieved different levels of noise reduction and CNR increase while spatial resolution improvements were obtained only with VEO™. This study is useful in that it provides performance assessment of the iterative algorithms available from several mainstream CT manufacturers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bai, T; UT Southwestern Medical Center, Dallas, TX; Yan, H
2014-06-15
Purpose: To develop a 3D dictionary learning based statistical reconstruction algorithm on graphic processing units (GPU), to improve the quality of low-dose cone beam CT (CBCT) imaging with high efficiency. Methods: A 3D dictionary containing 256 small volumes (atoms) of 3x3x3 voxels was trained from a high quality volume image. During reconstruction, we utilized a Cholesky decomposition based orthogonal matching pursuit algorithm to find a sparse representation on this dictionary basis of each patch in the reconstructed image, in order to regularize the image quality. To accelerate the time-consuming sparse coding in the 3D case, we implemented our algorithm inmore » a parallel fashion by taking advantage of the tremendous computational power of GPU. Evaluations are performed based on a head-neck patient case. FDK reconstruction with full dataset of 364 projections is used as the reference. We compared the proposed 3D dictionary learning based method with a tight frame (TF) based one using a subset data of 121 projections. The image qualities under different resolutions in z-direction, with or without statistical weighting are also studied. Results: Compared to the TF-based CBCT reconstruction, our experiments indicated that 3D dictionary learning based CBCT reconstruction is able to recover finer structures, to remove more streaking artifacts, and is less susceptible to blocky artifacts. It is also observed that statistical reconstruction approach is sensitive to inconsistency between the forward and backward projection operations in parallel computing. Using high a spatial resolution along z direction helps improving the algorithm robustness. Conclusion: 3D dictionary learning based CBCT reconstruction algorithm is able to sense the structural information while suppressing noise, and hence to achieve high quality reconstruction. The GPU realization of the whole algorithm offers a significant efficiency enhancement, making this algorithm more feasible for potential clinical application. A high zresolution is preferred to stabilize statistical iterative reconstruction. This work was supported in part by NIH(1R01CA154747-01), NSFC((No. 61172163), Research Fund for the Doctoral Program of Higher Education of China (No. 20110201110011), China Scholarship Council.« less
Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding.
Palkowski, Marek; Bielecki, Wlodzimierz
2018-01-15
RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithms is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov's recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov's RNA folding. Such techniques are within the iteration space slicing framework - the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem at generating parallel tiled code is defining a proper tile size and tile dimension which impact parallelism degree and code locality. To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (parameters are variables defining tile size). With this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure and then derive a general affine model, which describes all integer factors available in expressions of those codes. Using this model and known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find unknown integers in this model for each integer factor available in the same fixed tiled code position and replace in this code expressions, including integer factors, with those including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover in a given search space the best tile size and tile dimension maximizing target code performance. For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov's RNA folding. Experimental results, received on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is bigger than 2500.
NASA Astrophysics Data System (ADS)
Chacon, L.; Finn, J. M.; Knoll, D. A.
2000-10-01
Recently, a new parallel velocity instability has been found.(J. M. Finn, Phys. Plasmas), 2, 12 (1995) This mode is a tearing mode driven unstable by curvature effects and sound wave coupling in the presence of parallel velocity shear. Under such conditions, linear theory predicts that tearing instabilities will grow even in situations in which the classical tearing mode is stable. This could then be a viable seed mechanism for the neoclassical tearing mode, and hence a non-linear study is of interest. Here, the linear and non-linear stages of this instability are explored using a fully implicit, fully nonlinear 2D reduced resistive MHD code,(L. Chacon et al), ``Implicit, Jacobian-free Newton-Krylov 2D reduced resistive MHD nonlinear solver,'' submitted to J. Comput. Phys. (2000) including viscosity and particle transport effects. The nonlinear implicit time integration is performed using the Newton-Raphson iterative algorithm. Krylov iterative techniques are employed for the required algebraic matrix inversions, implemented Jacobian-free (i.e., without ever forming and storing the Jacobian matrix), and preconditioned with a ``physics-based'' preconditioner. Nonlinear results indicate that, for large total plasma beta and large parallel velocity shear, the instability results in the generation of large poloidal shear flows and large magnetic islands even in regimes when the classical tearing mode is absolutely stable. For small viscosity, the time asymptotic state can be turbulent.
SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP model.nc?varname, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our efforts to integrate climate science algorithms in Python and Scala, parallel ingest and partitioning of A-Train satellite observations from HDF files and model grids from netCDF files, first parallel runs to compute comparison statistics and PDF's, and first metrics quantifying parallel speedups and memory & disk usage.
A fast method to emulate an iterative POCS image reconstruction algorithm.
Zeng, Gengsheng L
2017-10-01
Iterative image reconstruction algorithms are commonly used to optimize an objective function, especially when the objective function is nonquadratic. Generally speaking, the iterative algorithms are computationally inefficient. This paper presents a fast algorithm that has one backprojection and no forward projection. This paper derives a new method to solve an optimization problem. The nonquadratic constraint, for example, an edge-preserving denoising constraint is implemented as a nonlinear filter. The algorithm is derived based on the POCS (projections onto projections onto convex sets) approach. A windowed FBP (filtered backprojection) algorithm enforces the data fidelity. An iterative procedure, divided into segments, enforces edge-enhancement denoising. Each segment performs nonlinear filtering. The derived iterative algorithm is computationally efficient. It contains only one backprojection and no forward projection. Low-dose CT data are used for algorithm feasibility studies. The nonlinearity is implemented as an edge-enhancing noise-smoothing filter. The patient studies results demonstrate its effectiveness in processing low-dose x ray CT data. This fast algorithm can be used to replace many iterative algorithms. © 2017 American Association of Physicists in Medicine.
Convergence Results on Iteration Algorithms to Linear Systems
Wang, Zhuande; Yang, Chuansheng; Yuan, Yubo
2014-01-01
In order to solve the large scale linear systems, backward and Jacobi iteration algorithms are employed. The convergence is the most important issue. In this paper, a unified backward iterative matrix is proposed. It shows that some well-known iterative algorithms can be deduced with it. The most important result is that the convergence results have been proved. Firstly, the spectral radius of the Jacobi iterative matrix is positive and the one of backward iterative matrix is strongly positive (lager than a positive constant). Secondly, the mentioned two iterations have the same convergence results (convergence or divergence simultaneously). Finally, some numerical experiments show that the proposed algorithms are correct and have the merit of backward methods. PMID:24991640
Study of phase clustering method for analyzing large volumes of meteorological observation data
NASA Astrophysics Data System (ADS)
Volkov, Yu. V.; Krutikov, V. A.; Botygin, I. A.; Sherstnev, V. S.; Sherstneva, A. I.
2017-11-01
The article describes an iterative parallel phase grouping algorithm for temperature field classification. The algorithm is based on modified method of structure forming by using analytic signal. The developed method allows to solve tasks of climate classification as well as climatic zoning for any time or spatial scale. When used to surface temperature measurement series, the developed algorithm allows to find climatic structures with correlated changes of temperature field, to make conclusion on climate uniformity in a given area and to overview climate changes over time by analyzing offset in type groups. The information on climate type groups specific for selected geographical areas is expanded by genetic scheme of class distribution depending on change in mutual correlation level between ground temperature monthly average.
GPU-based Branchless Distance-Driven Projection and Backprojection
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-01-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm. PMID:29333480
GPU-based Branchless Distance-Driven Projection and Backprojection.
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-12-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to be implemented on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behaviors can be eliminated by factorizing the DD operation as three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved 137-fold speedup for forward projection and 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. GPU based branchless DD method was evaluated by iterative reconstruction algorithms with both simulation and real datasets. It obtained visually identical images as the CPU reference algorithm.
Discrete-Time Deterministic $Q$ -Learning: A Novel Convergence Analysis.
Wei, Qinglai; Lewis, Frank L; Sun, Qiuye; Yan, Pengfei; Song, Ruizhuo
2017-05-01
In this paper, a novel discrete-time deterministic Q -learning algorithm is developed. In each iteration of the developed Q -learning algorithm, the iterative Q function is updated for all the state and control spaces, instead of updating for a single state and a single control in traditional Q -learning algorithm. A new convergence criterion is established to guarantee that the iterative Q function converges to the optimum, where the convergence criterion of the learning rates for traditional Q -learning algorithms is simplified. During the convergence analysis, the upper and lower bounds of the iterative Q function are analyzed to obtain the convergence criterion, instead of analyzing the iterative Q function itself. For convenience of analysis, the convergence properties for undiscounted case of the deterministic Q -learning algorithm are first developed. Then, considering the discounted factor, the convergence criterion for the discounted case is established. Neural networks are used to approximate the iterative Q function and compute the iterative control law, respectively, for facilitating the implementation of the deterministic Q -learning algorithm. Finally, simulation results and comparisons are given to illustrate the performance of the developed algorithm.
Xu, Q; Yang, D; Tan, J; Anastasio, M
2012-06-01
To improve image quality and reduce imaging dose in CBCT for radiation therapy applications and to realize near real-time image reconstruction based on use of a fast convergence iterative algorithm and acceleration by multi-GPUs. An iterative image reconstruction that sought to minimize a weighted least squares cost function that employed total variation (TV) regularization was employed to mitigate projection data incompleteness and noise. To achieve rapid 3D image reconstruction (< 1 min), a highly optimized multiple-GPU implementation of the algorithm was developed. The convergence rate and reconstruction accuracy were evaluated using a modified 3D Shepp-Logan digital phantom and a Catphan-600 physical phantom. The reconstructed images were compared with the clinical FDK reconstruction results. Digital phantom studies showed that only 15 iterations and 60 iterations are needed to achieve algorithm convergence for 360-view and 60-view cases, respectively. The RMSE was reduced to 10-4 and 10-2, respectively, by using 15 iterations for each case. Our algorithm required 5.4s to complete one iteration for the 60-view case using one Tesla C2075 GPU. The few-view study indicated that our iterative algorithm has great potential to reduce the imaging dose and preserve good image quality. For the physical Catphan studies, the images obtained from the iterative algorithm possessed better spatial resolution and higher SNRs than those obtained from by use of a clinical FDK reconstruction algorithm. We have developed a fast convergence iterative algorithm for CBCT image reconstruction. The developed algorithm yielded images with better spatial resolution and higher SNR than those produced by a commercial FDK tool. In addition, from the few-view study, the iterative algorithm has shown great potential for significantly reducing imaging dose. We expect that the developed reconstruction approach will facilitate applications including IGART and patient daily CBCT-based treatment localization. © 2012 American Association of Physicists in Medicine.
Hierarchical Image Segmentation of Remotely Sensed Data using Massively Parallel GNU-LINUX Software
NASA Technical Reports Server (NTRS)
Tilton, James C.
2003-01-01
A hierarchical set of image segmentations is a set of several image segmentations of the same image at different levels of detail in which the segmentations at coarser levels of detail can be produced from simple merges of regions at finer levels of detail. In [1], Tilton, et a1 describes an approach for producing hierarchical segmentations (called HSEG) and gave a progress report on exploiting these hierarchical segmentations for image information mining. The HSEG algorithm is a hybrid of region growing and constrained spectral clustering that produces a hierarchical set of image segmentations based on detected convergence points. In the main, HSEG employs the hierarchical stepwise optimization (HSWO) approach to region growing, which was described as early as 1989 by Beaulieu and Goldberg. The HSWO approach seeks to produce segmentations that are more optimized than those produced by more classic approaches to region growing (e.g. Horowitz and T. Pavlidis, [3]). In addition, HSEG optionally interjects between HSWO region growing iterations, merges between spatially non-adjacent regions (i.e., spectrally based merging or clustering) constrained by a threshold derived from the previous HSWO region growing iteration. While the addition of constrained spectral clustering improves the utility of the segmentation results, especially for larger images, it also significantly increases HSEG s computational requirements. To counteract this, a computationally efficient recursive, divide-and-conquer, implementation of HSEG (RHSEG) was devised, which includes special code to avoid processing artifacts caused by RHSEG s recursive subdivision of the image data. The recursive nature of RHSEG makes for a straightforward parallel implementation. This paper describes the HSEG algorithm, its recursive formulation (referred to as RHSEG), and the implementation of RHSEG using massively parallel GNU-LINUX software. Results with Landsat TM data are included comparing RHSEG with classic region growing.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De; ...
2017-01-28
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bicer, Tekin; Gursoy, Doga; Andrade, Vincent De
Here, synchrotron light source and detector technologies enable scientists to perform advanced experiments. These scientific instruments and experiments produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used data acquisition technique at light sources is Computed Tomography, which can generate tens of GB/s depending on x-ray range. A large-scale tomographic dataset, such as mouse brain, may require hours of computation time with a medium size workstation. In this paper, we present Trace, a data-intensive computing middleware we developed for implementation and parallelization of iterative tomographic reconstruction algorithms. Tracemore » provides fine-grained reconstruction of tomography datasets using both (thread level) shared memory and (process level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations we have done on the replicated reconstruction objects and evaluate them using a shale and a mouse brain sinogram. Our experimental evaluations show that the applied optimizations and parallelization techniques can provide 158x speedup (using 32 compute nodes) over single core configuration, which decreases the reconstruction time of a sinogram (with 4501 projections and 22400 detector resolution) from 12.5 hours to less than 5 minutes per iteration.« less
Craciun, Stefan; Brockmeier, Austin J; George, Alan D; Lam, Herman; Príncipe, José C
2011-01-01
Methods for decoding movements from neural spike counts using adaptive filters often rely on minimizing the mean-squared error. However, for non-Gaussian distribution of errors, this approach is not optimal for performance. Therefore, rather than using probabilistic modeling, we propose an alternate non-parametric approach. In order to extract more structure from the input signal (neuronal spike counts) we propose using minimum error entropy (MEE), an information-theoretic approach that minimizes the error entropy as part of an iterative cost function. However, the disadvantage of using MEE as the cost function for adaptive filters is the increase in computational complexity. In this paper we present a comparison between the decoding performance of the analytic Wiener filter and a linear filter trained with MEE, which is then mapped to a parallel architecture in reconfigurable hardware tailored to the computational needs of the MEE filter. We observe considerable speedup from the hardware design. The adaptation of filter weights for the multiple-input, multiple-output linear filters, necessary in motor decoding, is a highly parallelizable algorithm. It can be decomposed into many independent computational blocks with a parallel architecture readily mapped to a field-programmable gate array (FPGA) and scales to large numbers of neurons. By pipelining and parallelizing independent computations in the algorithm, the proposed parallel architecture has sublinear increases in execution time with respect to both window size and filter order.
Recursive inverse factorization.
Rubensson, Emanuel H; Bock, Nicolas; Holmström, Erik; Niklasson, Anders M N
2008-03-14
A recursive algorithm for the inverse factorization S(-1)=ZZ(*) of Hermitian positive definite matrices S is proposed. The inverse factorization is based on iterative refinement [A.M.N. Niklasson, Phys. Rev. B 70, 193102 (2004)] combined with a recursive decomposition of S. As the computational kernel is matrix-matrix multiplication, the algorithm can be parallelized and the computational effort increases linearly with system size for systems with sufficiently sparse matrices. Recent advances in network theory are used to find appropriate recursive decompositions. We show that optimization of the so-called network modularity results in an improved partitioning compared to other approaches. In particular, when the recursive inverse factorization is applied to overlap matrices of irregularly structured three-dimensional molecules.
Approximate Computing Techniques for Iterative Graph Algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Panyala, Ajay R.; Subasi, Omer; Halappanavar, Mahantesh
Approximate computing enables processing of large-scale graphs by trading off quality for performance. Approximate computing techniques have become critical not only due to the emergence of parallel architectures but also the availability of large scale datasets enabling data-driven discovery. Using two prototypical graph algorithms, PageRank and community detection, we present several approximate computing heuristics to scale the performance with minimal loss of accuracy. We present several heuristics including loop perforation, data caching, incomplete graph coloring and synchronization, and evaluate their efficiency. We demonstrate performance improvements of up to 83% for PageRank and up to 450x for community detection, with lowmore » impact of accuracy for both the algorithms. We expect the proposed approximate techniques will enable scalable graph analytics on data of importance to several applications in science and their subsequent adoption to scale similar graph algorithms.« less
Single-agent parallel window search
NASA Technical Reports Server (NTRS)
Powley, Curt; Korf, Richard E.
1991-01-01
Parallel window search is applied to single-agent problems by having different processes simultaneously perform iterations of Iterative-Deepening-A(asterisk) (IDA-asterisk) on the same problem but with different cost thresholds. This approach is limited by the time to perform the goal iteration. To overcome this disadvantage, the authors consider node ordering. They discuss how global node ordering by minimum h among nodes with equal f = g + h values can reduce the time complexity of serial IDA-asterisk by reducing the time to perform the iterations prior to the goal iteration. Finally, the two ideas of parallel window search and node ordering are combined to eliminate the weaknesses of each approach while retaining the strengths. The resulting approach, called simply parallel window search, can be used to find a near-optimal solution quickly, improve the solution until it is optimal, and then finally guarantee optimality, depending on the amount of time available.
NASA Technical Reports Server (NTRS)
Nguyen, Duc T.
1990-01-01
Practical engineering application can often be formulated in the form of a constrained optimization problem. There are several solution algorithms for solving a constrained optimization problem. One approach is to convert a constrained problem into a series of unconstrained problems. Furthermore, unconstrained solution algorithms can be used as part of the constrained solution algorithms. Structural optimization is an iterative process where one starts with an initial design, a finite element structure analysis is then performed to calculate the response of the system (such as displacements, stresses, eigenvalues, etc.). Based upon the sensitivity information on the objective and constraint functions, an optimizer such as ADS or IDESIGN, can be used to find the new, improved design. For the structural analysis phase, the equation solver for the system of simultaneous, linear equations plays a key role since it is needed for either static, or eigenvalue, or dynamic analysis. For practical, large-scale structural analysis-synthesis applications, computational time can be excessively large. Thus, it is necessary to have a new structural analysis-synthesis code which employs new solution algorithms to exploit both parallel and vector capabilities offered by modern, high performance computers such as the Convex, Cray-2 and Cray-YMP computers. The objective of this research project is, therefore, to incorporate the latest development in the parallel-vector equation solver, PVSOLVE into the widely popular finite-element production code, such as the SAP-4. Furthermore, several nonlinear unconstrained optimization subroutines have also been developed and tested under a parallel computer environment. The unconstrained optimization subroutines are not only useful in their own right, but they can also be incorporated into a more popular constrained optimization code, such as ADS.
NASA Astrophysics Data System (ADS)
Cao, Jingtai; Zhao, Xiaohui; Li, Zhaokun; Liu, Wei; Gu, Haijun
2017-11-01
The performance of free space optical (FSO) communication system is limited by atmospheric turbulent extremely. Adaptive optics (AO) is the significant method to overcome the atmosphere disturbance. Especially, for the strong scintillation effect, the sensor-less AO system plays a major role for compensation. In this paper, a modified artificial fish school (MAFS) algorithm is proposed to compensate the aberrations in the sensor-less AO system. Both the static and dynamic aberrations compensations are analyzed and the performance of FSO communication before and after aberrations compensations is compared. In addition, MAFS algorithm is compared with artificial fish school (AFS) algorithm, stochastic parallel gradient descent (SPGD) algorithm and simulated annealing (SA) algorithm. It is shown that the MAFS algorithm has a higher convergence speed than SPGD algorithm and SA algorithm, and reaches the better convergence value than AFS algorithm, SPGD algorithm and SA algorithm. The sensor-less AO system with MAFS algorithm effectively increases the coupling efficiency at the receiving terminal with fewer numbers of iterations. In conclusion, the MAFS algorithm has great significance for sensor-less AO system to compensate atmospheric turbulence in FSO communication system.
Zhang, Huaguang; Song, Ruizhuo; Wei, Qinglai; Zhang, Tieyan
2011-12-01
In this paper, a novel heuristic dynamic programming (HDP) iteration algorithm is proposed to solve the optimal tracking control problem for a class of nonlinear discrete-time systems with time delays. The novel algorithm contains state updating, control policy iteration, and performance index iteration. To get the optimal states, the states are also updated. Furthermore, the "backward iteration" is applied to state updating. Two neural networks are used to approximate the performance index function and compute the optimal control policy for facilitating the implementation of HDP iteration algorithm. At last, we present two examples to demonstrate the effectiveness of the proposed HDP iteration algorithm.
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
NASA Technical Reports Server (NTRS)
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for highend computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language that is, a programming language that is closer to a human s way of thinking than to a machine s. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequen tial/singlecore code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify- Produce (CSP) and Normalize-Trans - pose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less costly than development of comparable parallel code. Moreover, SequenceL not only automatically parallelizes the code, but since it is based on CSP-NT, it is provably race free, thus eliminating the largest quality challenge the parallelized software developer faces.
Automation of a Wave-Optics Simulation and Image Post-Processing Package on Riptide
NASA Astrophysics Data System (ADS)
Werth, M.; Lucas, J.; Thompson, D.; Abercrombie, M.; Holmes, R.; Roggemann, M.
Detailed wave-optics simulations and image post-processing algorithms are computationally expensive and benefit from the massively parallel hardware available at supercomputing facilities. We created an automated system that interfaces with the Maui High Performance Computing Center (MHPCC) Distributed MATLAB® Portal interface to submit massively parallel waveoptics simulations to the IBM iDataPlex (Riptide) supercomputer. This system subsequently postprocesses the output images with an improved version of physically constrained iterative deconvolution (PCID) and analyzes the results using a series of modular algorithms written in Python. With this architecture, a single person can simulate thousands of unique scenarios and produce analyzed, archived, and briefing-compatible output products with very little effort. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Nested Conjugate Gradient Algorithm with Nested Preconditioning for Non-linear Image Restoration.
Skariah, Deepak G; Arigovindan, Muthuvel
2017-06-19
We develop a novel optimization algorithm, which we call Nested Non-Linear Conjugate Gradient algorithm (NNCG), for image restoration based on quadratic data fitting and smooth non-quadratic regularization. The algorithm is constructed as a nesting of two conjugate gradient (CG) iterations. The outer iteration is constructed as a preconditioned non-linear CG algorithm; the preconditioning is performed by the inner CG iteration that is linear. The inner CG iteration, which performs preconditioning for outer CG iteration, itself is accelerated by an another FFT based non-iterative preconditioner. We prove that the method converges to a stationary point for both convex and non-convex regularization functionals. We demonstrate experimentally that proposed method outperforms the well-known majorization-minimization method used for convex regularization, and a non-convex inertial-proximal method for non-convex regularization functional.
Investigation of iterative image reconstruction in three-dimensional optoacoustic tomography
Wang, Kun; Su, Richard; Oraevsky, Alexander A; Anastasio, Mark A
2012-01-01
Iterative image reconstruction algorithms for optoacoustic tomography (OAT), also known as photoacoustic tomography, have the ability to improve image quality over analytic algorithms due to their ability to incorporate accurate models of the imaging physics, instrument response, and measurement noise. However, to date, there have been few reported attempts to employ advanced iterative image reconstruction algorithms for improving image quality in three-dimensional (3D) OAT. In this work, we implement and investigate two iterative image reconstruction methods for use with a 3D OAT small animal imager: namely, a penalized least-squares (PLS) method employing a quadratic smoothness penalty and a PLS method employing a total variation norm penalty. The reconstruction algorithms employ accurate models of the ultrasonic transducer impulse responses. Experimental data sets are employed to compare the performances of the iterative reconstruction algorithms to that of a 3D filtered backprojection (FBP) algorithm. By use of quantitative measures of image quality, we demonstrate that the iterative reconstruction algorithms can mitigate image artifacts and preserve spatial resolution more effectively than FBP algorithms. These features suggest that the use of advanced image reconstruction algorithms can improve the effectiveness of 3D OAT while reducing the amount of data required for biomedical applications. PMID:22864062
Angelis, G I; Reader, A J; Markiewicz, P J; Kotasidis, F A; Lionheart, W R; Matthews, J C
2013-08-07
Recent studies have demonstrated the benefits of a resolution model within iterative reconstruction algorithms in an attempt to account for effects that degrade the spatial resolution of the reconstructed images. However, these algorithms suffer from slower convergence rates, compared to algorithms where no resolution model is used, due to the additional need to solve an image deconvolution problem. In this paper, a recently proposed algorithm, which decouples the tomographic and image deconvolution problems within an image-based expectation maximization (EM) framework, was evaluated. This separation is convenient, because more computational effort can be placed on the image deconvolution problem and therefore accelerate convergence. Since the computational cost of solving the image deconvolution problem is relatively small, multiple image-based EM iterations do not significantly increase the overall reconstruction time. The proposed algorithm was evaluated using 2D simulations, as well as measured 3D data acquired on the high-resolution research tomograph. Results showed that bias reduction can be accelerated by interleaving multiple iterations of the image-based EM algorithm solving the resolution model problem, with a single EM iteration solving the tomographic problem. Significant improvements were observed particularly for voxels that were located on the boundaries between regions of high contrast within the object being imaged and for small regions of interest, where resolution recovery is usually more challenging. Minor differences were observed using the proposed nested algorithm, compared to the single iteration normally performed, when an optimal number of iterations are performed for each algorithm. However, using the proposed nested approach convergence is significantly accelerated enabling reconstruction using far fewer tomographic iterations (up to 70% fewer iterations for small regions). Nevertheless, the optimal number of nested image-based EM iterations is hard to be defined and it should be selected according to the given application.
Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy.
Zelyak, O; Fallone, B G; St-Aubin, J
2017-12-14
Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low-density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation.
Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy
NASA Astrophysics Data System (ADS)
Zelyak, O.; Fallone, B. G.; St-Aubin, J.
2018-01-01
Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low-density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation.
Corrigendum to "Stability analysis of a deterministic dose calculation for MRI-guided radiotherapy".
Zelyak, Oleksandr; Fallone, B Gino; St-Aubin, Joel
2018-03-12
Modern effort in radiotherapy to address the challenges of tumor localization and motion has led to the development of MRI guided radiotherapy technologies. Accurate dose calculations must properly account for the effects of the MRI magnetic fields. Previous work has investigated the accuracy of a deterministic linear Boltzmann transport equation (LBTE) solver that includes magnetic field, but not the stability of the iterative solution method. In this work, we perform a stability analysis of this deterministic algorithm including an investigation of the convergence rate dependencies on the magnetic field, material density, energy, and anisotropy expansion. The iterative convergence rate of the continuous and discretized LBTE including magnetic fields is determined by analyzing the spectral radius using Fourier analysis for the stationary source iteration (SI) scheme. The spectral radius is calculated when the magnetic field is included (1) as a part of the iteration source, and (2) inside the streaming-collision operator. The non-stationary Krylov subspace solver GMRES is also investigated as a potential method to accelerate the iterative convergence, and an angular parallel computing methodology is investigated as a method to enhance the efficiency of the calculation. SI is found to be unstable when the magnetic field is part of the iteration source, but unconditionally stable when the magnetic field is included in the streaming-collision operator. The discretized LBTE with magnetic fields using a space-angle upwind stabilized discontinuous finite element method (DFEM) was also found to be unconditionally stable, but the spectral radius rapidly reaches unity for very low density media and increasing magnetic field strengths indicating arbitrarily slow convergence rates. However, GMRES is shown to significantly accelerate the DFEM convergence rate showing only a weak dependence on the magnetic field. In addition, the use of an angular parallel computing strategy is shown to potentially increase the efficiency of the dose calculation. © 2018 Institute of Physics and Engineering in Medicine.
PARALLELISATION OF THE MODEL-BASED ITERATIVE RECONSTRUCTION ALGORITHM DIRA.
Örtenberg, A; Magnusson, M; Sandborg, M; Alm Carlsson, G; Malusek, A
2016-06-01
New paradigms for parallel programming have been devised to simplify software development on multi-core processors and many-core graphical processing units (GPU). Despite their obvious benefits, the parallelisation of existing computer programs is not an easy task. In this work, the use of the Open Multiprocessing (OpenMP) and Open Computing Language (OpenCL) frameworks is considered for the parallelisation of the model-based iterative reconstruction algorithm DIRA with the aim to significantly shorten the code's execution time. Selected routines were parallelised using OpenMP and OpenCL libraries; some routines were converted from MATLAB to C and optimised. Parallelisation of the code with the OpenMP was easy and resulted in an overall speedup of 15 on a 16-core computer. Parallelisation with OpenCL was more difficult owing to differences between the central processing unit and GPU architectures. The resulting speedup was substantially lower than the theoretical peak performance of the GPU; the cause was explained. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
A Model and Simple Iterative Algorithm for Redundancy Analysis.
ERIC Educational Resources Information Center
Fornell, Claes; And Others
1988-01-01
This paper shows that redundancy maximization with J. K. Johansson's extension can be accomplished via a simple iterative algorithm based on H. Wold's Partial Least Squares. The model and the iterative algorithm for the least squares approach to redundancy maximization are presented. (TJH)
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun
2017-10-30
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less
Decomposition of timed automata for solving scheduling problems
NASA Astrophysics Data System (ADS)
Nishi, Tatsushi; Wakatake, Masato
2014-03-01
A decomposition algorithm for scheduling problems based on timed automata (TA) model is proposed. The problem is represented as an optimal state transition problem for TA. The model comprises of the parallel composition of submodels such as jobs and resources. The procedure of the proposed methodology can be divided into two steps. The first step is to decompose the TA model into several submodels by using decomposable condition. The second step is to combine individual solution of subproblems for the decomposed submodels by the penalty function method. A feasible solution for the entire model is derived through the iterated computation of solving the subproblem for each submodel. The proposed methodology is applied to solve flowshop and jobshop scheduling problems. Computational experiments demonstrate the effectiveness of the proposed algorithm compared with a conventional TA scheduling algorithm without decomposition.
Krylov subspace methods on supercomputers
NASA Technical Reports Server (NTRS)
Saad, Youcef
1988-01-01
A short survey of recent research on Krylov subspace methods with emphasis on implementation on vector and parallel computers is presented. Conjugate gradient methods have proven very useful on traditional scalar computers, and their popularity is likely to increase as three-dimensional models gain importance. A conservative approach to derive effective iterative techniques for supercomputers has been to find efficient parallel/vector implementations of the standard algorithms. The main source of difficulty in the incomplete factorization preconditionings is in the solution of the triangular systems at each step. A few approaches consisting of implementing efficient forward and backward triangular solutions are described in detail. Polynomial preconditioning as an alternative to standard incomplete factorization techniques is also discussed. Another efficient approach is to reorder the equations so as to improve the structure of the matrix to achieve better parallelism or vectorization. An overview of these and other ideas and their effectiveness or potential for different types of architectures is given.
GPU-accelerated regularized iterative reconstruction for few-view cone beam CT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Matenine, Dmitri, E-mail: dmitri.matenine.1@ulaval.ca; Goussard, Yves, E-mail: yves.goussard@polymtl.ca; Després, Philippe, E-mail: philippe.despres@phy.ulaval.ca
2015-04-15
Purpose: The present work proposes an iterative reconstruction technique designed for x-ray transmission computed tomography (CT). The main objective is to provide a model-based solution to the cone-beam CT reconstruction problem, yielding accurate low-dose images via few-views acquisitions in clinically acceptable time frames. Methods: The proposed technique combines a modified ordered subsets convex (OSC) algorithm and the total variation minimization (TV) regularization technique and is called OSC-TV. The number of subsets of each OSC iteration follows a reduction pattern in order to ensure the best performance of the regularization method. Considering the high computational cost of the algorithm, it ismore » implemented on a graphics processing unit, using parallelization to accelerate computations. Results: The reconstructions were performed on computer-simulated as well as human pelvic cone-beam CT projection data and image quality was assessed. In terms of convergence and image quality, OSC-TV performs well in reconstruction of low-dose cone-beam CT data obtained via a few-view acquisition protocol. It compares favorably to the few-view TV-regularized projections onto convex sets (POCS-TV) algorithm. It also appears to be a viable alternative to full-dataset filtered backprojection. Execution times are of 1–2 min and are compatible with the typical clinical workflow for nonreal-time applications. Conclusions: Considering the image quality and execution times, this method may be useful for reconstruction of low-dose clinical acquisitions. It may be of particular benefit to patients who undergo multiple acquisitions by reducing the overall imaging radiation dose and associated risks.« less
Acoustic 3D modeling by the method of integral equations
NASA Astrophysics Data System (ADS)
Malovichko, M.; Khokhlov, N.; Yavich, N.; Zhdanov, M.
2018-02-01
This paper presents a parallel algorithm for frequency-domain acoustic modeling by the method of integral equations (IE). The algorithm is applied to seismic simulation. The IE method reduces the size of the problem but leads to a dense system matrix. A tolerable memory consumption and numerical complexity were achieved by applying an iterative solver, accompanied by an effective matrix-vector multiplication operation, based on the fast Fourier transform (FFT). We demonstrate that, the IE system matrix is better conditioned than that of the finite-difference (FD) method, and discuss its relation to a specially preconditioned FD matrix. We considered several methods of matrix-vector multiplication for the free-space and layered host models. The developed algorithm and computer code were benchmarked against the FD time-domain solution. It was demonstrated that, the method could accurately calculate the seismic field for the models with sharp material boundaries and a point source and receiver located close to the free surface. We used OpenMP to speed up the matrix-vector multiplication, while MPI was used to speed up the solution of the system equations, and also for parallelizing across multiple sources. The practical examples and efficiency tests are presented as well.
Randomized Dynamic Mode Decomposition
NASA Astrophysics Data System (ADS)
Erichson, N. Benjamin; Brunton, Steven L.; Kutz, J. Nathan
2017-11-01
The dynamic mode decomposition (DMD) is an equation-free, data-driven matrix decomposition that is capable of providing accurate reconstructions of spatio-temporal coherent structures arising in dynamical systems. We present randomized algorithms to compute the near-optimal low-rank dynamic mode decomposition for massive datasets. Randomized algorithms are simple, accurate and able to ease the computational challenges arising with `big data'. Moreover, randomized algorithms are amenable to modern parallel and distributed computing. The idea is to derive a smaller matrix from the high-dimensional input data matrix using randomness as a computational strategy. Then, the dynamic modes and eigenvalues are accurately learned from this smaller representation of the data, whereby the approximation quality can be controlled via oversampling and power iterations. Here, we present randomized DMD algorithms that are categorized by how many passes the algorithm takes through the data. Specifically, the single-pass randomized DMD does not require data to be stored for subsequent passes. Thus, it is possible to approximately decompose massive fluid flows (stored out of core memory, or not stored at all) using single-pass algorithms, which is infeasible with traditional DMD algorithms.
A Single LiDAR-Based Feature Fusion Indoor Localization Algorithm.
Wang, Yun-Ting; Peng, Chao-Chung; Ravankar, Ankit A; Ravankar, Abhijeet
2018-04-23
In past years, there has been significant progress in the field of indoor robot localization. To precisely recover the position, the robots usually relies on multiple on-board sensors. Nevertheless, this affects the overall system cost and increases computation. In this research work, we considered a light detection and ranging (LiDAR) device as the only sensor for detecting surroundings and propose an efficient indoor localization algorithm. To attenuate the computation effort and preserve localization robustness, a weighted parallel iterative closed point (WP-ICP) with interpolation is presented. As compared to the traditional ICP, the point cloud is first processed to extract corners and line features before applying point registration. Later, points labeled as corners are only matched with the corner candidates. Similarly, points labeled as lines are only matched with the lines candidates. Moreover, their ICP confidence levels are also fused in the algorithm, which make the pose estimation less sensitive to environment uncertainties. The proposed WP-ICP architecture reduces the probability of mismatch and thereby reduces the ICP iterations. Finally, based on given well-constructed indoor layouts, experiment comparisons are carried out under both clean and perturbed environments. It is shown that the proposed method is effective in significantly reducing computation effort and is simultaneously able to preserve localization precision.
A Single LiDAR-Based Feature Fusion Indoor Localization Algorithm
Wang, Yun-Ting; Peng, Chao-Chung; Ravankar, Ankit A.; Ravankar, Abhijeet
2018-01-01
In past years, there has been significant progress in the field of indoor robot localization. To precisely recover the position, the robots usually relies on multiple on-board sensors. Nevertheless, this affects the overall system cost and increases computation. In this research work, we considered a light detection and ranging (LiDAR) device as the only sensor for detecting surroundings and propose an efficient indoor localization algorithm. To attenuate the computation effort and preserve localization robustness, a weighted parallel iterative closed point (WP-ICP) with interpolation is presented. As compared to the traditional ICP, the point cloud is first processed to extract corners and line features before applying point registration. Later, points labeled as corners are only matched with the corner candidates. Similarly, points labeled as lines are only matched with the lines candidates. Moreover, their ICP confidence levels are also fused in the algorithm, which make the pose estimation less sensitive to environment uncertainties. The proposed WP-ICP architecture reduces the probability of mismatch and thereby reduces the ICP iterations. Finally, based on given well-constructed indoor layouts, experiment comparisons are carried out under both clean and perturbed environments. It is shown that the proposed method is effective in significantly reducing computation effort and is simultaneously able to preserve localization precision. PMID:29690624
NASA Astrophysics Data System (ADS)
Shimojo, Fuyuki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
2008-02-01
A linear-scaling algorithm based on a divide-and-conquer (DC) scheme has been designed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of the density functional theory (DFT). Electronic wave functions are represented on a real-space grid, which is augmented with a coarse multigrid to accelerate the convergence of iterative solutions and with adaptive fine grids around atoms to accurately calculate ionic pseudopotentials. Spatial decomposition is employed to implement the hierarchical-grid DC-DFT algorithm on massively parallel computers. The largest benchmark tests include 11.8×106 -atom ( 1.04×1012 electronic degrees of freedom) calculation on 131 072 IBM BlueGene/L processors. The DC-DFT algorithm has well-defined parameters to control the data locality, with which the solutions converge rapidly. Also, the total energy is well conserved during the MD simulation. We perform first-principles MD simulations based on the DC-DFT algorithm, in which large system sizes bring in excellent agreement with x-ray scattering measurements for the pair-distribution function of liquid Rb and allow the description of low-frequency vibrational modes of graphene. The band gap of a CdSe nanorod calculated by the DC-DFT algorithm agrees well with the available conventional DFT results. With the DC-DFT algorithm, the band gap is calculated for larger system sizes until the result reaches the asymptotic value.
Agile parallel bioinformatics workflow management using Pwrake.
Mishima, Hiroyuki; Sasaki, Kensaku; Tanaka, Masahiro; Tatebe, Osamu; Yoshiura, Koh-Ichiro
2011-09-08
In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be over-weighted for actual bioinformatics practices. We realize that quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment are often prioritized in scientific workflow management. These features have a greater affinity with the agile software development method through iterative development phases after trial and error.Here, we show the application of a scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, the flexibility of which has been demonstrated in the astronomy domain. Therefore, we hypothesize that Pwrake also has advantages in actual bioinformatics workflows. We implemented the Pwrake workflows to process next generation sequencing data using the Genomic Analysis Toolkit (GATK) and Dindel. GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that in practice, actual scientific workflow development iterates over two phases, the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate modularity of the GATK and Dindel workflows. Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain specific language design built on Ruby gives the flexibility of rakefiles for writing scientific workflows. Furthermore, readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows.
Agile parallel bioinformatics workflow management using Pwrake
2011-01-01
Background In bioinformatics projects, scientific workflow systems are widely used to manage computational procedures. Full-featured workflow systems have been proposed to fulfil the demand for workflow management. However, such systems tend to be over-weighted for actual bioinformatics practices. We realize that quick deployment of cutting-edge software implementing advanced algorithms and data formats, and continuous adaptation to changes in computational resources and the environment are often prioritized in scientific workflow management. These features have a greater affinity with the agile software development method through iterative development phases after trial and error. Here, we show the application of a scientific workflow system Pwrake to bioinformatics workflows. Pwrake is a parallel workflow extension of Ruby's standard build tool Rake, the flexibility of which has been demonstrated in the astronomy domain. Therefore, we hypothesize that Pwrake also has advantages in actual bioinformatics workflows. Findings We implemented the Pwrake workflows to process next generation sequencing data using the Genomic Analysis Toolkit (GATK) and Dindel. GATK and Dindel workflows are typical examples of sequential and parallel workflows, respectively. We found that in practice, actual scientific workflow development iterates over two phases, the workflow definition phase and the parameter adjustment phase. We introduced separate workflow definitions to help focus on each of the two developmental phases, as well as helper methods to simplify the descriptions. This approach increased iterative development efficiency. Moreover, we implemented combined workflows to demonstrate modularity of the GATK and Dindel workflows. Conclusions Pwrake enables agile management of scientific workflows in the bioinformatics domain. The internal domain specific language design built on Ruby gives the flexibility of rakefiles for writing scientific workflows. Furthermore, readability and maintainability of rakefiles may facilitate sharing workflows among the scientific community. Workflows for GATK and Dindel are available at http://github.com/misshie/Workflows. PMID:21899774
Optimal parallel solution of sparse triangular systems
NASA Technical Reports Server (NTRS)
Alvarado, Fernando L.; Schreiber, Robert
1990-01-01
A method for the parallel solution of triangular sets of equations is described that is appropriate when there are many right-handed sides. By preprocessing, the method can reduce the number of parallel steps required to solve Lx = b compared to parallel forward or backsolve. Applications are to iterative solvers with triangular preconditioners, to structural analysis, or to power systems applications, where there may be many right-handed sides (not all available a priori). The inverse of L is represented as a product of sparse triangular factors. The problem is to find a factored representation of this inverse of L with the smallest number of factors (or partitions), subject to the requirement that no new nonzero elements be created in the formation of these inverse factors. A method from an earlier reference is shown to solve this problem. This method is improved upon by constructing a permutation of the rows and columns of L that preserves triangularity and allow for the best possible such partition. A number of practical examples and algorithmic details are presented. The parallelism attainable is illustrated by means of elimination trees and clique trees.
Iterative algorithm for joint zero diagonalization with application in blind source separation.
Zhang, Wei-Tao; Lou, Shun-Tian
2011-07-01
A new iterative algorithm for the nonunitary joint zero diagonalization of a set of matrices is proposed for blind source separation applications. On one hand, since the zero diagonalizer of the proposed algorithm is constructed iteratively by successive multiplications of an invertible matrix, the singular solutions that occur in the existing nonunitary iterative algorithms are naturally avoided. On the other hand, compared to the algebraic method for joint zero diagonalization, the proposed algorithm requires fewer matrices to be zero diagonalized to yield even better performance. The extension of the algorithm to the complex and nonsquare mixing cases is also addressed. Numerical simulations on both synthetic data and blind source separation using time-frequency distributions illustrate the performance of the algorithm and provide a comparison to the leading joint zero diagonalization schemes.
High resolution x-ray CMT: Reconstruction methods
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, J.K.
This paper qualitatively discusses the primary characteristics of methods for reconstructing tomographic images from a set of projections. These reconstruction methods can be categorized as either {open_quotes}analytic{close_quotes} or {open_quotes}iterative{close_quotes} techniques. Analytic algorithms are derived from the formal inversion of equations describing the imaging process, while iterative algorithms incorporate a model of the imaging process and provide a mechanism to iteratively improve image estimates. Analytic reconstruction algorithms are typically computationally more efficient than iterative methods; however, analytic algorithms are available for a relatively limited set of imaging geometries and situations. Thus, the framework of iterative reconstruction methods is better suited formore » high accuracy, tomographic reconstruction codes.« less
Limits on the Efficiency of Event-Based Algorithms for Monte Carlo Neutron Transport
DOE Office of Scientific and Technical Information (OSTI.GOV)
Romano, Paul K.; Siegel, Andrew R.
The traditional form of parallelism in Monte Carlo particle transport simulations, wherein each individual particle history is considered a unit of work, does not lend itself well to data-level parallelism. Event-based algorithms, which were originally used for simulations on vector processors, may offer a path toward better utilizing data-level parallelism in modern computer architectures. In this study, a simple model is developed for estimating the efficiency of the event-based particle transport algorithm under two sets of assumptions. Data collected from simulations of four reactor problems using OpenMC was then used in conjunction with the models to calculate the speedup duemore » to vectorization as a function of two parameters: the size of the particle bank and the vector width. When each event type is assumed to have constant execution time, the achievable speedup is directly related to the particle bank size. We observed that the bank size generally needs to be at least 20 times greater than vector size in order to achieve vector efficiency greater than 90%. When the execution times for events are allowed to vary, however, the vector speedup is also limited by differences in execution time for events being carried out in a single event-iteration. For some problems, this implies that vector effciencies over 50% may not be attainable. While there are many factors impacting performance of an event-based algorithm that are not captured by our model, it nevertheless provides insights into factors that may be limiting in a real implementation.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Brian M.; Ebeida, Mohamed Salah; Eldred, Michael S
The Dakota (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a exible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quanti cation with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components requiredmore » for iterative systems analyses, the Dakota toolkit provides a exible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a theoretical manual for selected algorithms implemented within the Dakota software. It is not intended as a comprehensive theoretical treatment, since a number of existing texts cover general optimization theory, statistical analysis, and other introductory topics. Rather, this manual is intended to summarize a set of Dakota-related research publications in the areas of surrogate-based optimization, uncertainty quanti cation, and optimization under uncertainty that provide the foundation for many of Dakota's iterative analysis capabilities.« less
GPU-based relative fuzzy connectedness image segmentation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.
2013-01-15
Purpose:Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an Script-Small-L {sub {infinity}}-based energy, are known as relative fuzzymore » connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8 Multiplication-Sign , 22.9 Multiplication-Sign , 20.9 Multiplication-Sign , and 17.5 Multiplication-Sign , correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.« less
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
2014-01-01
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
ZettaBricks: A Language Compiler and Runtime System for Anyscale Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amarasinghe, Saman
This grant supported the ZettaBricks and OpenTuner projects. ZettaBricks is a new implicitly parallel language and compiler where defining multiple implementations of multiple algorithms to solve a problem is the natural way of programming. ZettaBricks makes algorithmic choice a first class construct of the language. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The ZettaBricks compiler autotunes programs by making both fine-grained as well as algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. Additionally, ZettaBricks introduces novel techniques to autotune algorithms for differentmore » convergence criteria. When choosing between various direct and iterative methods, the ZettaBricks compiler is able to tune a program in such a way that delivers near-optimal efficiency for any desired level of accuracy. The compiler has the flexibility of utilizing different convergence criteria for the various components within a single algorithm, providing the user with accuracy choice alongside algorithmic choice. OpenTuner is a generalization of the experience gained in building an autotuner for ZettaBricks. OpenTuner is a new open source framework for building domain-specific multi-objective program autotuners. OpenTuner supports fully-customizable configuration representations, an extensible technique representation to allow for domain-specific techniques, and an easy to use interface for communicating with the program to be autotuned. A key capability inside OpenTuner is the use of ensembles of disparate search techniques simultaneously; techniques that perform well will dynamically be allocated a larger proportion of tests.« less
NASA Astrophysics Data System (ADS)
Susmikanti, Mike; Dewayatna, Winter; Sulistyo, Yos
2014-09-01
One of the research activities in support of commercial radioisotope production program is a safety research on target FPM (Fission Product Molybdenum) irradiation. FPM targets form a tube made of stainless steel which contains nuclear-grade high-enrichment uranium. The FPM irradiation tube is intended to obtain fission products. Fission materials such as Mo99 used widely the form of kits in the medical world. The neutronics problem is solved using first-order perturbation theory derived from the diffusion equation for four groups. In contrast, Mo isotopes have longer half-lives, about 3 days (66 hours), so the delivery of radioisotopes to consumer centers and storage is possible though still limited. The production of this isotope potentially gives significant economic value. The criticality and flux in multigroup diffusion model was calculated for various irradiation positions and uranium contents. This model involves complex computation, with large and sparse matrix system. Several parallel algorithms have been developed for the sparse and large matrix solution. In this paper, a successive over-relaxation (SOR) algorithm was implemented for the calculation of reactivity coefficients which can be done in parallel. Previous works performed reactivity calculations serially with Gauss-Seidel iteratives. The parallel method can be used to solve multigroup diffusion equation system and calculate the criticality and reactivity coefficients. In this research a computer code was developed to exploit parallel processing to perform reactivity calculations which were to be used in safety analysis. The parallel processing in the multicore computer system allows the calculation to be performed more quickly. This code was applied for the safety limits calculation of irradiated FPM targets containing highly enriched uranium. The results of calculations neutron show that for uranium contents of 1.7676 g and 6.1866 g (× 106 cm-1) in a tube, their delta reactivities are the still within safety limits; however, for 7.9542 g and 8.838 g (× 106 cm-1) the limits were exceeded.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Hixon, Duane; Sankar, L. N.
1993-01-01
During the past two decades, there has been significant progress in the field of numerical simulation of unsteady compressible viscous flows. At present, a variety of solution techniques exist such as the transonic small disturbance analyses (TSD), transonic full potential equation-based methods, unsteady Euler solvers, and unsteady Navier-Stokes solvers. These advances have been made possible by developments in three areas: (1) improved numerical algorithms; (2) automation of body-fitted grid generation schemes; and (3) advanced computer architectures with vector processing and massively parallel processing features. In this work, the GMRES scheme has been considered as a candidate for acceleration of a Newton iteration time marching scheme for unsteady 2-D and 3-D compressible viscous flow calculation; from preliminary calculations, this will provide up to a 65 percent reduction in the computer time requirements over the existing class of explicit and implicit time marching schemes. The proposed method has ben tested on structured grids, but is flexible enough for extension to unstructured grids. The described scheme has been tested only on the current generation of vector processor architecture of the Cray Y/MP class, but should be suitable for adaptation to massively parallel machines.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huang, Renke; Jin, Shuangshuang; Chen, Yousu
This paper presents a faster-than-real-time dynamic simulation software package that is designed for large-size power system dynamic simulation. It was developed on the GridPACKTM high-performance computing (HPC) framework. The key features of the developed software package include (1) faster-than-real-time dynamic simulation for a WECC system (17,000 buses) with different types of detailed generator, controller, and relay dynamic models, (2) a decoupled parallel dynamic simulation algorithm with optimized computation architecture to better leverage HPC resources and technologies, (3) options for HPC-based linear and iterative solvers, (4) hidden HPC details, such as data communication and distribution, to enable development centered on mathematicalmore » models and algorithms rather than on computational details for power system researchers, and (5) easy integration of new dynamic models and related algorithms into the software package.« less
An iterative method for the Helmholtz equation
NASA Technical Reports Server (NTRS)
Bayliss, A.; Goldstein, C. I.; Turkel, E.
1983-01-01
An iterative algorithm for the solution of the Helmholtz equation is developed. The algorithm is based on a preconditioned conjugate gradient iteration for the normal equations. The preconditioning is based on an SSOR sweep for the discrete Laplacian. Numerical results are presented for a wide variety of problems of physical interest and demonstrate the effectiveness of the algorithm.
Iterative Nonlocal Total Variation Regularization Method for Image Restoration
Xu, Huanyu; Sun, Quansen; Luo, Nan; Cao, Guo; Xia, Deshen
2013-01-01
In this paper, a Bregman iteration based total variation image restoration algorithm is proposed. Based on the Bregman iteration, the algorithm splits the original total variation problem into sub-problems that are easy to solve. Moreover, non-local regularization is introduced into the proposed algorithm, and a method to choose the non-local filter parameter locally and adaptively is proposed. Experiment results show that the proposed algorithms outperform some other regularization methods. PMID:23776560
Twostep-by-twostep PIRK-type PC methods with continuous output formulas
NASA Astrophysics Data System (ADS)
Cong, Nguyen Huu; Xuan, Le Ngoc
2008-11-01
This paper deals with parallel predictor-corrector (PC) iteration methods based on collocation Runge-Kutta (RK) corrector methods with continuous output formulas for solving nonstiff initial-value problems (IVPs) for systems of first-order differential equations. At nth step, the continuous output formulas are used not only for predicting the stage values in the PC iteration methods but also for calculating the step values at (n+2)th step. In this case, the integration processes can be proceeded twostep-by-twostep. The resulting twostep-by-twostep (TBT) parallel-iterated RK-type (PIRK-type) methods with continuous output formulas (twostep-by-twostep PIRKC methods or TBTPIRKC methods) give us a faster integration process. Fixed stepsize applications of these TBTPIRKC methods to a few widely-used test problems reveal that the new PC methods are much more efficient when compared with the well-known parallel-iterated RK methods (PIRK methods), parallel-iterated RK-type PC methods with continuous output formulas (PIRKC methods) and sequential explicit RK codes DOPRI5 and DOP853 available from the literature.
HO2 rovibrational eigenvalue studies for nonzero angular momentum
NASA Astrophysics Data System (ADS)
Wu, Xudong T.; Hayes, Edward F.
1997-08-01
An efficient parallel algorithm is reported for determining all bound rovibrational energy levels for the HO2 molecule for nonzero angular momentum values, J=1, 2, and 3. Performance tests on the CRAY T3D indicate that the algorithm scales almost linearly when up to 128 processors are used. Sustained performance levels of up to 3.8 Gflops have been achieved using 128 processors for J=3. The algorithm uses a direct product discrete variable representation (DVR) basis and the implicitly restarted Lanczos method (IRLM) of Sorensen to compute the eigenvalues of the polyatomic Hamiltonian. Since the IRLM is an iterative method, it does not require storage of the full Hamiltonian matrix—it only requires the multiplication of the Hamiltonian matrix by a vector. When the IRLM is combined with a formulation such as DVR, which produces a very sparse matrix, both memory and computation times can be reduced dramatically. This algorithm has the potential to achieve even higher performance levels for larger values of the total angular momentum.
Real-time depth camera tracking with geometrically stable weight algorithm
NASA Astrophysics Data System (ADS)
Fu, Xingyin; Zhu, Feng; Qi, Feng; Wang, Mingming
2017-03-01
We present an approach for real-time camera tracking with depth stream. Existing methods are prone to drift in sceneries without sufficient geometric information. First, we propose a new weight method for an iterative closest point algorithm commonly used in real-time dense mapping and tracking systems. By detecting uncertainty in pose and increasing weight of points that constrain unstable transformations, our system achieves accurate and robust trajectory estimation results. Our pipeline can be fully parallelized with GPU and incorporated into the current real-time depth camera tracking system seamlessly. Second, we compare the state-of-the-art weight algorithms and propose a weight degradation algorithm according to the measurement characteristics of a consumer depth camera. Third, we use Nvidia Kepler Shuffle instructions during warp and block reduction to improve the efficiency of our system. Results on the public TUM RGB-D database benchmark demonstrate that our camera tracking system achieves state-of-the-art results both in accuracy and efficiency.
Discrete-Time Stable Generalized Self-Learning Optimal Control With Approximation Errors.
Wei, Qinglai; Li, Benkai; Song, Ruizhuo
2018-04-01
In this paper, a generalized policy iteration (GPI) algorithm with approximation errors is developed for solving infinite horizon optimal control problems for nonlinear systems. The developed stable GPI algorithm provides a general structure of discrete-time iterative adaptive dynamic programming algorithms, by which most of the discrete-time reinforcement learning algorithms can be described using the GPI structure. It is for the first time that approximation errors are explicitly considered in the GPI algorithm. The properties of the stable GPI algorithm with approximation errors are analyzed. The admissibility of the approximate iterative control law can be guaranteed if the approximation errors satisfy the admissibility criteria. The convergence of the developed algorithm is established, which shows that the iterative value function is convergent to a finite neighborhood of the optimal performance index function, if the approximate errors satisfy the convergence criterion. Finally, numerical examples and comparisons are presented.
Evaluation of the effects of patient arm attenuation in SPECT cardiac perfusion imaging
NASA Astrophysics Data System (ADS)
Luo, Dershan; King, M. A.; Pan, Tin-Su; Xia, Weishi
1996-12-01
It was hypothesized that the use of attenuation correction could compensate for degradation in the uniformity of apparent localization of imaging agents seen in cardiac walls when patients are imaged with arms at their sides. Noise-free simulations of the digital MCAT phantom were employed to investigate this hypothesis. Four variations in camera size and collimation scheme were investigated. We observed that: 1) without attenuation correction, the arms had little additional influences on the uniformity of the heart for 180/spl deg/ reconstructions and caused a small increase in nonuniformity for 360/spl deg/ reconstructions, where the impact of both arms was included; 2) change in patient size had more of an impact on count uniformity than the presence of the arms, either with or without attenuation correction; 3) for a low number of iterations and large patient size, slightly better uniformity was obtained from parallel emission data than from fan-beam emission data, independent of whether parallel or fan-beam transmission data was used to reconstruct the attenuation maps; and 4) for all camera configurations, uniformity was improved with attenuation correction and, given sufficient number of iterations, it was compatible among different imaging geometry combinations. Thus, iterative algorithms can compensate for the additional attenuation imposed by larger patients or having the arms on the sides. When the arms are at the sides of the patient, however, a larger radius of rotation may be required, resulting in decreased spatial resolution.
Evaluation of the effects of patient arm attenuation in SPECT cardiac perfusion imaging
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, D.; King, M.A.; Pan, T.S.
1996-12-01
It was hypothesized that the use of attenuation correction could compensate for degradation in the uniformity of apparent localization of imaging agents seen in cardiac walls when patients are imaged with arms at their sides. Noise-free simulations of the digital MCAT phantom were employed to investigate this hypothesis. Four variations in camera size and collimation scheme were investigated. The authors observed that: (1) without attenuation correction, the arms had little additional influences on the uniformity of the heart for 180{degree} reconstructions and caused a small increase in nonuniformity for 360{degree} reconstructions, where the impact of both arms was included; (2)more » change in patient size had more of an impact on count uniformity than the presence of the arms, either with or without attenuation correction; (3) for a low number of iterations and large patient size, slightly better uniformity was obtained from parallel emission data than from fan-beam emission data, independent of whether parallel or fan-beam transmission data was used to reconstruct the attenuation maps; and (4) for all camera configurations, uniformity was improved with attenuation correction and, given sufficient number of iterations, it was compatible among different imaging geometry combinations. Thus, iterative algorithms can compensate for the additional attenuation imposed by larger patients or having the arms on the sides. When the arms are at the sides of the patient, however, a larger radius of rotation may be required, resulting in decreased spatial resolution.« less
Nonlinear relaxation algorithms for circuit simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saleh, R.A.
Circuit simulation is an important Computer-Aided Design (CAD) tool in the design of Integrated Circuits (IC). However, the standard techniques used in programs such as SPICE result in very long computer-run times when applied to large problems. In order to reduce the overall run time, a number of new approaches to circuit simulation were developed and are described. These methods are based on nonlinear relaxation techniques and exploit the relative inactivity of large circuits. Simple waveform-processing techniques are described to determine the maximum possible speed improvement that can be obtained by exploiting this property of large circuits. Three simulation algorithmsmore » are described, two of which are based on the Iterated Timing Analysis (ITA) method and a third based on the Waveform-Relaxation Newton (WRN) method. New programs that incorporate these techniques were developed and used to simulate a variety of industrial circuits. The results from these simulations are provided. The techniques are shown to be much faster than the standard approach. In addition, a number of parallel aspects of these algorithms are described, and a general space-time model of parallel-task scheduling is developed.« less
Radiofrequency pulse design in parallel transmission under strict temperature constraints.
Boulant, Nicolas; Massire, Aurélien; Amadon, Alexis; Vignaud, Alexandre
2014-09-01
To gain radiofrequency (RF) pulse performance by directly addressing the temperature constraints, as opposed to the specific absorption rate (SAR) constraints, in parallel transmission at ultra-high field. The magnitude least-squares RF pulse design problem under hard SAR constraints was solved repeatedly by using the virtual observation points and an active-set algorithm. The SAR constraints were updated at each iteration based on the result of a thermal simulation. The numerical study was performed for an SAR-demanding and simplified time of flight sequence using B1 and ΔB0 maps obtained in vivo on a human brain at 7T. The proposed adjustment of the SAR constraints combined with an active-set algorithm provided higher flexibility in RF pulse design within a reasonable time. The modifications of those constraints acted directly upon the thermal response as desired. Although further confidence in the thermal models is needed, this study shows that RF pulse design under strict temperature constraints is within reach, allowing better RF pulse performance and faster acquisitions at ultra-high fields at the cost of higher sequence complexity. Copyright © 2013 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Yuan, Jian-guo; Tong, Qing-zhen; Huang, Sheng; Wang, Yong
2013-11-01
An effective hierarchical reliable belief propagation (HRBP) decoding algorithm is proposed according to the structural characteristics of systematically constructed Gallager low-density parity-check (SCG-LDPC) codes. The novel decoding algorithm combines the layered iteration with the reliability judgment, and can greatly reduce the number of the variable nodes involved in the subsequent iteration process and accelerate the convergence rate. The result of simulation for SCG-LDPC(3969,3720) code shows that the novel HRBP decoding algorithm can greatly reduce the computing amount at the condition of ensuring the performance compared with the traditional belief propagation (BP) algorithm. The bit error rate (BER) of the HRBP algorithm is considerable at the threshold value of 15, but in the subsequent iteration process, the number of the variable nodes for the HRBP algorithm can be reduced by about 70% at the high signal-to-noise ratio (SNR) compared with the BP algorithm. When the threshold value is further increased, the HRBP algorithm will gradually degenerate into the layered-BP algorithm, but at the BER of 10-7 and the maximal iteration number of 30, the net coding gain (NCG) of the HRBP algorithm is 0.2 dB more than that of the BP algorithm, and the average iteration times can be reduced by about 40% at the high SNR. Therefore, the novel HRBP decoding algorithm is more suitable for optical communication systems.
Array architectures for iterative algorithms
NASA Technical Reports Server (NTRS)
Jagadish, Hosagrahar V.; Rao, Sailesh K.; Kailath, Thomas
1987-01-01
Regular mesh-connected arrays are shown to be isomorphic to a class of so-called regular iterative algorithms. For a wide variety of problems it is shown how to obtain appropriate iterative algorithms and then how to translate these algorithms into arrays in a systematic fashion. Several 'systolic' arrays presented in the literature are shown to be specific cases of the variety of architectures that can be derived by the techniques presented here. These include arrays for Fourier Transform, Matrix Multiplication, and Sorting.
NASA Astrophysics Data System (ADS)
Goossens, Bart; Aelterman, Jan; Luong, Hi"p.; Pižurica, Aleksandra; Philips, Wilfried
2011-09-01
The shearlet transform is a recent sibling in the family of geometric image representations that provides a traditional multiresolution analysis combined with a multidirectional analysis. In this paper, we present a fast DFT-based analysis and synthesis scheme for the 2D discrete shearlet transform. Our scheme conforms to the continuous shearlet theory to high extent, provides perfect numerical reconstruction (up to floating point rounding errors) in a non-iterative scheme and is highly suitable for parallel implementation (e.g. FPGA, GPU). We show that our discrete shearlet representation is also a tight frame and the redundancy factor of the transform is around 2.6, independent of the number of analysis directions. Experimental denoising results indicate that the transform performs the same or even better than several related multiresolution transforms, while having a significantly lower redundancy factor.
A fast algorithm to compute precise type-2 centroids for real-time control applications.
Chakraborty, Sumantra; Konar, Amit; Ralescu, Anca; Pal, Nikhil R
2015-02-01
An interval type-2 fuzzy set (IT2 FS) is characterized by its upper and lower membership functions containing all possible embedded fuzzy sets, which together is referred to as the footprint of uncertainty (FOU). The FOU results in a span of uncertainty measured in the defuzzified space and is determined by the positional difference of the centroids of all the embedded fuzzy sets taken together. This paper provides a closed-form formula to evaluate the span of uncertainty of an IT2 FS. The closed-form formula offers a precise measurement of the degree of uncertainty in an IT2 FS with a runtime complexity less than that of the classical iterative Karnik-Mendel algorithm and other formulations employing the iterative Newton-Raphson algorithm. This paper also demonstrates a real-time control application using the proposed closed-form formula of centroids with reduced root mean square error and computational overhead than those of the existing methods. Computer simulations for this real-time control application indicate that parallel realization of the IT2 defuzzification outperforms its competitors with respect to maximum overshoot even at high sampling rates. Furthermore, in the presence of measurement noise in system (plant) states, the proposed IT2 FS based scheme outperforms its type-1 counterpart with respect to peak overshoot and root mean square error in plant response.
NASA Astrophysics Data System (ADS)
Sorokin, V. A.; Volkov, Yu V.; Sherstneva, A. I.; Botygin, I. A.
2016-11-01
This paper overviews a method of generating climate regions based on an analytic signal theory. When applied to atmospheric surface layer temperature data sets, the method allows forming climatic structures with the corresponding changes in the temperature to make conclusions on the uniformity of climate in an area and to trace the climate changes in time by analyzing the type group shifts. The algorithm is based on the fact that the frequency spectrum of the thermal oscillation process is narrow-banded and has only one mode for most weather stations. This allows using the analytic signal theory, causality conditions and introducing an oscillation phase. The annual component of the phase, being a linear function, was removed by the least squares method. The remaining phase fluctuations allow consistent studying of their coordinated behavior and timing, using the Pearson correlation coefficient for dependence evaluation. This study includes program experiments to evaluate the calculation efficiency in the phase grouping task. The paper also overviews some single-threaded and multi-threaded computing models. It is shown that the phase grouping algorithm for meteorological data can be parallelized and that a multi-threaded implementation leads to a 25-30% increase in the performance.
Hudson, H M; Ma, J; Green, P
1994-01-01
Many algorithms for medical image reconstruction adopt versions of the expectation-maximization (EM) algorithm. In this approach, parameter estimates are obtained which maximize a complete data likelihood or penalized likelihood, in each iteration. Implicitly (and sometimes explicitly) penalized algorithms require smoothing of the current reconstruction in the image domain as part of their iteration scheme. In this paper, we discuss alternatives to EM which adapt Fisher's method of scoring (FS) and other methods for direct maximization of the incomplete data likelihood. Jacobi and Gauss-Seidel methods for non-linear optimization provide efficient algorithms applying FS in tomography. One approach uses smoothed projection data in its iterations. We investigate the convergence of Jacobi and Gauss-Seidel algorithms with clinical tomographic projection data.
Performance of a parallel thermal-hydraulics code TEMPEST
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fann, G.I.; Trent, D.S.
The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scalingmore » performance of 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) are presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on an 128 processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.« less
Analytic TOF PET reconstruction algorithm within DIRECT data partitioning framework
Matej, Samuel; Daube-Witherspoon, Margaret E.; Karp, Joel S.
2016-01-01
Iterative reconstruction algorithms are routinely used for clinical practice; however, analytic algorithms are relevant candidates for quantitative research studies due to their linear behavior. While iterative algorithms also benefit from the inclusion of accurate data and noise models the widespread use of TOF scanners with less sensitivity to noise and data imperfections make analytic algorithms even more promising. In our previous work we have developed a novel iterative reconstruction approach (Direct Image Reconstruction for TOF) providing convenient TOF data partitioning framework and leading to very efficient reconstructions. In this work we have expanded DIRECT to include an analytic TOF algorithm with confidence weighting incorporating models of both TOF and spatial resolution kernels. Feasibility studies using simulated and measured data demonstrate that analytic-DIRECT with appropriate resolution and regularization filters is able to provide matched bias vs. variance performance to iterative TOF reconstruction with a matched resolution model. PMID:27032968
Analytic TOF PET reconstruction algorithm within DIRECT data partitioning framework
NASA Astrophysics Data System (ADS)
Matej, Samuel; Daube-Witherspoon, Margaret E.; Karp, Joel S.
2016-05-01
Iterative reconstruction algorithms are routinely used for clinical practice; however, analytic algorithms are relevant candidates for quantitative research studies due to their linear behavior. While iterative algorithms also benefit from the inclusion of accurate data and noise models the widespread use of time-of-flight (TOF) scanners with less sensitivity to noise and data imperfections make analytic algorithms even more promising. In our previous work we have developed a novel iterative reconstruction approach (DIRECT: direct image reconstruction for TOF) providing convenient TOF data partitioning framework and leading to very efficient reconstructions. In this work we have expanded DIRECT to include an analytic TOF algorithm with confidence weighting incorporating models of both TOF and spatial resolution kernels. Feasibility studies using simulated and measured data demonstrate that analytic-DIRECT with appropriate resolution and regularization filters is able to provide matched bias versus variance performance to iterative TOF reconstruction with a matched resolution model.
Development of iterative techniques for the solution of unsteady compressible viscous flows
NASA Technical Reports Server (NTRS)
Sankar, Lakshmi N.; Hixon, Duane
1992-01-01
The development of efficient iterative solution methods for the numerical solution of two- and three-dimensional compressible Navier-Stokes equations is discussed. Iterative time marching methods have several advantages over classical multi-step explicit time marching schemes, and non-iterative implicit time marching schemes. Iterative schemes have better stability characteristics than non-iterative explicit and implicit schemes. In this work, another approach based on the classical conjugate gradient method, known as the Generalized Minimum Residual (GMRES) algorithm is investigated. The GMRES algorithm has been used in the past by a number of researchers for solving steady viscous and inviscid flow problems. Here, we investigate the suitability of this algorithm for solving the system of non-linear equations that arise in unsteady Navier-Stokes solvers at each time step.
Iterative channel decoding of FEC-based multiple-description codes.
Chang, Seok-Ho; Cosman, Pamela C; Milstein, Laurence B
2012-03-01
Multiple description coding has been receiving attention as a robust transmission framework for multimedia services. This paper studies the iterative decoding of FEC-based multiple description codes. The proposed decoding algorithms take advantage of the error detection capability of Reed-Solomon (RS) erasure codes. The information of correctly decoded RS codewords is exploited to enhance the error correction capability of the Viterbi algorithm at the next iteration of decoding. In the proposed algorithm, an intradescription interleaver is synergistically combined with the iterative decoder. The interleaver does not affect the performance of noniterative decoding but greatly enhances the performance when the system is iteratively decoded. We also address the optimal allocation of RS parity symbols for unequal error protection. For the optimal allocation in iterative decoding, we derive mathematical equations from which the probability distributions of description erasures can be generated in a simple way. The performance of the algorithm is evaluated over an orthogonal frequency-division multiplexing system. The results show that the performance of the multiple description codes is significantly enhanced.
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends ApacheTM Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in ApacheTM Hadoop for parallel computing on a cluster, by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4jTM and BreezeTM. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
Final Report, DE-FG01-06ER25718 Domain Decomposition and Parallel Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Widlund, Olof B.
2015-06-09
The goal of this project is to develop and improve domain decomposition algorithms for a variety of partial differential equations such as those of linear elasticity and electro-magnetics.These iterative methods are designed for massively parallel computing systems and allow the fast solution of the very large systems of algebraic equations that arise in large scale and complicated simulations. A special emphasis is placed on problems arising from Maxwell's equation. The approximate solvers, the preconditioners, are combined with the conjugate gradient method and must always include a solver of a coarse model in order to have a performance which is independentmore » of the number of processors used in the computer simulation. A recent development allows for an adaptive construction of this coarse component of the preconditioner.« less
Accelerating electron tomography reconstruction algorithm ICON with GPU.
Chen, Yu; Wang, Zihao; Zhang, Jingrong; Li, Lun; Wan, Xiaohua; Sun, Fei; Zhang, Fa
2017-01-01
Electron tomography (ET) plays an important role in studying in situ cell ultrastructure in three-dimensional space. Due to limited tilt angles, ET reconstruction always suffers from the "missing wedge" problem. With a validation procedure, iterative compressed-sensing optimized NUFFT reconstruction (ICON) demonstrates its power in the restoration of validated missing information for low SNR biological ET dataset. However, the huge computational demand has become a major problem for the application of ICON. In this work, we analyzed the framework of ICON and classified the operations of major steps of ICON reconstruction into three types. Accordingly, we designed parallel strategies and implemented them on graphics processing units (GPU) to generate a parallel program ICON-GPU. With high accuracy, ICON-GPU has a great acceleration compared to its CPU version, up to 83.7×, greatly relieving ICON's dependence on computing resource.
Compressively sampled MR image reconstruction using generalized thresholding iterative algorithm
NASA Astrophysics Data System (ADS)
Elahi, Sana; kaleem, Muhammad; Omer, Hammad
2018-01-01
Compressed sensing (CS) is an emerging area of interest in Magnetic Resonance Imaging (MRI). CS is used for the reconstruction of the images from a very limited number of samples in k-space. This significantly reduces the MRI data acquisition time. One important requirement for signal recovery in CS is the use of an appropriate non-linear reconstruction algorithm. It is a challenging task to choose a reconstruction algorithm that would accurately reconstruct the MR images from the under-sampled k-space data. Various algorithms have been used to solve the system of non-linear equations for better image quality and reconstruction speed in CS. In the recent past, iterative soft thresholding algorithm (ISTA) has been introduced in CS-MRI. This algorithm directly cancels the incoherent artifacts produced because of the undersampling in k -space. This paper introduces an improved iterative algorithm based on p -thresholding technique for CS-MRI image reconstruction. The use of p -thresholding function promotes sparsity in the image which is a key factor for CS based image reconstruction. The p -thresholding based iterative algorithm is a modification of ISTA, and minimizes non-convex functions. It has been shown that the proposed p -thresholding iterative algorithm can be used effectively to recover fully sampled image from the under-sampled data in MRI. The performance of the proposed method is verified using simulated and actual MRI data taken at St. Mary's Hospital, London. The quality of the reconstructed images is measured in terms of peak signal-to-noise ratio (PSNR), artifact power (AP), and structural similarity index measure (SSIM). The proposed approach shows improved performance when compared to other iterative algorithms based on log thresholding, soft thresholding and hard thresholding techniques at different reduction factors.
Trace: a high-throughput tomographic reconstruction engine for large-scale datasets.
Bicer, Tekin; Gürsoy, Doğa; Andrade, Vincent De; Kettimuthu, Rajkumar; Scullin, William; Carlo, Francesco De; Foster, Ian T
2017-01-01
Modern synchrotron light sources and detectors produce data at such scale and complexity that large-scale computation is required to unleash their full power. One of the widely used imaging techniques that generates data at tens of gigabytes per second is computed tomography (CT). Although CT experiments result in rapid data generation, the analysis and reconstruction of the collected data may require hours or even days of computation time with a medium-sized workstation, which hinders the scientific progress that relies on the results of analysis. We present Trace, a data-intensive computing engine that we have developed to enable high-performance implementation of iterative tomographic reconstruction algorithms for parallel computers. Trace provides fine-grained reconstruction of tomography datasets using both (thread-level) shared memory and (process-level) distributed memory parallelization. Trace utilizes a special data structure called replicated reconstruction object to maximize application performance. We also present the optimizations that we apply to the replicated reconstruction objects and evaluate them using tomography datasets collected at the Advanced Photon Source. Our experimental evaluations show that our optimizations and parallelization techniques can provide 158× speedup using 32 compute nodes (384 cores) over a single-core configuration and decrease the end-to-end processing time of a large sinogram (with 4501 × 1 × 22,400 dimensions) from 12.5 h to <5 min per iteration. The proposed tomographic reconstruction engine can efficiently process large-scale tomographic data using many compute nodes and minimize reconstruction times.
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Kwan-Liu
Most of today’s visualization libraries and applications are based off of what is known today as the visualization pipeline. In the visualization pipeline model, algorithms are encapsulated as “filtering” components with inputs and outputs. These components can be combined by connecting the outputs of one filter to the inputs of another filter. The visualization pipeline model is popular because it provides a convenient abstraction that allows users to combine algorithms in powerful ways. Unfortunately, the visualization pipeline cannot run effectively on exascale computers. Experts agree that the exascale machine will comprise processors that contain many cores. Furthermore, physical limitations willmore » prevent data movement in and out of the chip (that is, between main memory and the processing cores) from keeping pace with improvements in overall compute performance. To use these processors to their fullest capability, it is essential to carefully consider memory access. This is where the visualization pipeline fails. Each filtering component in the visualization library is expected to take a data set in its entirety, perform some computation across all of the elements, and output the complete results. The process of iterating over all elements must be repeated in each filter, which is one of the worst possible ways to traverse memory when trying to maximize the number of executions per memory access. This project investigates a new type of visualization framework that exhibits a pervasive parallelism necessary to run on exascale machines. Our framework achieves this by defining algorithms in terms of functors, which are localized, stateless operations. Functors can be composited in much the same way as filters in the visualization pipeline. But, functors’ design allows them to be concurrently running on massive amounts of lightweight threads. Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale computer. This project concludes with a functional prototype containing pervasively parallel algorithms that perform demonstratively well on many-core processors. These algorithms are fundamental for performing data analysis and visualization at extreme scale.« less
Sinogram-based adaptive iterative reconstruction for sparse view x-ray computed tomography
NASA Astrophysics Data System (ADS)
Trinca, D.; Zhong, Y.; Wang, Y.-Z.; Mamyrbayev, T.; Libin, E.
2016-10-01
With the availability of more powerful computing processors, iterative reconstruction algorithms have recently been successfully implemented as an approach to achieving significant dose reduction in X-ray CT. In this paper, we propose an adaptive iterative reconstruction algorithm for X-ray CT, that is shown to provide results comparable to those obtained by proprietary algorithms, both in terms of reconstruction accuracy and execution time. The proposed algorithm is thus provided for free to the scientific community, for regular use, and for possible further optimization.
Leapfrog variants of iterative methods for linear algebra equations
NASA Technical Reports Server (NTRS)
Saylor, Paul E.
1988-01-01
Two iterative methods are considered, Richardson's method and a general second order method. For both methods, a variant of the method is derived for which only even numbered iterates are computed. The variant is called a leapfrog method. Comparisons between the conventional form of the methods and the leapfrog form are made under the assumption that the number of unknowns is large. In the case of Richardson's method, it is possible to express the final iterate in terms of only the initial approximation, a variant of the iteration called the grand-leap method. In the case of the grand-leap variant, a set of parameters is required. An algorithm is presented to compute these parameters that is related to algorithms to compute the weights and abscissas for Gaussian quadrature. General algorithms to implement the leapfrog and grand-leap methods are presented. Algorithms for the important special case of the Chebyshev method are also given.
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, Xuanhua; Luo, Xuan; Liang, Junling
GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. As such, coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weightmore » asynchronous processing framework called Frog with a preprocessing/hybrid coloring model. The fundamental idea is based on Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of realworld graph coloring cases. We find that a majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution separates the processing of the vertices based on the distribution of colors. In this work, we mainly answer three questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs that cannot fit into GPU memory, and (3) how to reduce the overhead of data transfers on PCIe while processing each partition. We conduct experiments on real-world data (Amazon, DBLP, YouTube, RoadNet-CA, WikiTalk and Twitter) to evaluate our approach and make comparisons with well-known non-preprocessed (such as Totem, Medusa, MapGraph and Gunrock) and preprocessed (Cusha) approaches, by testing four classical algorithms (BFS, PageRank, SSSP and CC). On all the tested applications and datasets, Frog is able to significantly outperform existing GPU-based graph processing systems except Gunrock and MapGraph. MapGraph gets better performance than Frog when running BFS on RoadNet-CA. The comparison between Gunrock and Frog is inconclusive. Frog can outperform Gunrock more than 1.04X when running PageRank and SSSP, while the advantage of Frog is not obvious when running BFS and CC on some datasets especially for RoadNet-CA.« less
PISCES: An environment for parallel scientific computation
NASA Technical Reports Server (NTRS)
Pratt, T. W.
1985-01-01
The parallel implementation of scientific computing environment (PISCES) is a project to provide high-level programming environments for parallel MIMD computers. Pisces 1, the first of these environments, is a FORTRAN 77 based environment which runs under the UNIX operating system. The Pisces 1 user programs in Pisces FORTRAN, an extension of FORTRAN 77 for parallel processing. The major emphasis in the Pisces 1 design is in providing a carefully specified virtual machine that defines the run-time environment within which Pisces FORTRAN programs are executed. Each implementation then provides the same virtual machine, regardless of differences in the underlying architecture. The design is intended to be portable to a variety of architectures. Currently Pisces 1 is implemented on a network of Apollo workstations and on a DEC VAX uniprocessor via simulation of the task level parallelism. An implementation for the Flexible Computing Corp. FLEX/32 is under construction. An introduction to the Pisces 1 virtual computer and the FORTRAN 77 extensions is presented. An example of an algorithm for the iterative solution of a system of equations is given. The most notable features of the design are the provision for several granularities of parallelism in programs and the provision of a window mechanism for distributed access to large arrays of data.
AZTEC: A parallel iterative package for the solving linear systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hutchinson, S.A.; Shadid, J.N.; Tuminaro, R.S.
1996-12-31
We describe a parallel linear system package, AZTEC. The package incorporates a number of parallel iterative methods (e.g. GMRES, biCGSTAB, CGS, TFQMR) and preconditioners (e.g. Jacobi, Gauss-Seidel, polynomial, domain decomposition with LU or ILU within subdomains). Additionally, AZTEC allows for the reuse of previous preconditioning factorizations within Newton schemes for nonlinear methods. Currently, a number of different users are using this package to solve a variety of PDE applications.
Kim, Kwangdon; Lee, Kisung; Lee, Hakjae; Joo, Sungkwan; Kang, Jungwon
2018-01-01
We aimed to develop a gap-filling algorithm, in particular the filter mask design method of the algorithm, which optimizes the filter to the imaging object by an adaptive and iterative process, rather than by manual means. Two numerical phantoms (Shepp-Logan and Jaszczak) were used for sinogram generation. The algorithm works iteratively, not only on the gap-filling iteration but also on the mask generation, to identify the object-dedicated low frequency area in the DCT-domain that is to be preserved. We redefine the low frequency preserving region of the filter mask at every gap-filling iteration, and the region verges on the property of the original image in the DCT domain. The previous DCT2 mask for each phantom case had been manually well optimized, and the results show little difference from the reference image and sinogram. We observed little or no difference between the results of the manually optimized DCT2 algorithm and those of the proposed algorithm. The proposed algorithm works well for various types of scanning object and shows results that compare to those of the manually optimized DCT2 algorithm without perfect or full information of the imaging object.
2014-06-02
2011). [22] Li, Q., Micchelli, C., Shen, L., and Xu, Y. A proximity algorithm acelerated by Gauss - Seidel iterations for L1/TV denoising models. Inverse...system of equations and their relationship to the solution of Model (2) and present an algorithm with an iterative approach for finding these solutions...Using the fixed-point characterization above, the (k + 1)th iteration of the prox- imity operator algorithm to find the solution of the Dantzig
Liu, Derong; Li, Hongliang; Wang, Ding
2015-06-01
In this paper, we establish error bounds of adaptive dynamic programming algorithms for solving undiscounted infinite-horizon optimal control problems of discrete-time deterministic nonlinear systems. We consider approximation errors in the update equations of both value function and control policy. We utilize a new assumption instead of the contraction assumption in discounted optimal control problems. We establish the error bounds for approximate value iteration based on a new error condition. Furthermore, we also establish the error bounds for approximate policy iteration and approximate optimistic policy iteration algorithms. It is shown that the iterative approximate value function can converge to a finite neighborhood of the optimal value function under some conditions. To implement the developed algorithms, critic and action neural networks are used to approximate the value function and control policy, respectively. Finally, a simulation example is given to demonstrate the effectiveness of the developed algorithms.
Domain decomposition algorithms and computation fluid dynamics
NASA Technical Reports Server (NTRS)
Chan, Tony F.
1988-01-01
In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.
Calibration for single multi-mode fiber digital scanning microscopy imaging system
NASA Astrophysics Data System (ADS)
Yin, Zhe; Liu, Guodong; Liu, Bingguo; Gan, Yu; Zhuang, Zhitao; Chen, Fengdong
2015-11-01
Single multimode fiber (MMF) digital scanning imaging system is a development tendency of modern endoscope. We concentrate on the calibration method of the imaging system. Calibration method comprises two processes, forming scanning focused spots and calibrating the couple factors varied with positions. Adaptive parallel coordinate algorithm (APC) is adopted to form the focused spots at the multimode fiber (MMF) output. Compare with other algorithm, APC contains many merits, i.e. rapid speed, small amount calculations and no iterations. The ratio of the optics power captured by MMF to the intensity of the focused spots is called couple factor. We setup the calibration experimental system to form the scanning focused spots and calculate the couple factors for different object positions. The experimental result the couple factor is higher in the center than the edge.
Amesos2 and Belos: Direct and Iterative Solvers for Large Sparse Linear Systems
Bavier, Eric; Hoemmen, Mark; Rajamanickam, Sivasankaran; ...
2012-01-01
Solvers for large sparse linear systems come in two categories: direct and iterative. Amesos2, a package in the Trilinos software project, provides direct methods, and Belos, another Trilinos package, provides iterative methods. Amesos2 offers a common interface to many different sparse matrix factorization codes, and can handle any implementation of sparse matrices and vectors, via an easy-to-extend C++ traits interface. It can also factor matrices whose entries have arbitrary “Scalar” type, enabling extended-precision and mixed-precision algorithms. Belos includes many different iterative methods for solving large sparse linear systems and least-squares problems. Unlike competing iterative solver libraries, Belos completely decouples themore » algorithms from the implementations of the underlying linear algebra objects. This lets Belos exploit the latest hardware without changes to the code. Belos favors algorithms that solve higher-level problems, such as multiple simultaneous linear systems and sequences of related linear systems, faster than standard algorithms. The package also supports extended-precision and mixed-precision algorithms. Together, Amesos2 and Belos form a complete suite of sparse linear solvers.« less
Towards the Experimental Assessment of the DQE in SPECT Scanners
NASA Astrophysics Data System (ADS)
Fountos, G. P.; Michail, C. M.
2017-11-01
The purpose of this work was to introduce the Detective Quantum Efficiency (DQE) in single photon emission computed tomography (SPECT) systems using a flood source. A Tc-99m-based flood source (Eγ = 140 keV) consisting of a radiopharmaceutical solution of dithiothreitol (DTT, 10-3 M)/Tc-99m(III)-DMSA, 40 mCi/40 ml bound to the grains of an Agfa MammoRay HDR Medical X-ray film) was prepared in laboratory. The source was placed between two PMMA blocks and images were obtained by using the brain tomographic acquisition protocol (DatScan-brain). The Modulation Transfer Function (MTF) was evaluated using the Iterative 2D algorithm. All imaging experiments were performed in a Siemens e-Cam gamma camera. The Normalized Noise Power spectra (NNPS) were obtained from the sagittal views of the source. The higher MTF values were obtained for the Flash Iterative 2D with 24 iterations and 20 subsets. The noise levels of the SPECT reconstructed images, in terms of the NNPS, were found to increase as the number of iterations increase. The behavior of the DQE was influenced by both MTF and NNPS. As the number of iterations was increased, higher MTF values were obtained, however with a parallel, increase of magnitude in image noise, as depicted from the NNPS results. DQE values, which were influenced by both MTF and NNPS, were found higher when the number of iterations results in resolution saturation. The method presented here is novel and easy to implement, requiring materials commonly found in clinical practice and can be useful in the quality control of SPECT scanners.
de Lima, Camila; Salomão Helou, Elias
2018-01-01
Iterative methods for tomographic image reconstruction have the computational cost of each iteration dominated by the computation of the (back)projection operator, which take roughly O(N 3 ) floating point operations (flops) for N × N pixels images. Furthermore, classical iterative algorithms may take too many iterations in order to achieve acceptable images, thereby making the use of these techniques unpractical for high-resolution images. Techniques have been developed in the literature in order to reduce the computational cost of the (back)projection operator to O(N 2 logN) flops. Also, incremental algorithms have been devised that reduce by an order of magnitude the number of iterations required to achieve acceptable images. The present paper introduces an incremental algorithm with a cost of O(N 2 logN) flops per iteration and applies it to the reconstruction of very large tomographic images obtained from synchrotron light illuminated data.
Application of the perturbation iteration method to boundary layer type problems.
Pakdemirli, Mehmet
2016-01-01
The recently developed perturbation iteration method is applied to boundary layer type singular problems for the first time. As a preliminary work on the topic, the simplest algorithm of PIA(1,1) is employed in the calculations. Linear and nonlinear problems are solved to outline the basic ideas of the new solution technique. The inner and outer solutions are determined with the iteration algorithm and matched to construct a composite expansion valid within all parts of the domain. The solutions are contrasted with the available exact or numerical solutions. It is shown that the perturbation-iteration algorithm can be effectively used for solving boundary layer type problems.
Tomography by iterative convolution - Empirical study and application to interferometry
NASA Technical Reports Server (NTRS)
Vest, C. M.; Prikryl, I.
1984-01-01
An algorithm for computer tomography has been developed that is applicable to reconstruction from data having incomplete projections because an opaque object blocks some of the probing radiation as it passes through the object field. The algorithm is based on iteration between the object domain and the projection (Radon transform) domain. Reconstructions are computed during each iteration by the well-known convolution method. Although it is demonstrated that this algorithm does not converge, an empirically justified criterion for terminating the iteration when the most accurate estimate has been computed is presented. The algorithm has been studied by using it to reconstruct several different object fields with several different opaque regions. It also has been used to reconstruct aerodynamic density fields from interferometric data recorded in wind tunnel tests.
A novel iterative scheme and its application to differential equations.
Khan, Yasir; Naeem, F; Šmarda, Zdeněk
2014-01-01
The purpose of this paper is to employ an alternative approach to reconstruct the standard variational iteration algorithm II proposed by He, including Lagrange multiplier, and to give a simpler formulation of Adomian decomposition and modified Adomian decomposition method in terms of newly proposed variational iteration method-II (VIM). Through careful investigation of the earlier variational iteration algorithm and Adomian decomposition method, we find unnecessary calculations for Lagrange multiplier and also repeated calculations involved in each iteration, respectively. Several examples are given to verify the reliability and efficiency of the method.
Dang, C; Xu, L
2001-03-01
In this paper a globally convergent Lagrange and barrier function iterative algorithm is proposed for approximating a solution of the traveling salesman problem. The algorithm employs an entropy-type barrier function to deal with nonnegativity constraints and Lagrange multipliers to handle linear equality constraints, and attempts to produce a solution of high quality by generating a minimum point of a barrier problem for a sequence of descending values of the barrier parameter. For any given value of the barrier parameter, the algorithm searches for a minimum point of the barrier problem in a feasible descent direction, which has a desired property that the nonnegativity constraints are always satisfied automatically if the step length is a number between zero and one. At each iteration the feasible descent direction is found by updating Lagrange multipliers with a globally convergent iterative procedure. For any given value of the barrier parameter, the algorithm converges to a stationary point of the barrier problem without any condition on the objective function. Theoretical and numerical results show that the algorithm seems more effective and efficient than the softassign algorithm.
Laplace-domain waveform modeling and inversion for the 3D acoustic-elastic coupled media
NASA Astrophysics Data System (ADS)
Shin, Jungkyun; Shin, Changsoo; Calandra, Henri
2016-06-01
Laplace-domain waveform inversion reconstructs long-wavelength subsurface models by using the zero-frequency component of damped seismic signals. Despite the computational advantages of Laplace-domain waveform inversion over conventional frequency-domain waveform inversion, an acoustic assumption and an iterative matrix solver have been used to invert 3D marine datasets to mitigate the intensive computing cost. In this study, we develop a Laplace-domain waveform modeling and inversion algorithm for 3D acoustic-elastic coupled media by using a parallel sparse direct solver library (MUltifrontal Massively Parallel Solver, MUMPS). We precisely simulate a real marine environment by coupling the 3D acoustic and elastic wave equations with the proper boundary condition at the fluid-solid interface. In addition, we can extract the elastic properties of the Earth below the sea bottom from the recorded acoustic pressure datasets. As a matrix solver, the parallel sparse direct solver is used to factorize the non-symmetric impedance matrix in a distributed memory architecture and rapidly solve the wave field for a number of shots by using the lower and upper matrix factors. Using both synthetic datasets and real datasets obtained by a 3D wide azimuth survey, the long-wavelength component of the P-wave and S-wave velocity models is reconstructed and the proposed modeling and inversion algorithm are verified. A cluster of 80 CPU cores is used for this study.
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-03-28
Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation.
Rodríguez, Manuel; Magdaleno, Eduardo; Pérez, Fernando; García, Cristhian
2017-01-01
Non-equispaced Fast Fourier transform (NFFT) is a very important algorithm in several technological and scientific areas such as synthetic aperture radar, computational photography, medical imaging, telecommunications, seismic analysis and so on. However, its computation complexity is high. In this paper, we describe an efficient NFFT implementation with a hardware coprocessor using an All-Programmable System-on-Chip (APSoC). This is a hybrid device that employs an Advanced RISC Machine (ARM) as Processing System with Programmable Logic for high-performance digital signal processing through parallelism and pipeline techniques. The algorithm has been coded in C language with pragma directives to optimize the architecture of the system. We have used the very novel Software Develop System-on-Chip (SDSoC) evelopment tool that simplifies the interface and partitioning between hardware and software. This provides shorter development cycles and iterative improvements by exploring several architectures of the global system. The computational results shows that hardware acceleration significantly outperformed the software based implementation. PMID:28350358
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chacon, Luis; Stanier, Adam John
Here, we demonstrate a scalable fully implicit algorithm for the two-field low-β extended MHD model. This reduced model describes plasma behavior in the presence of strong guide fields, and is of significant practical impact both in nature and in laboratory plasmas. The model displays strong hyperbolic behavior, as manifested by the presence of fast dispersive waves, which make a fully implicit treatment very challenging. In this study, we employ a Jacobian-free Newton–Krylov nonlinear solver, for which we propose a physics-based preconditioner that renders the linearized set of equations suitable for inversion with multigrid methods. As a result, the algorithm ismore » shown to scale both algorithmically (i.e., the iteration count is insensitive to grid refinement and timestep size) and in parallel in a weak-scaling sense, with the wall-clock time scaling weakly with the number of cores for up to 4096 cores. For a 4096 × 4096 mesh, we demonstrate a wall-clock-time speedup of ~6700 with respect to explicit algorithms. The model is validated linearly (against linear theory predictions) and nonlinearly (against fully kinetic simulations), demonstrating excellent agreement.« less
Parabolized Navier-Stokes Code for Computing Magneto-Hydrodynamic Flowfields
NASA Technical Reports Server (NTRS)
Mehta, Unmeel B. (Technical Monitor); Tannehill, J. C.
2003-01-01
This report consists of two published papers, 'Computation of Magnetohydrodynamic Flows Using an Iterative PNS Algorithm' and 'Numerical Simulation of Turbulent MHD Flows Using an Iterative PNS Algorithm'.
NASA Astrophysics Data System (ADS)
Feigin, G.; Ben-Yosef, N.
1983-10-01
A thinning algorithm, of the banana-peel type, is presented. In each iteration pixels are attacked from all directions (there are no sub-iterations), and the deletion criteria depend on the 24 nearest neighbours.
NASA Astrophysics Data System (ADS)
Cheng, Sheng-Yi; Liu, Wen-Jin; Chen, Shan-Qiu; Dong, Li-Zhi; Yang, Ping; Xu, Bing
2015-08-01
Among all kinds of wavefront control algorithms in adaptive optics systems, the direct gradient wavefront control algorithm is the most widespread and common method. This control algorithm obtains the actuator voltages directly from wavefront slopes through pre-measuring the relational matrix between deformable mirror actuators and Hartmann wavefront sensor with perfect real-time characteristic and stability. However, with increasing the number of sub-apertures in wavefront sensor and deformable mirror actuators of adaptive optics systems, the matrix operation in direct gradient algorithm takes too much time, which becomes a major factor influencing control effect of adaptive optics systems. In this paper we apply an iterative wavefront control algorithm to high-resolution adaptive optics systems, in which the voltages of each actuator are obtained through iteration arithmetic, which gains great advantage in calculation and storage. For AO system with thousands of actuators, the computational complexity estimate is about O(n2) ˜ O(n3) in direct gradient wavefront control algorithm, while the computational complexity estimate in iterative wavefront control algorithm is about O(n) ˜ (O(n)3/2), in which n is the number of actuators of AO system. And the more the numbers of sub-apertures and deformable mirror actuators, the more significant advantage the iterative wavefront control algorithm exhibits. Project supported by the National Key Scientific and Research Equipment Development Project of China (Grant No. ZDYZ2013-2), the National Natural Science Foundation of China (Grant No. 11173008), and the Sichuan Provincial Outstanding Youth Academic Technology Leaders Program, China (Grant No. 2012JQ0012).
Iterative methods for mixed finite element equations
NASA Technical Reports Server (NTRS)
Nakazawa, S.; Nagtegaal, J. C.; Zienkiewicz, O. C.
1985-01-01
Iterative strategies for the solution of indefinite system of equations arising from the mixed finite element method are investigated in this paper with application to linear and nonlinear problems in solid and structural mechanics. The augmented Hu-Washizu form is derived, which is then utilized to construct a family of iterative algorithms using the displacement method as the preconditioner. Two types of iterative algorithms are implemented. Those are: constant metric iterations which does not involve the update of preconditioner; variable metric iterations, in which the inverse of the preconditioning matrix is updated. A series of numerical experiments is conducted to evaluate the numerical performance with application to linear and nonlinear model problems.
Filtered gradient reconstruction algorithm for compressive spectral imaging
NASA Astrophysics Data System (ADS)
Mejia, Yuri; Arguello, Henry
2017-04-01
Compressive sensing matrices are traditionally based on random Gaussian and Bernoulli entries. Nevertheless, they are subject to physical constraints, and their structure unusually follows a dense matrix distribution, such as the case of the matrix related to compressive spectral imaging (CSI). The CSI matrix represents the integration of coded and shifted versions of the spectral bands. A spectral image can be recovered from CSI measurements by using iterative algorithms for linear inverse problems that minimize an objective function including a quadratic error term combined with a sparsity regularization term. However, current algorithms are slow because they do not exploit the structure and sparse characteristics of the CSI matrices. A gradient-based CSI reconstruction algorithm, which introduces a filtering step in each iteration of a conventional CSI reconstruction algorithm that yields improved image quality, is proposed. Motivated by the structure of the CSI matrix, Φ, this algorithm modifies the iterative solution such that it is forced to converge to a filtered version of the residual ΦTy, where y is the compressive measurement vector. We show that the filtered-based algorithm converges to better quality performance results than the unfiltered version. Simulation results highlight the relative performance gain over the existing iterative algorithms.
Wavelet-based edge correlation incorporated iterative reconstruction for undersampled MRI.
Hu, Changwei; Qu, Xiaobo; Guo, Di; Bao, Lijun; Chen, Zhong
2011-09-01
Undersampling k-space is an effective way to decrease acquisition time for MRI. However, aliasing artifacts introduced by undersampling may blur the edges of magnetic resonance images, which often contain important information for clinical diagnosis. Moreover, k-space data is often contaminated by the noise signals of unknown intensity. To better preserve the edge features while suppressing the aliasing artifacts and noises, we present a new wavelet-based algorithm for undersampled MRI reconstruction. The algorithm solves the image reconstruction as a standard optimization problem including a ℓ(2) data fidelity term and ℓ(1) sparsity regularization term. Rather than manually setting the regularization parameter for the ℓ(1) term, which is directly related to the threshold, an automatic estimated threshold adaptive to noise intensity is introduced in our proposed algorithm. In addition, a prior matrix based on edge correlation in wavelet domain is incorporated into the regularization term. Compared with nonlinear conjugate gradient descent algorithm, iterative shrinkage/thresholding algorithm, fast iterative soft-thresholding algorithm and the iterative thresholding algorithm using exponentially decreasing threshold, the proposed algorithm yields reconstructions with better edge recovery and noise suppression. Copyright © 2011 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Lezina, Natalya; Agoshkov, Valery
2017-04-01
Domain decomposition method (DDM) allows one to present a domain with complex geometry as a set of essentially simpler subdomains. This method is particularly applied for the hydrodynamics of oceans and seas. In each subdomain the system of thermo-hydrodynamic equations in the Boussinesq and hydrostatic approximations is solved. The problem of obtaining solution in the whole domain is that it is necessary to combine solutions in subdomains. For this purposes iterative algorithm is created and numerical experiments are conducted to investigate an effectiveness of developed algorithm using DDM. For symmetric operators in DDM, Poincare-Steklov's operators [1] are used, but for the problems of the hydrodynamics, it is not suitable. In this case for the problem, adjoint equation method [2] and inverse problem theory are used. In addition, it is possible to create algorithms for the parallel calculations using DDM on multiprocessor computer system. DDM for the model of the Baltic Sea dynamics is numerically studied. The results of numerical experiments using DDM are compared with the solution of the system of hydrodynamic equations in the whole domain. The work was supported by the Russian Science Foundation (project 14-11-00609, the formulation of the iterative process and numerical experiments). [1] V.I. Agoshkov, Domain Decompositions Methods in the Mathematical Physics Problem // Numerical processes and systems, No 8, Moscow, 1991 (in Russian). [2] V.I. Agoshkov, Optimal Control Approaches and Adjoint Equations in the Mathematical Physics Problem, Institute of Numerical Mathematics, RAS, Moscow, 2003 (in Russian).
NASA Astrophysics Data System (ADS)
Ku, Seung-Hoe; Hager, R.; Chang, C. S.; Chacon, L.; Chen, G.; EPSI Team
2016-10-01
The cancelation problem has been a long-standing issue for long wavelengths modes in electromagnetic gyrokinetic PIC simulations in toroidal geometry. As an attempt of resolving this issue, we implemented a fully implicit time integration scheme in the full-f, gyrokinetic PIC code XGC1. The new scheme - based on the implicit Vlasov-Darwin PIC algorithm by G. Chen and L. Chacon - can potentially resolve cancelation problem. The time advance for the field and the particle equations is space-time-centered, with particle sub-cycling. The resulting system of equations is solved by a Picard iteration solver with fixed-point accelerator. The algorithm is implemented in the parallel velocity formalism instead of the canonical parallel momentum formalism. XGC1 specializes in simulating the tokamak edge plasma with magnetic separatrix geometry. A fully implicit scheme could be a way to accurate and efficient gyrokinetic simulations. We will test if this numerical scheme overcomes the cancelation problem, and reproduces the dispersion relation of Alfven waves and tearing modes in cylindrical geometry. Funded by US DOE FES and ASCR, and computing resources provided by OLCF through ALCC.
Sparse matrix-vector multiplication on network-on-chip
NASA Astrophysics Data System (ADS)
Sun, C.-C.; Götze, J.; Jheng, H.-Y.; Ruan, S.-J.
2010-12-01
In this paper, we present an idea for performing matrix-vector multiplication by using Network-on-Chip (NoC) architecture. In traditional IC design on-chip communications have been designed with dedicated point-to-point interconnections. Therefore, regular local data transfer is the major concept of many parallel implementations. However, when dealing with the parallel implementation of sparse matrix-vector multiplication (SMVM), which is the main step of all iterative algorithms for solving systems of linear equation, the required data transfers depend on the sparsity structure of the matrix and can be extremely irregular. Using the NoC architecture makes it possible to deal with arbitrary structure of the data transfers; i.e. with the irregular structure of the sparse matrices. So far, we have already implemented the proposed SMVM-NoC architecture with the size 4×4 and 5×5 in IEEE 754 single float point precision using FPGA.
Ramani, Sathish; Liu, Zhihao; Rosen, Jeffrey; Nielsen, Jon-Fredrik; Fessler, Jeffrey A.
2012-01-01
Regularized iterative reconstruction algorithms for imaging inverse problems require selection of appropriate regularization parameter values. We focus on the challenging problem of tuning regularization parameters for nonlinear algorithms for the case of additive (possibly complex) Gaussian noise. Generalized cross-validation (GCV) and (weighted) mean-squared error (MSE) approaches (based on Stein's Unbiased Risk Estimate— SURE) need the Jacobian matrix of the nonlinear reconstruction operator (representative of the iterative algorithm) with respect to the data. We derive the desired Jacobian matrix for two types of nonlinear iterative algorithms: a fast variant of the standard iterative reweighted least-squares method and the contemporary split-Bregman algorithm, both of which can accommodate a wide variety of analysis- and synthesis-type regularizers. The proposed approach iteratively computes two weighted SURE-type measures: Predicted-SURE and Projected-SURE (that require knowledge of noise variance σ2), and GCV (that does not need σ2) for these algorithms. We apply the methods to image restoration and to magnetic resonance image (MRI) reconstruction using total variation (TV) and an analysis-type ℓ1-regularization. We demonstrate through simulations and experiments with real data that minimizing Predicted-SURE and Projected-SURE consistently lead to near-MSE-optimal reconstructions. We also observed that minimizing GCV yields reconstruction results that are near-MSE-optimal for image restoration and slightly sub-optimal for MRI. Theoretical derivations in this work related to Jacobian matrix evaluations can be extended, in principle, to other types of regularizers and reconstruction algorithms. PMID:22531764
Survey on the Performance of Source Localization Algorithms.
Fresno, José Manuel; Robles, Guillermo; Martínez-Tarifa, Juan Manuel; Stewart, Brian G
2017-11-18
The localization of emitters using an array of sensors or antennas is a prevalent issue approached in several applications. There exist different techniques for source localization, which can be classified into multilateration, received signal strength (RSS) and proximity methods. The performance of multilateration techniques relies on measured time variables: the time of flight (ToF) of the emission from the emitter to the sensor, the time differences of arrival (TDoA) of the emission between sensors and the pseudo-time of flight (pToF) of the emission to the sensors. The multilateration algorithms presented and compared in this paper can be classified as iterative and non-iterative methods. Both standard least squares (SLS) and hyperbolic least squares (HLS) are iterative and based on the Newton-Raphson technique to solve the non-linear equation system. The metaheuristic technique particle swarm optimization (PSO) used for source localisation is also studied. This optimization technique estimates the source position as the optimum of an objective function based on HLS and is also iterative in nature. Three non-iterative algorithms, namely the hyperbolic positioning algorithms (HPA), the maximum likelihood estimator (MLE) and Bancroft algorithm, are also presented. A non-iterative combined algorithm, MLE-HLS, based on MLE and HLS, is further proposed in this paper. The performance of all algorithms is analysed and compared in terms of accuracy in the localization of the position of the emitter and in terms of computational time. The analysis is also undertaken with three different sensor layouts since the positions of the sensors affect the localization; several source positions are also evaluated to make the comparison more robust. The analysis is carried out using theoretical time differences, as well as including errors due to the effect of digital sampling of the time variables. It is shown that the most balanced algorithm, yielding better results than the other algorithms in terms of accuracy and short computational time, is the combined MLE-HLS algorithm.
Survey on the Performance of Source Localization Algorithms
2017-01-01
The localization of emitters using an array of sensors or antennas is a prevalent issue approached in several applications. There exist different techniques for source localization, which can be classified into multilateration, received signal strength (RSS) and proximity methods. The performance of multilateration techniques relies on measured time variables: the time of flight (ToF) of the emission from the emitter to the sensor, the time differences of arrival (TDoA) of the emission between sensors and the pseudo-time of flight (pToF) of the emission to the sensors. The multilateration algorithms presented and compared in this paper can be classified as iterative and non-iterative methods. Both standard least squares (SLS) and hyperbolic least squares (HLS) are iterative and based on the Newton–Raphson technique to solve the non-linear equation system. The metaheuristic technique particle swarm optimization (PSO) used for source localisation is also studied. This optimization technique estimates the source position as the optimum of an objective function based on HLS and is also iterative in nature. Three non-iterative algorithms, namely the hyperbolic positioning algorithms (HPA), the maximum likelihood estimator (MLE) and Bancroft algorithm, are also presented. A non-iterative combined algorithm, MLE-HLS, based on MLE and HLS, is further proposed in this paper. The performance of all algorithms is analysed and compared in terms of accuracy in the localization of the position of the emitter and in terms of computational time. The analysis is also undertaken with three different sensor layouts since the positions of the sensors affect the localization; several source positions are also evaluated to make the comparison more robust. The analysis is carried out using theoretical time differences, as well as including errors due to the effect of digital sampling of the time variables. It is shown that the most balanced algorithm, yielding better results than the other algorithms in terms of accuracy and short computational time, is the combined MLE-HLS algorithm. PMID:29156565
Neural Generalized Predictive Control: A Newton-Raphson Implementation
NASA Technical Reports Server (NTRS)
Soloway, Donald; Haley, Pamela J.
1997-01-01
An efficient implementation of Generalized Predictive Control using a multi-layer feedforward neural network as the plant's nonlinear model is presented. In using Newton-Raphson as the optimization algorithm, the number of iterations needed for convergence is significantly reduced from other techniques. The main cost of the Newton-Raphson algorithm is in the calculation of the Hessian, but even with this overhead the low iteration numbers make Newton-Raphson faster than other techniques and a viable algorithm for real-time control. This paper presents a detailed derivation of the Neural Generalized Predictive Control algorithm with Newton-Raphson as the minimization algorithm. Simulation results show convergence to a good solution within two iterations and timing data show that real-time control is possible. Comments about the algorithm's implementation are also included.
Improved interpretation of satellite altimeter data using genetic algorithms
NASA Technical Reports Server (NTRS)
Messa, Kenneth; Lybanon, Matthew
1992-01-01
Genetic algorithms (GA) are optimization techniques that are based on the mechanics of evolution and natural selection. They take advantage of the power of cumulative selection, in which successive incremental improvements in a solution structure become the basis for continued development. A GA is an iterative procedure that maintains a 'population' of 'organisms' (candidate solutions). Through successive 'generations' (iterations) the population as a whole improves in simulation of Darwin's 'survival of the fittest'. GA's have been shown to be successful where noise significantly reduces the ability of other search techniques to work effectively. Satellite altimetry provides useful information about oceanographic phenomena. It provides rapid global coverage of the oceans and is not as severely hampered by cloud cover as infrared imagery. Despite these and other benefits, several factors lead to significant difficulty in interpretation. The GA approach to the improved interpretation of satellite data involves the representation of the ocean surface model as a string of parameters or coefficients from the model. The GA searches in parallel, a population of such representations (organisms) to obtain the individual that is best suited to 'survive', that is, the fittest as measured with respect to some 'fitness' function. The fittest organism is the one that best represents the ocean surface model with respect to the altimeter data.
Joint design of large-tip-angle parallel RF pulses and blipped gradient trajectories.
Cao, Zhipeng; Donahue, Manus J; Ma, Jun; Grissom, William A
2016-03-01
To design multichannel large-tip-angle kT-points and spokes radiofrequency (RF) pulses and gradient waveforms for transmit field inhomogeneity compensation in high field magnetic resonance imaging. An algorithm to design RF subpulse weights and gradient blip areas is proposed to minimize a magnitude least-squares cost function that measures the difference between realized and desired state parameters in the spin domain, and penalizes integrated RF power. The minimization problem is solved iteratively with interleaved target phase updates, RF subpulse weights updates using the conjugate gradient method with optimal control-based derivatives, and gradient blip area updates using the conjugate gradient method. Two-channel parallel transmit simulations and experiments were conducted in phantoms and human subjects at 7 T to demonstrate the method and compare it to small-tip-angle-designed pulses and circularly polarized excitations. The proposed algorithm designed more homogeneous and accurate 180° inversion and refocusing pulses than other methods. It also designed large-tip-angle pulses on multiple frequency bands with independent and joint phase relaxation. Pulses designed by the method improved specificity and contrast-to-noise ratio in a finger-tapping spin echo blood oxygen level dependent functional magnetic resonance imaging study, compared with circularly polarized mode refocusing. A joint RF and gradient waveform design algorithm was proposed and validated to improve large-tip-angle inversion and refocusing at ultrahigh field. © 2015 Wiley Periodicals, Inc.
Fast divide-and-conquer algorithm for evaluating polarization in classical force fields
NASA Astrophysics Data System (ADS)
Nocito, Dominique; Beran, Gregory J. O.
2017-03-01
Evaluation of the self-consistent polarization energy forms a major computational bottleneck in polarizable force fields. In large systems, the linear polarization equations are typically solved iteratively with techniques based on Jacobi iterations (JI) or preconditioned conjugate gradients (PCG). Two new variants of JI are proposed here that exploit domain decomposition to accelerate the convergence of the induced dipoles. The first, divide-and-conquer JI (DC-JI), is a block Jacobi algorithm which solves the polarization equations within non-overlapping sub-clusters of atoms directly via Cholesky decomposition, and iterates to capture interactions between sub-clusters. The second, fuzzy DC-JI, achieves further acceleration by employing overlapping blocks. Fuzzy DC-JI is analogous to an additive Schwarz method, but with distance-based weighting when averaging the fuzzy dipoles from different blocks. Key to the success of these algorithms is the use of K-means clustering to identify natural atomic sub-clusters automatically for both algorithms and to determine the appropriate weights in fuzzy DC-JI. The algorithm employs knowledge of the 3-D spatial interactions to group important elements in the 2-D polarization matrix. When coupled with direct inversion in the iterative subspace (DIIS) extrapolation, fuzzy DC-JI/DIIS in particular converges in a comparable number of iterations as PCG, but with lower computational cost per iteration. In the end, the new algorithms demonstrated here accelerate the evaluation of the polarization energy by 2-3 fold compared to existing implementations of PCG or JI/DIIS.
Chen, Kun; Zhang, Hongyuan; Wei, Haoyun; Li, Yan
2014-08-20
In this paper, we propose an improved subtraction algorithm for rapid recovery of Raman spectra that can substantially reduce the computation time. This algorithm is based on an improved Savitzky-Golay (SG) iterative smoothing method, which involves two key novel approaches: (a) the use of the Gauss-Seidel method and (b) the introduction of a relaxation factor into the iterative procedure. By applying a novel successive relaxation (SG-SR) iterative method to the relaxation factor, additional improvement in the convergence speed over the standard Savitzky-Golay procedure is realized. The proposed improved algorithm (the RIA-SG-SR algorithm), which uses SG-SR-based iteration instead of Savitzky-Golay iteration, has been optimized and validated with a mathematically simulated Raman spectrum, as well as experimentally measured Raman spectra from non-biological and biological samples. The method results in a significant reduction in computing cost while yielding consistent rejection of fluorescence and noise for spectra with low signal-to-fluorescence ratios and varied baselines. In the simulation, RIA-SG-SR achieved 1 order of magnitude improvement in iteration number and 2 orders of magnitude improvement in computation time compared with the range-independent background-subtraction algorithm (RIA). Furthermore the computation time of the experimentally measured raw Raman spectrum processing from skin tissue decreased from 6.72 to 0.094 s. In general, the processing of the SG-SR method can be conducted within dozens of milliseconds, which can provide a real-time procedure in practical situations.
A scalable, fully implicit algorithm for the reduced two-field low-β extended MHD model
Chacon, Luis; Stanier, Adam John
2016-12-01
Here, we demonstrate a scalable fully implicit algorithm for the two-field low-β extended MHD model. This reduced model describes plasma behavior in the presence of strong guide fields, and is of significant practical impact both in nature and in laboratory plasmas. The model displays strong hyperbolic behavior, as manifested by the presence of fast dispersive waves, which make a fully implicit treatment very challenging. In this study, we employ a Jacobian-free Newton–Krylov nonlinear solver, for which we propose a physics-based preconditioner that renders the linearized set of equations suitable for inversion with multigrid methods. As a result, the algorithm ismore » shown to scale both algorithmically (i.e., the iteration count is insensitive to grid refinement and timestep size) and in parallel in a weak-scaling sense, with the wall-clock time scaling weakly with the number of cores for up to 4096 cores. For a 4096 × 4096 mesh, we demonstrate a wall-clock-time speedup of ~6700 with respect to explicit algorithms. The model is validated linearly (against linear theory predictions) and nonlinearly (against fully kinetic simulations), demonstrating excellent agreement.« less
NASA Astrophysics Data System (ADS)
Ling, Jun
Achieving reliable underwater acoustic communications (UAC) has long been recognized as a challenging problem owing to the scarce bandwidth available and the reverberant spread in both time and frequency domains. To pursue high data rates, we consider a multi-input multi-output (MIMO) UAC system, and our focus is placed on two main issues regarding a MIMO UAC system: (1) channel estimation, which involves the design of the training sequences and the development of a reliable channel estimation algorithm, and (2) symbol detection, which requires interference cancelation schemes due to simultaneous transmission from multiple transducers. To enhance channel estimation performance, we present a cyclic approach for designing training sequences with good auto- and cross-correlation properties, and a channel estimation algorithm called the iterative adaptive approach (IAA). Sparse channel estimates can be obtained by combining IAA with the Bayesian information criterion (BIC). Moreover, we present sparse learning via iterative minimization (SLIM) and demonstrate that SLIM gives similar performance to IAA but at a much lower computational cost. Furthermore, an extension of the SLIM algorithm is introduced to estimate the sparse and frequency modulated acoustic channels. The extended algorithm is referred to as generalization of SLIM (GoSLIM). Regarding symbol detection, a linear minimum mean-squared error based detection scheme, called RELAX-BLAST, which is a combination of vertical Bell Labs layered space-time (V-BLAST) algorithm and the cyclic principle of the RELAX algorithm, is presented and it is shown that RELAX-BLAST outperforms V-BLAST. We show that RELAX-BLAST can be implemented efficiently by making use of the conjugate gradient method and diagonalization properties of circulant matrices. This fast implementation approach requires only simple fast Fourier transform operations and facilitates parallel implementations. The effectiveness of the proposed MIMO schemes is verified by both computer simulations and experimental results obtained by analyzing the measurements acquired in multiple in-water experiments.
Photoacoustic image reconstruction via deep learning
NASA Astrophysics Data System (ADS)
Antholzer, Stephan; Haltmeier, Markus; Nuster, Robert; Schwab, Johannes
2018-02-01
Applying standard algorithms to sparse data problems in photoacoustic tomography (PAT) yields low-quality images containing severe under-sampling artifacts. To some extent, these artifacts can be reduced by iterative image reconstruction algorithms which allow to include prior knowledge such as smoothness, total variation (TV) or sparsity constraints. These algorithms tend to be time consuming as the forward and adjoint problems have to be solved repeatedly. Further, iterative algorithms have additional drawbacks. For example, the reconstruction quality strongly depends on a-priori model assumptions about the objects to be recovered, which are often not strictly satisfied in practical applications. To overcome these issues, in this paper, we develop direct and efficient reconstruction algorithms based on deep learning. As opposed to iterative algorithms, we apply a convolutional neural network, whose parameters are trained before the reconstruction process based on a set of training data. For actual image reconstruction, a single evaluation of the trained network yields the desired result. Our presented numerical results (using two different network architectures) demonstrate that the proposed deep learning approach reconstructs images with a quality comparable to state of the art iterative reconstruction methods.
Dong, Jian; Hayakawa, Yoshihiko; Kannenberg, Sven; Kober, Cornelia
2013-02-01
The objective of this study was to reduce metal-induced streak artifact on oral and maxillofacial x-ray computed tomography (CT) images by developing the fast statistical image reconstruction system using iterative reconstruction algorithms. Adjacent CT images often depict similar anatomical structures in thin slices. So, first, images were reconstructed using the same projection data of an artifact-free image. Second, images were processed by the successive iterative restoration method where projection data were generated from reconstructed image in sequence. Besides the maximum likelihood-expectation maximization algorithm, the ordered subset-expectation maximization algorithm (OS-EM) was examined. Also, small region of interest (ROI) setting and reverse processing were applied for improving performance. Both algorithms reduced artifacts instead of slightly decreasing gray levels. The OS-EM and small ROI reduced the processing duration without apparent detriments. Sequential and reverse processing did not show apparent effects. Two alternatives in iterative reconstruction methods were effective for artifact reduction. The OS-EM algorithm and small ROI setting improved the performance. Copyright © 2012 Elsevier Inc. All rights reserved.
Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
2008-08-01
The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.
Implementation of the Iterative Proportion Fitting Algorithm for Geostatistical Facies Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li Yupeng, E-mail: yupeng@ualberta.ca; Deutsch, Clayton V.
2012-06-15
In geostatistics, most stochastic algorithm for simulation of categorical variables such as facies or rock types require a conditional probability distribution. The multivariate probability distribution of all the grouped locations including the unsampled location permits calculation of the conditional probability directly based on its definition. In this article, the iterative proportion fitting (IPF) algorithm is implemented to infer this multivariate probability. Using the IPF algorithm, the multivariate probability is obtained by iterative modification to an initial estimated multivariate probability using lower order bivariate probabilities as constraints. The imposed bivariate marginal probabilities are inferred from profiles along drill holes or wells.more » In the IPF process, a sparse matrix is used to calculate the marginal probabilities from the multivariate probability, which makes the iterative fitting more tractable and practical. This algorithm can be extended to higher order marginal probability constraints as used in multiple point statistics. The theoretical framework is developed and illustrated with estimation and simulation example.« less
Robust parallel iterative solvers for linear and least-squares problems, Final Technical Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saad, Yousef
2014-01-16
The primary goal of this project is to study and develop robust iterative methods for solving linear systems of equations and least squares systems. The focus of the Minnesota team is on algorithms development, robustness issues, and on tests and validation of the methods on realistic problems. 1. The project begun with an investigation on how to practically update a preconditioner obtained from an ILU-type factorization, when the coefficient matrix changes. 2. We investigated strategies to improve robustness in parallel preconditioners in a specific case of a PDE with discontinuous coefficients. 3. We explored ways to adapt standard preconditioners formore » solving linear systems arising from the Helmholtz equation. These are often difficult linear systems to solve by iterative methods. 4. We have also worked on purely theoretical issues related to the analysis of Krylov subspace methods for linear systems. 5. We developed an effective strategy for performing ILU factorizations for the case when the matrix is highly indefinite. The strategy uses shifting in some optimal way. The method was extended to the solution of Helmholtz equations by using complex shifts, yielding very good results in many cases. 6. We addressed the difficult problem of preconditioning sparse systems of equations on GPUs. 7. A by-product of the above work is a software package consisting of an iterative solver library for GPUs based on CUDA. This was made publicly available. It was the first such library that offers complete iterative solvers for GPUs. 8. We considered another form of ILU which blends coarsening techniques from Multigrid with algebraic multilevel methods. 9. We have released a new version on our parallel solver - called pARMS [new version is version 3]. As part of this we have tested the code in complex settings - including the solution of Maxwell and Helmholtz equations and for a problem of crystal growth.10. As an application of polynomial preconditioning we considered the problem of evaluating f(A)v which arises in statistical sampling. 11. As an application to the methods we developed, we tackled the problem of computing the diagonal of the inverse of a matrix. This arises in statistical applications as well as in many applications in physics. We explored probing methods as well as domain-decomposition type methods. 12. A collaboration with researchers from Toulouse, France, considered the important problem of computing the Schur complement in a domain-decomposition approach. 13. We explored new ways of preconditioning linear systems, based on low-rank approximations.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Brian M.; Ebeida, Mohamed Salah; Eldred, Michael S.
The Dakota (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a exible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quanti cation with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components requiredmore » for iterative systems analyses, the Dakota toolkit provides a exible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user's manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.« less
3D and 4D magnetic susceptibility tomography based on complex MR images
Chen, Zikuan; Calhoun, Vince D
2014-11-11
Magnetic susceptibility is the physical property for T2*-weighted magnetic resonance imaging (T2*MRI). The invention relates to methods for reconstructing an internal distribution (3D map) of magnetic susceptibility values, .chi. (x,y,z), of an object, from 3D T2*MRI phase images, by using Computed Inverse Magnetic Resonance Imaging (CIMRI) tomography. The CIMRI technique solves the inverse problem of the 3D convolution by executing a 3D Total Variation (TV) regularized iterative convolution scheme, using a split Bregman iteration algorithm. The reconstruction of .chi. (x,y,z) can be designed for low-pass, band-pass, and high-pass features by using a convolution kernel that is modified from the standard dipole kernel. Multiple reconstructions can be implemented in parallel, and averaging the reconstructions can suppress noise. 4D dynamic magnetic susceptibility tomography can be implemented by reconstructing a 3D susceptibility volume from a 3D phase volume by performing 3D CIMRI magnetic susceptibility tomography at each snapshot time.
Composition of web services using Markov decision processes and dynamic programming.
Uc-Cetina, Víctor; Moo-Mena, Francisco; Hernandez-Ucan, Rafael
2015-01-01
We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation, value iteration, and policy iteration algorithms are used to experimentally validate our approach, with artificial and real data. The experimental results show the reliability of the model and the methods employed, with policy iteration being the best one in terms of the minimum number of iterations needed to estimate an optimal policy, with the highest Quality of Service attributes. Our experimental work shows how the solution of a WSC problem involving a set of 100,000 individual Web services and where a valid composition requiring the selection of 1,000 services from the available set can be computed in the worst case in less than 200 seconds, using an Intel Core i5 computer with 6 GB RAM. Moreover, a real WSC problem involving only 7 individual Web services requires less than 0.08 seconds, using the same computational power. Finally, a comparison with two popular reinforcement learning algorithms, sarsa and Q-learning, shows that these algorithms require one or two orders of magnitude and more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the same complexity.
Multimodel Kalman filtering for adaptive nonuniformity correction in infrared sensors.
Pezoa, Jorge E; Hayat, Majeed M; Torres, Sergio N; Rahman, Md Saifur
2006-06-01
We present an adaptive technique for the estimation of nonuniformity parameters of infrared focal-plane arrays that is robust with respect to changes and uncertainties in scene and sensor characteristics. The proposed algorithm is based on using a bank of Kalman filters in parallel. Each filter independently estimates state variables comprising the gain and the bias matrices of the sensor, according to its own dynamic-model parameters. The supervising component of the algorithm then generates the final estimates of the state variables by forming a weighted superposition of all the estimates rendered by each Kalman filter. The weights are computed and updated iteratively, according to the a posteriori-likelihood principle. The performance of the estimator and its ability to compensate for fixed-pattern noise is tested using both simulated and real data obtained from two cameras operating in the mid- and long-wave infrared regime.
Layout optimization with algebraic multigrid methods
NASA Technical Reports Server (NTRS)
Regler, Hans; Ruede, Ulrich
1993-01-01
Finding the optimal position for the individual cells (also called functional modules) on the chip surface is an important and difficult step in the design of integrated circuits. This paper deals with the problem of relative placement, that is the minimization of a quadratic functional with a large, sparse, positive definite system matrix. The basic optimization problem must be augmented by constraints to inhibit solutions where cells overlap. Besides classical iterative methods, based on conjugate gradients (CG), we show that algebraic multigrid methods (AMG) provide an interesting alternative. For moderately sized examples with about 10000 cells, AMG is already competitive with CG and is expected to be superior for larger problems. Besides the classical 'multiplicative' AMG algorithm where the levels are visited sequentially, we propose an 'additive' variant of AMG where levels may be treated in parallel and that is suitable as a preconditioner in the CG algorithm.
Final report for “Extreme-scale Algorithms and Solver Resilience”
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gropp, William Douglas
2017-06-30
This is a joint project with principal investigators at Oak Ridge National Laboratory, Sandia National Laboratories, the University of California at Berkeley, and the University of Tennessee. Our part of the project involves developing performance models for highly scalable algorithms and the development of latency tolerant iterative methods. During this project, we extended our performance models for the Multigrid method for solving large systems of linear equations and conducted experiments with highly scalable variants of conjugate gradient methods that avoid blocking synchronization. In addition, we worked with the other members of the project on alternative techniques for resilience and reproducibility.more » We also presented an alternative approach for reproducible dot-products in parallel computations that performs almost as well as the conventional approach by separating the order of computation from the details of the decomposition of vectors across the processes.« less
Barrier-breaking performance for industrial problems on the CRAY C916
DOE Office of Scientific and Technical Information (OSTI.GOV)
Graffunder, S.K.
1993-12-31
Nine applications, including third-party codes, were submitted to the Gordon Bell Prize committee showing the CRAY C916 supercomputer providing record-breaking time to solution for industrial problems in several disciplines. Performance was obtained by balancing raw hardware speed; effective use of large, real, shared memory; compiler vectorization and autotasking; hand optimization; asynchronous I/O techniques; and new algorithms. The highest GFLOPS performance for the submissions was 11.1 GFLOPS out of a peak advertised performance of 16 GFLOPS for the CRAY C916 system. One program achieved a 15.45 speedup from the compiler with just two hand-inserted directives to scope variables properly for themore » mathematical library. New I/O techniques hide tens of gigabytes of I/O behind parallel computations. Finally, new iterative solver algorithms have demonstrated times to solution on 1 CPU as high as 70 times faster than the best direct solvers.« less
Computational aspects of helicopter trim analysis and damping levels from Floquet theory
NASA Technical Reports Server (NTRS)
Gaonkar, Gopal H.; Achar, N. S.
1992-01-01
Helicopter trim settings of periodic initial state and control inputs are investigated for convergence of Newton iteration in computing the settings sequentially and in parallel. The trim analysis uses a shooting method and a weak version of two temporal finite element methods with displacement formulation and with mixed formulation of displacements and momenta. These three methods broadly represent two main approaches of trim analysis: adaptation of initial-value and finite element boundary-value codes to periodic boundary conditions, particularly for unstable and marginally stable systems. In each method, both the sequential and in-parallel schemes are used and the resulting nonlinear algebraic equations are solved by damped Newton iteration with an optimally selected damping parameter. The impact of damped Newton iteration, including earlier-observed divergence problems in trim analysis, is demonstrated by the maximum condition number of the Jacobian matrices of the iterative scheme and by virtual elimination of divergence. The advantages of the in-parallel scheme over the conventional sequential scheme are also demonstrated.
An L1-norm phase constraint for half-Fourier compressed sensing in 3D MR imaging.
Li, Guobin; Hennig, Jürgen; Raithel, Esther; Büchert, Martin; Paul, Dominik; Korvink, Jan G; Zaitsev, Maxim
2015-10-01
In most half-Fourier imaging methods, explicit phase replacement is used. In combination with parallel imaging, or compressed sensing, half-Fourier reconstruction is usually performed in a separate step. The purpose of this paper is to report that integration of half-Fourier reconstruction into iterative reconstruction minimizes reconstruction errors. The L1-norm phase constraint for half-Fourier imaging proposed in this work is compared with the L2-norm variant of the same algorithm, with several typical half-Fourier reconstruction methods. Half-Fourier imaging with the proposed phase constraint can be seamlessly combined with parallel imaging and compressed sensing to achieve high acceleration factors. In simulations and in in-vivo experiments half-Fourier imaging with the proposed L1-norm phase constraint enables superior performance both reconstruction of image details and with regard to robustness against phase estimation errors. The performance and feasibility of half-Fourier imaging with the proposed L1-norm phase constraint is reported. Its seamless combination with parallel imaging and compressed sensing enables use of greater acceleration in 3D MR imaging.
A superlinear interior points algorithm for engineering design optimization
NASA Technical Reports Server (NTRS)
Herskovits, J.; Asquier, J.
1990-01-01
We present a quasi-Newton interior points algorithm for nonlinear constrained optimization. It is based on a general approach consisting of the iterative solution in the primal and dual spaces of the equalities in Karush-Kuhn-Tucker optimality conditions. This is done in such a way to have primal and dual feasibility at each iteration, which ensures satisfaction of those optimality conditions at the limit points. This approach is very strong and efficient, since at each iteration it only requires the solution of two linear systems with the same matrix, instead of quadratic programming subproblems. It is also particularly appropriate for engineering design optimization inasmuch at each iteration a feasible design is obtained. The present algorithm uses a quasi-Newton approximation of the second derivative of the Lagrangian function in order to have superlinear asymptotic convergence. We discuss theoretical aspects of the algorithm and its computer implementation.
Iterative methods for plasma sheath calculations: Application to spherical probe
NASA Technical Reports Server (NTRS)
Parker, L. W.; Sullivan, E. C.
1973-01-01
The computer cost of a Poisson-Vlasov iteration procedure for the numerical solution of a steady-state collisionless plasma-sheath problem depends on: (1) the nature of the chosen iterative algorithm, (2) the position of the outer boundary of the grid, and (3) the nature of the boundary condition applied to simulate a condition at infinity (as in three-dimensional probe or satellite-wake problems). Two iterative algorithms, in conjunction with three types of boundary conditions, are analyzed theoretically and applied to the computation of current-voltage characteristics of a spherical electrostatic probe. The first algorithm was commonly used by physicists, and its computer costs depend primarily on the boundary conditions and are only slightly affected by the mesh interval. The second algorithm is not commonly used, and its costs depend primarily on the mesh interval and slightly on the boundary conditions.
Image reconstruction from cone-beam projections with attenuation correction
NASA Astrophysics Data System (ADS)
Weng, Yi
1997-07-01
In single photon emission computered tomography (SPECT) imaging, photon attenuation within the body is a major factor contributing to the quantitative inaccuracy in measuring the distribution of radioactivity. Cone-beam SPECT provides improved sensitivity for imaging small organs. This thesis extends the results for 2D parallel- beam and fan-beam geometry to 3D parallel-beam and cone- beam geometries in order to derive filtered backprojection reconstruction algorithms for the 3D exponential parallel-beam transform and for the exponential cone-beam transform with sampling on a sphere. An exact inversion formula for the 3D exponential parallel-beam transform is obtained and is extended to the 3D exponential cone-beam transform. Sampling on a sphere is not useful clinically and current cone-beam tomography, with the focal point traversing a planar orbit, does not acquire sufficient data to give an accurate reconstruction. Thus a data acquisition method that obtains complete data for cone-beam SPECT by simultaneously rotating the gamma camera and translating the patient bed, so that cone-beam projections can be obtained with the focal point traversing a helix that surrounds the patient was developed. First, an implementation of Grangeat's algorithm for helical cone- beam projections was developed without attenuation correction. A fast new rebinning scheme was developed that uses all of the detected data to reconstruct the image and properly normalizes any multiply scanned data. In the case of attenuation no theorem analogous to Tuy's has been proven. We hypothesized that an artifact-free reconstruction could be obtained even if the cone-beam data are attenuated, provided the imaging orbit satisfies Tuy's condition and the exact attenuation map is known. Cone-beam emission data were acquired by using a circle- and-line and a helix orbit on a clinical SPECT system. An iterative conjugate gradient reconstruction algorithm was used to reconstruct projection data with a known attenuation map. The quantitative accuracy of the attenuation-corrected emission reconstruction was significantly improved.
Run-time parallelization and scheduling of loops
NASA Technical Reports Server (NTRS)
Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay
1991-01-01
Run-time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run-time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing, and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indexes can have a significant impact on performance.
Exploiting parallel computing with limited program changes using a network of microcomputers
NASA Technical Reports Server (NTRS)
Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.
1985-01-01
Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronus processing was studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.
Okariz, Ana; Guraya, Teresa; Iturrondobeitia, Maider; Ibarretxe, Julen
2017-02-01
The SIRT (Simultaneous Iterative Reconstruction Technique) algorithm is commonly used in Electron Tomography to calculate the original volume of the sample from noisy images, but the results provided by this iterative procedure are strongly dependent on the specific implementation of the algorithm, as well as on the number of iterations employed for the reconstruction. In this work, a methodology for selecting the iteration number of the SIRT reconstruction that provides the most accurate segmentation is proposed. The methodology is based on the statistical analysis of the intensity profiles at the edge of the objects in the reconstructed volume. A phantom which resembles a a carbon black aggregate has been created to validate the methodology and the SIRT implementations of two free software packages (TOMOJ and TOMO3D) have been used. Copyright © 2016 Elsevier B.V. All rights reserved.
Iterative projection algorithms for ab initio phasing in virus crystallography.
Lo, Victor L; Kingston, Richard L; Millane, Rick P
2016-12-01
Iterative projection algorithms are proposed as a tool for ab initio phasing in virus crystallography. The good global convergence properties of these algorithms, coupled with the spherical shape and high structural redundancy of icosahedral viruses, allows high resolution phases to be determined with no initial phase information. This approach is demonstrated by determining the electron density of a virus crystal with 5-fold non-crystallographic symmetry, starting with only a spherical shell envelope. The electron density obtained is sufficiently accurate for model building. The results indicate that iterative projection algorithms should be routinely applicable in virus crystallography, without the need for ancillary phase information. Copyright © 2016 Elsevier Inc. All rights reserved.
On adaptive learning rate that guarantees convergence in feedforward networks.
Behera, Laxmidhar; Kumar, Swagat; Patnaik, Awhan
2006-09-01
This paper investigates new learning algorithms (LF I and LF II) based on Lyapunov function for the training of feedforward neural networks. It is observed that such algorithms have interesting parallel with the popular backpropagation (BP) algorithm where the fixed learning rate is replaced by an adaptive learning rate computed using convergence theorem based on Lyapunov stability theory. LF II, a modified version of LF I, has been introduced with an aim to avoid local minima. This modification also helps in improving the convergence speed in some cases. Conditions for achieving global minimum for these kind of algorithms have been studied in detail. The performances of the proposed algorithms are compared with BP algorithm and extended Kalman filtering (EKF) on three bench-mark function approximation problems: XOR, 3-bit parity, and 8-3 encoder. The comparisons are made in terms of number of learning iterations and computational time required for convergence. It is found that the proposed algorithms (LF I and II) are much faster in convergence than other two algorithms to attain same accuracy. Finally, the comparison is made on a complex two-dimensional (2-D) Gabor function and effect of adaptive learning rate for faster convergence is verified. In a nutshell, the investigations made in this paper help us better understand the learning procedure of feedforward neural networks in terms of adaptive learning rate, convergence speed, and local minima.
NASA Astrophysics Data System (ADS)
Zhou, Meiling; Singh, Alok Kumar; Pedrini, Giancarlo; Osten, Wolfgang; Min, Junwei; Yao, Baoli
2018-03-01
We present a tunable output-frequency filter (TOF) algorithm to reconstruct the object from noisy experimental data under low-power partially coherent illumination, such as LED, when imaging through scattering media. In the iterative algorithm, we employ Gaussian functions with different filter windows at different stages of iteration process to reduce corruption from experimental noise to search for a global minimum in the reconstruction. In comparison with the conventional iterative phase retrieval algorithm, we demonstrate that the proposed TOF algorithm achieves consistent and reliable reconstruction in the presence of experimental noise. Moreover, the spatial resolution and distinctive features are retained in the reconstruction since the filter is applied only to the region outside the object. The feasibility of the proposed method is proved by experimental results.
Lin, Jyh-Miin; Patterson, Andrew J; Chang, Hing-Chiu; Gillard, Jonathan H; Graves, Martin J
2015-10-01
To propose a new reduced field-of-view (rFOV) strategy for iterative reconstructions in a clinical environment. Iterative reconstructions can incorporate regularization terms to improve the image quality of periodically rotated overlapping parallel lines with enhanced reconstruction (PROPELLER) MRI. However, the large amount of calculations required for full FOV iterative reconstructions has posed a huge computational challenge for clinical usage. By subdividing the entire problem into smaller rFOVs, the iterative reconstruction can be accelerated on a desktop with a single graphic processing unit (GPU). This rFOV strategy divides the iterative reconstruction into blocks, based on the block-diagonal dominant structure. A near real-time reconstruction system was developed for the clinical MR unit, and parallel computing was implemented using the object-oriented model. In addition, the Toeplitz method was implemented on the GPU to reduce the time required for full interpolation. Using the data acquired from the PROPELLER MRI, the reconstructed images were then saved in the digital imaging and communications in medicine format. The proposed rFOV reconstruction reduced the gridding time by 97%, as the total iteration time was 3 s even with multiple processes running. A phantom study showed that the structure similarity index for rFOV reconstruction was statistically superior to conventional density compensation (p < 0.001). In vivo study validated the increased signal-to-noise ratio, which is over four times higher than with density compensation. Image sharpness index was improved using the regularized reconstruction implemented. The rFOV strategy permits near real-time iterative reconstruction to improve the image quality of PROPELLER images. Substantial improvements in image quality metrics were validated in the experiments. The concept of rFOV reconstruction may potentially be applied to other kinds of iterative reconstructions for shortened reconstruction duration.
An implicit iterative algorithm with a tuning parameter for Itô Lyapunov matrix equations
NASA Astrophysics Data System (ADS)
Zhang, Ying; Wu, Ai-Guo; Sun, Hui-Jie
2018-01-01
In this paper, an implicit iterative algorithm is proposed for solving a class of Lyapunov matrix equations arising in Itô stochastic linear systems. A tuning parameter is introduced in this algorithm, and thus the convergence rate of the algorithm can be changed. Some conditions are presented such that the developed algorithm is convergent. In addition, an explicit expression is also derived for the optimal tuning parameter, which guarantees that the obtained algorithm achieves its fastest convergence rate. Finally, numerical examples are employed to illustrate the effectiveness of the given algorithm.
Fast polar decomposition of an arbitrary matrix
NASA Technical Reports Server (NTRS)
Higham, Nicholas J.; Schreiber, Robert S.
1988-01-01
The polar decomposition of an m x n matrix A of full rank, where m is greater than or equal to n, can be computed using a quadratically convergent algorithm. The algorithm is based on a Newton iteration involving a matrix inverse. With the use of a preliminary complete orthogonal decomposition the algorithm can be extended to arbitrary A. How to use the algorithm to compute the positive semi-definite square root of a Hermitian positive semi-definite matrix is described. A hybrid algorithm which adaptively switches from the matrix inversion based iteration to a matrix multiplication based iteration due to Kovarik, and to Bjorck and Bowie is formulated. The decision when to switch is made using a condition estimator. This matrix multiplication rich algorithm is shown to be more efficient on machines for which matrix multiplication can be executed 1.5 times faster than matrix inversion.
Parallel goal-oriented adaptive finite element modeling for 3D electromagnetic exploration
NASA Astrophysics Data System (ADS)
Zhang, Y.; Key, K.; Ovall, J.; Holst, M.
2014-12-01
We present a parallel goal-oriented adaptive finite element method for accurate and efficient electromagnetic (EM) modeling of complex 3D structures. An unstructured tetrahedral mesh allows this approach to accommodate arbitrarily complex 3D conductivity variations and a priori known boundaries. The total electric field is approximated by the lowest order linear curl-conforming shape functions and the discretized finite element equations are solved by a sparse LU factorization. Accuracy of the finite element solution is achieved through adaptive mesh refinement that is performed iteratively until the solution converges to the desired accuracy tolerance. Refinement is guided by a goal-oriented error estimator that uses a dual-weighted residual method to optimize the mesh for accurate EM responses at the locations of the EM receivers. As a result, the mesh refinement is highly efficient since it only targets the elements where the inaccuracy of the solution corrupts the response at the possibly distant locations of the EM receivers. We compare the accuracy and efficiency of two approaches for estimating the primary residual error required at the core of this method: one uses local element and inter-element residuals and the other relies on solving a global residual system using a hierarchical basis. For computational efficiency our method follows the Bank-Holst algorithm for parallelization, where solutions are computed in subdomains of the original model. To resolve the load-balancing problem, this approach applies a spectral bisection method to divide the entire model into subdomains that have approximately equal error and the same number of receivers. The finite element solutions are then computed in parallel with each subdomain carrying out goal-oriented adaptive mesh refinement independently. We validate the newly developed algorithm by comparison with controlled-source EM solutions for 1D layered models and with 2D results from our earlier 2D goal oriented adaptive refinement code named MARE2DEM. We demonstrate the performance and parallel scaling of this algorithm on a medium-scale computing cluster with a marine controlled-source EM example that includes a 3D array of receivers located over a 3D model that includes significant seafloor bathymetry variations and a heterogeneous subsurface.
Iterative-Transform Phase Retrieval Using Adaptive Diversity
NASA Technical Reports Server (NTRS)
Dean, Bruce H.
2007-01-01
A phase-diverse iterative-transform phase-retrieval algorithm enables high spatial-frequency, high-dynamic-range, image-based wavefront sensing. [The terms phase-diverse, phase retrieval, image-based, and wavefront sensing are defined in the first of the two immediately preceding articles, Broadband Phase Retrieval for Image-Based Wavefront Sensing (GSC-14899-1).] As described below, no prior phase-retrieval algorithm has offered both high dynamic range and the capability to recover high spatial-frequency components. Each of the previously developed image-based phase-retrieval techniques can be classified into one of two categories: iterative transform or parametric. Among the modifications of the original iterative-transform approach has been the introduction of a defocus diversity function (also defined in the cited companion article). Modifications of the original parametric approach have included minimizing alternative objective functions as well as implementing a variety of nonlinear optimization methods. The iterative-transform approach offers the advantage of ability to recover low, middle, and high spatial frequencies, but has disadvantage of having a limited dynamic range to one wavelength or less. In contrast, parametric phase retrieval offers the advantage of high dynamic range, but is poorly suited for recovering higher spatial frequency aberrations. The present phase-diverse iterative transform phase-retrieval algorithm offers both the high-spatial-frequency capability of the iterative-transform approach and the high dynamic range of parametric phase-recovery techniques. In implementation, this is a focus-diverse iterative-transform phaseretrieval algorithm that incorporates an adaptive diversity function, which makes it possible to avoid phase unwrapping while preserving high-spatial-frequency recovery. The algorithm includes an inner and an outer loop (see figure). An initial estimate of phase is used to start the algorithm on the inner loop, wherein multiple intensity images are processed, each using a different defocus value. The processing is done by an iterative-transform method, yielding individual phase estimates corresponding to each image of the defocus-diversity data set. These individual phase estimates are combined in a weighted average to form a new phase estimate, which serves as the initial phase estimate for either the next iteration of the iterative-transform method or, if the maximum number of iterations has been reached, for the next several steps, which constitute the outerloop portion of the algorithm. The details of the next several steps must be omitted here for the sake of brevity. The overall effect of these steps is to adaptively update the diversity defocus values according to recovery of global defocus in the phase estimate. Aberration recovery varies with differing amounts as the amount of diversity defocus is updated in each image; thus, feedback is incorporated into the recovery process. This process is iterated until the global defocus error is driven to zero during the recovery process. The amplitude of aberration may far exceed one wavelength after completion of the inner-loop portion of the algorithm, and the classical iterative transform method does not, by itself, enable recovery of multi-wavelength aberrations. Hence, in the absence of a means of off-loading the multi-wavelength portion of the aberration, the algorithm would produce a wrapped phase map. However, a special aberration-fitting procedure can be applied to the wrapped phase data to transfer at least some portion of the multi-wavelength aberration to the diversity function, wherein the data are treated as known phase values. In this way, a multiwavelength aberration can be recovered incrementally by successively applying the aberration-fitting procedure to intermediate wrapped phase maps. During recovery, as more of the aberration is transferred to the diversity function following successive iterations around the ter loop, the estimated phase ceases to wrap in places where the aberration values become incorporated as part of the diversity function. As a result, as the aberration content is transferred to the diversity function, the phase estimate resembles that of a reference flat.
Iterative Neighbour-Information Gathering for Ranking Nodes in Complex Networks
NASA Astrophysics Data System (ADS)
Xu, Shuang; Wang, Pei; Lü, Jinhu
2017-01-01
Designing node influence ranking algorithms can provide insights into network dynamics, functions and structures. Increasingly evidences reveal that node’s spreading ability largely depends on its neighbours. We introduce an iterative neighbourinformation gathering (Ing) process with three parameters, including a transformation matrix, a priori information and an iteration time. The Ing process iteratively combines priori information from neighbours via the transformation matrix, and iteratively assigns an Ing score to each node to evaluate its influence. The algorithm appropriates for any types of networks, and includes some traditional centralities as special cases, such as degree, semi-local, LeaderRank. The Ing process converges in strongly connected networks with speed relying on the first two largest eigenvalues of the transformation matrix. Interestingly, the eigenvector centrality corresponds to a limit case of the algorithm. By comparing with eight renowned centralities, simulations of susceptible-infected-removed (SIR) model on real-world networks reveal that the Ing can offer more exact rankings, even without a priori information. We also observe that an optimal iteration time is always in existence to realize best characterizing of node influence. The proposed algorithms bridge the gaps among some existing measures, and may have potential applications in infectious disease control, designing of optimal information spreading strategies.
Conjugate gradient coupled with multigrid for an indefinite problem
NASA Technical Reports Server (NTRS)
Gozani, J.; Nachshon, A.; Turkel, E.
1984-01-01
An iterative algorithm for the Helmholtz equation is presented. This scheme was based on the preconditioned conjugate gradient method for the normal equations. The preconditioning is one cycle of a multigrid method for the discrete Laplacian. The smoothing algorithm is red-black Gauss-Seidel and is constructed so it is a symmetric operator. The total number of iterations needed by the algorithm is independent of h. By varying the number of grids, the number of iterations depends only weakly on k when k(3)h(2) is constant. Comparisons with a SSOR preconditioner are presented.
Regularization of nonlinear decomposition of spectral x-ray projection images.
Ducros, Nicolas; Abascal, Juan Felipe Perez-Juste; Sixou, Bruno; Rit, Simon; Peyrin, Françoise
2017-09-01
Exploiting the x-ray measurements obtained in different energy bins, spectral computed tomography (CT) has the ability to recover the 3-D description of a patient in a material basis. This may be achieved solving two subproblems, namely the material decomposition and the tomographic reconstruction problems. In this work, we address the material decomposition of spectral x-ray projection images, which is a nonlinear ill-posed problem. Our main contribution is to introduce a material-dependent spatial regularization in the projection domain. The decomposition problem is solved iteratively using a Gauss-Newton algorithm that can benefit from fast linear solvers. A Matlab implementation is available online. The proposed regularized weighted least squares Gauss-Newton algorithm (RWLS-GN) is validated on numerical simulations of a thorax phantom made of up to five materials (soft tissue, bone, lung, adipose tissue, and gadolinium), which is scanned with a 120 kV source and imaged by a 4-bin photon counting detector. To evaluate the method performance of our algorithm, different scenarios are created by varying the number of incident photons, the concentration of the marker and the configuration of the phantom. The RWLS-GN method is compared to the reference maximum likelihood Nelder-Mead algorithm (ML-NM). The convergence of the proposed method and its dependence on the regularization parameter are also studied. We show that material decomposition is feasible with the proposed method and that it converges in few iterations. Material decomposition with ML-NM was very sensitive to noise, leading to decomposed images highly affected by noise, and artifacts even for the best case scenario. The proposed method was less sensitive to noise and improved contrast-to-noise ratio of the gadolinium image. Results were superior to those provided by ML-NM in terms of image quality and decomposition was 70 times faster. For the assessed experiments, material decomposition was possible with the proposed method when the number of incident photons was equal or larger than 10 5 and when the marker concentration was equal or larger than 0.03 g·cm -3 . The proposed method efficiently solves the nonlinear decomposition problem for spectral CT, which opens up new possibilities such as material-specific regularization in the projection domain and a parallelization framework, in which projections are solved in parallel. © 2017 American Association of Physicists in Medicine.
Fast parallel algorithm for slicing STL based on pipeline
NASA Astrophysics Data System (ADS)
Ma, Xulong; Lin, Feng; Yao, Bo
2016-05-01
In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Strout, Michelle
Programming parallel machines is fraught with difficulties: the obfuscation of algorithms due to implementation details such as communication and synchronization, the need for transparency between language constructs and performance, the difficulty of performing program analysis to enable automatic parallelization techniques, and the existence of important "dusty deck" codes. The SAIMI project developed abstractions that enable the orthogonal specification of algorithms and implementation details within the context of existing DOE applications. The main idea is to enable the injection of small programming models such as expressions involving transcendental functions, polyhedral iteration spaces with sparse constraints, and task graphs into full programsmore » through the use of pragmas. These smaller, more restricted programming models enable orthogonal specification of many implementation details such as how to map the computation on to parallel processors, how to schedule the computation, and how to allocation storage for the computation. At the same time, these small programming models enable the expression of the most computationally intense and communication heavy portions in many scientific simulations. The ability to orthogonally manipulate the implementation for such computations will significantly ease performance programming efforts and expose transformation possibilities and parameter to automated approaches such as autotuning. At Colorado State University, the SAIMI project was supported through DOE grant DE-SC3956 from April 2010 through August 2015. The SAIMI project has contributed a number of important results to programming abstractions that enable the orthogonal specification of implementation details in scientific codes. This final report summarizes the research that was funded by the SAIMI project.« less
Gong, Pinghua; Zhang, Changshui; Lu, Zhaosong; Huang, Jianhua Z; Ye, Jieping
2013-01-01
Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge. A commonly used approach is the Multi-Stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is usually not very practical for large-scale problems because its computational cost is a multiple of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which in turn has a closed-form solution for many commonly used penalties. At each outer iteration of the algorithm, we use a line search initialized by the Barzilai-Borwein (BB) rule that allows finding an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated by extensive experiments on large-scale data sets.
A parallel algorithm for the eigenvalues and eigenvectors for a general complex matrix
NASA Technical Reports Server (NTRS)
Shroff, Gautam
1989-01-01
A new parallel Jacobi-like algorithm is developed for computing the eigenvalues of a general complex matrix. Most parallel methods for this parallel typically display only linear convergence. Sequential norm-reducing algorithms also exit and they display quadratic convergence in most cases. The new algorithm is a parallel form of the norm-reducing algorithm due to Eberlein. It is proven that the asymptotic convergence rate of this algorithm is quadratic. Numerical experiments are presented which demonstrate the quadratic convergence of the algorithm and certain situations where the convergence is slow are also identified. The algorithm promises to be very competitive on a variety of parallel architectures.
Composition of Web Services Using Markov Decision Processes and Dynamic Programming
Uc-Cetina, Víctor; Moo-Mena, Francisco; Hernandez-Ucan, Rafael
2015-01-01
We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation, value iteration, and policy iteration algorithms are used to experimentally validate our approach, with artificial and real data. The experimental results show the reliability of the model and the methods employed, with policy iteration being the best one in terms of the minimum number of iterations needed to estimate an optimal policy, with the highest Quality of Service attributes. Our experimental work shows how the solution of a WSC problem involving a set of 100,000 individual Web services and where a valid composition requiring the selection of 1,000 services from the available set can be computed in the worst case in less than 200 seconds, using an Intel Core i5 computer with 6 GB RAM. Moreover, a real WSC problem involving only 7 individual Web services requires less than 0.08 seconds, using the same computational power. Finally, a comparison with two popular reinforcement learning algorithms, sarsa and Q-learning, shows that these algorithms require one or two orders of magnitude and more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the same complexity. PMID:25874247
Liu, Xiao; Shi, Jun; Zhou, Shichong; Lu, Minhua
2014-01-01
The dimensionality reduction is an important step in ultrasound image based computer-aided diagnosis (CAD) for breast cancer. A newly proposed l2,1 regularized correntropy algorithm for robust feature selection (CRFS) has achieved good performance for noise corrupted data. Therefore, it has the potential to reduce the dimensions of ultrasound image features. However, in clinical practice, the collection of labeled instances is usually expensive and time costing, while it is relatively easy to acquire the unlabeled or undetermined instances. Therefore, the semi-supervised learning is very suitable for clinical CAD. The iterated Laplacian regularization (Iter-LR) is a new regularization method, which has been proved to outperform the traditional graph Laplacian regularization in semi-supervised classification and ranking. In this study, to augment the classification accuracy of the breast ultrasound CAD based on texture feature, we propose an Iter-LR-based semi-supervised CRFS (Iter-LR-CRFS) algorithm, and then apply it to reduce the feature dimensions of ultrasound images for breast CAD. We compared the Iter-LR-CRFS with LR-CRFS, original supervised CRFS, and principal component analysis. The experimental results indicate that the proposed Iter-LR-CRFS significantly outperforms all other algorithms.
NASA Astrophysics Data System (ADS)
Furuichi, M.; Nishiura, D.
2015-12-01
Fully Lagrangian methods such as Smoothed Particle Hydrodynamics (SPH) and Discrete Element Method (DEM) have been widely used to solve the continuum and particles motions in the computational geodynamics field. These mesh-free methods are suitable for the problems with the complex geometry and boundary. In addition, their Lagrangian nature allows non-diffusive advection useful for tracking history dependent properties (e.g. rheology) of the material. These potential advantages over the mesh-based methods offer effective numerical applications to the geophysical flow and tectonic processes, which are for example, tsunami with free surface and floating body, magma intrusion with fracture of rock, and shear zone pattern generation of granular deformation. In order to investigate such geodynamical problems with the particle based methods, over millions to billion particles are required for the realistic simulation. Parallel computing is therefore important for handling such huge computational cost. An efficient parallel implementation of SPH and DEM methods is however known to be difficult especially for the distributed-memory architecture. Lagrangian methods inherently show workload imbalance problem for parallelization with the fixed domain in space, because particles move around and workloads change during the simulation. Therefore dynamic load balance is key technique to perform the large scale SPH and DEM simulation. In this work, we present the parallel implementation technique of SPH and DEM method utilizing dynamic load balancing algorithms toward the high resolution simulation over large domain using the massively parallel super computer system. Our method utilizes the imbalances of the executed time of each MPI process as the nonlinear term of parallel domain decomposition and minimizes them with the Newton like iteration method. In order to perform flexible domain decomposition in space, the slice-grid algorithm is used. Numerical tests show that our approach is suitable for solving the particles with different calculation costs (e.g. boundary particles) as well as the heterogeneous computer architecture. We analyze the parallel efficiency and scalability on the super computer systems (K-computer, Earth simulator 3, etc.).
2014-01-01
Background The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n3) and of O(n5) order, respectively, and so, the algorithm is unaffordable for huge data sets. Results We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. Conclusions In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one. PMID:25077818
D'Angelo, Gianni; Rampone, Salvatore
2014-01-01
The huge quantity of data produced in Biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard to solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. The great versatility that characterizes it, makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However the memory and the execution time required by the running are of O(n(3)) and of O(n(5)) order, respectively, and so, the algorithm is unaffordable for huge data sets. We find mathematical and programming solutions able to lead us towards the implementation of the algorithm U-BRAIN on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use the mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the costs of the communications between different memories (RAM, Cache, Mass, Virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as a SPMD system together to a Message-Passing Programming Paradigm. Here, we adopt the high-level message-passing systems MPI (Message Passing Interface) in the version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of a communication protocol design among the processors involved. Efficient synchronization design is also discussed. In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure of Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice- junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach the best values within about 90 processors. Then the parallelization advantage is balanced by the greater cost of non-local communications between the processors. A similar behaviour is evident on HS3D, but at a greater number of processors, so evidencing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Brouwer, Randall Jay
1991-01-01
The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
Regularization iteration imaging algorithm for electrical capacitance tomography
NASA Astrophysics Data System (ADS)
Tong, Guowei; Liu, Shi; Chen, Hongyan; Wang, Xueyao
2018-03-01
The image reconstruction method plays a crucial role in real-world applications of the electrical capacitance tomography technique. In this study, a new cost function that simultaneously considers the sparsity and low-rank properties of the imaging targets is proposed to improve the quality of the reconstruction images, in which the image reconstruction task is converted into an optimization problem. Within the framework of the split Bregman algorithm, an iterative scheme that splits a complicated optimization problem into several simpler sub-tasks is developed to solve the proposed cost function efficiently, in which the fast-iterative shrinkage thresholding algorithm is introduced to accelerate the convergence. Numerical experiment results verify the effectiveness of the proposed algorithm in improving the reconstruction precision and robustness.
NASA Astrophysics Data System (ADS)
Markou, A. A.; Manolis, G. D.
2018-03-01
Numerical methods for the solution of dynamical problems in engineering go back to 1950. The most famous and widely-used time stepping algorithm was developed by Newmark in 1959. In the present study, for the first time, the Newmark algorithm is developed for the case of the trilinear hysteretic model, a model that was used to describe the shear behaviour of high damping rubber bearings. This model is calibrated against free-vibration field tests implemented on a hybrid base isolated building, namely the Solarino project in Italy, as well as against laboratory experiments. A single-degree-of-freedom system is used to describe the behaviour of a low-rise building isolated with a hybrid system comprising high damping rubber bearings and low friction sliding bearings. The behaviour of the high damping rubber bearings is simulated by the trilinear hysteretic model, while the description of the behaviour of the low friction sliding bearings is modeled by a linear Coulomb friction model. In order to prove the effectiveness of the numerical method we compare the analytically solved trilinear hysteretic model calibrated from free-vibration field tests (Solarino project) against the same model solved with the Newmark method with Netwon-Raphson iteration. Almost perfect agreement is observed between the semi-analytical solution and the fully numerical solution with Newmark's time integration algorithm. This will allow for extension of the trilinear mechanical models to bidirectional horizontal motion, to time-varying vertical loads, to multi-degree-of-freedom-systems, as well to generalized models connected in parallel, where only numerical solutions are possible.
Tail Biting Trellis Representation of Codes: Decoding and Construction
NASA Technical Reports Server (NTRS)
Shao. Rose Y.; Lin, Shu; Fossorier, Marc
1999-01-01
This paper presents two new iterative algorithms for decoding linear codes based on their tail biting trellises, one is unidirectional and the other is bidirectional. Both algorithms are computationally efficient and achieves virtually optimum error performance with a small number of decoding iterations. They outperform all the previous suboptimal decoding algorithms. The bidirectional algorithm also reduces decoding delay. Also presented in the paper is a method for constructing tail biting trellises for linear block codes.
Partial volume segmentation in 3D of lesions and tissues in magnetic resonance images
NASA Astrophysics Data System (ADS)
Johnston, Brian; Atkins, M. Stella; Booth, Kellogg S.
1994-05-01
An important first step in diagnosis and treatment planning using tomographic imaging is differentiating and quantifying diseased as well as healthy tissue. One of the difficulties encountered in solving this problem to date has been distinguishing the partial volume constituents of each voxel in the image volume. Most proposed solutions to this problem involve analysis of planar images, in sequence, in two dimensions only. We have extended a model-based method of image segmentation which applies the technique of iterated conditional modes in three dimensions. A minimum of user intervention is required to train the algorithm. Partial volume estimates for each voxel in the image are obtained yielding fractional compositions of multiple tissue types for individual voxels. A multispectral approach is applied, where spatially registered data sets are available. The algorithm is simple and has been parallelized using a dataflow programming environment to reduce the computational burden. The algorithm has been used to segment dual echo MRI data sets of multiple sclerosis patients using lesions, gray matter, white matter, and cerebrospinal fluid as the partial volume constituents. The results of the application of the algorithm to these datasets is presented and compared to the manual lesion segmentation of the same data.
Parallel solution of the symmetric tridiagonal eigenproblem. Research report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1989-10-01
This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed-memory Multiple Instruction, Multiple Data multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speed up, and accuracy. Experiments on an IPSC hypercube multiprocessor reveal that Cuppen's method ismore » the most accurate approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effect of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptions of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Parallel solution of the symmetric tridiagonal eigenproblem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1989-01-01
This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed memory MIMD multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues, and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speedup, and accuracy. Experiments on an iPSC hyper-cube multiprocessor reveal that Cuppen's method is the most accuratemore » approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effects of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptations of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Profiles of electrified drops and bubbles
NASA Technical Reports Server (NTRS)
Basaran, O. A.; Scriven, L. E.
1982-01-01
Axisymmetric equilibrium shapes of conducting drops and bubbles, (1) pendant or sessile on one face of a circular parallel-plate capacitor or (2) free and surface-charged, are found by solving simultaneously the free boundary problem consisting of the augmented Young-Laplace equation for surface shape and the Laplace equation for electrostatic field, given the surface potential. The problem is nonlinear and the method is a finite element algorithm employing Newton iteration, a modified frontal solver, and triangular as well as quadrilateral tessellations of the domain exterior to the drop in order to facilitate refined analysis of sharply curved drop tips seen in experiments. The stability limit predicted by this computer-aided theoretical analysis agrees well with experiments.
1987-02-28
g Ru + Ft~FzQZ...0 otherwise, and let X be the nN costate vector X =(Xt, X, , t,)t )X in E defined by X = Fz-tQz. Then, given u and z 0, the gradient g may be...34 ’ ’-" ’.’.’.’ ’ ". ."’.’,’,’ ./, ’ ’- ," ," .- ’- ". %.’’.’ "- ’" " ,,’’ " "- . "-" ’ ,r .’’"" ’ "." .r -".’ ". ." ." ." " . ." " . €"". ’.- -8- g Ru + FIX. (6) With the notation
Blind One-Bit Compressive Sampling
2013-01-17
14] Q. Li, C. A. Micchelli, L. Shen, and Y. Xu, A proximity algorithm accelerated by Gauss - Seidel iterations for L1/TV denoising models, Inverse...methods for nonconvex optimization on the unit sphere and has a provable convergence guarantees. Binary iterative hard thresholding (BIHT) algorithms were... Convergence analysis of the algorithm is presented. Our approach is to obtain a sequence of optimization problems by successively approximating the ℓ0
Iterative cross section sequence graph for handwritten character segmentation.
Dawoud, Amer
2007-08-01
The iterative cross section sequence graph (ICSSG) is an algorithm for handwritten character segmentation. It expands the cross section sequence graph concept by applying it iteratively at equally spaced thresholds. The iterative thresholding reduces the effect of information loss associated with image binarization. ICSSG preserves the characters' skeletal structure by preventing the interference of pixels that causes flooding of adjacent characters' segments. Improving the structural quality of the characters' skeleton facilitates better feature extraction and classification, which improves the overall performance of optical character recognition (OCR). Experimental results showed significant improvements in OCR recognition rates compared to other well-established segmentation algorithms.
Shading correction assisted iterative cone-beam CT reconstruction
NASA Astrophysics Data System (ADS)
Yang, Chunlin; Wu, Pengwei; Gong, Shutao; Wang, Jing; Lyu, Qihui; Tang, Xiangyang; Niu, Tianye
2017-11-01
Recent advances in total variation (TV) technology enable accurate CT image reconstruction from highly under-sampled and noisy projection data. The standard iterative reconstruction algorithms, which work well in conventional CT imaging, fail to perform as expected in cone beam CT (CBCT) applications, wherein the non-ideal physics issues, including scatter and beam hardening, are more severe. These physics issues result in large areas of shading artifacts and cause deterioration to the piecewise constant property assumed in reconstructed images. To overcome this obstacle, we incorporate a shading correction scheme into low-dose CBCT reconstruction and propose a clinically acceptable and stable three-dimensional iterative reconstruction method that is referred to as the shading correction assisted iterative reconstruction. In the proposed method, we modify the TV regularization term by adding a shading compensation image to the reconstructed image to compensate for the shading artifacts while leaving the data fidelity term intact. This compensation image is generated empirically, using image segmentation and low-pass filtering, and updated in the iterative process whenever necessary. When the compensation image is determined, the objective function is minimized using the fast iterative shrinkage-thresholding algorithm accelerated on a graphic processing unit. The proposed method is evaluated using CBCT projection data of the Catphan© 600 phantom and two pelvis patients. Compared with the iterative reconstruction without shading correction, the proposed method reduces the overall CT number error from around 200 HU to be around 25 HU and increases the spatial uniformity by a factor of 20 percent, given the same number of sparsely sampled projections. A clinically acceptable and stable iterative reconstruction algorithm for CBCT is proposed in this paper. Differing from the existing algorithms, this algorithm incorporates a shading correction scheme into the low-dose CBCT reconstruction and achieves more stable optimization path and more clinically acceptable reconstructed image. The method proposed by us does not rely on prior information and thus is practically attractive to the applications of low-dose CBCT imaging in the clinic.
Performance analysis of improved iterated cubature Kalman filter and its application to GNSS/INS.
Cui, Bingbo; Chen, Xiyuan; Xu, Yuan; Huang, Haoqian; Liu, Xiao
2017-01-01
In order to improve the accuracy and robustness of GNSS/INS navigation system, an improved iterated cubature Kalman filter (IICKF) is proposed by considering the state-dependent noise and system uncertainty. First, a simplified framework of iterated Gaussian filter is derived by using damped Newton-Raphson algorithm and online noise estimator. Then the effect of state-dependent noise coming from iterated update is analyzed theoretically, and an augmented form of CKF algorithm is applied to improve the estimation accuracy. The performance of IICKF is verified by field test and numerical simulation, and results reveal that, compared with non-iterated filter, iterated filter is less sensitive to the system uncertainty, and IICKF improves the accuracy of yaw, roll and pitch by 48.9%, 73.1% and 83.3%, respectively, compared with traditional iterated KF. Copyright © 2016 ISA. Published by Elsevier Ltd. All rights reserved.
On-line range images registration with GPGPU
NASA Astrophysics Data System (ADS)
Będkowski, J.; Naruniec, J.
2013-03-01
This paper concerns implementation of algorithms in the two important aspects of modern 3D data processing: data registration and segmentation. Solution proposed for the first topic is based on the 3D space decomposition, while the latter on image processing and local neighbourhood search. Data processing is implemented by using NVIDIA compute unified device architecture (NIVIDIA CUDA) parallel computation. The result of the segmentation is a coloured map where different colours correspond to different objects, such as walls, floor and stairs. The research is related to the problem of collecting 3D data with a RGB-D camera mounted on a rotated head, to be used in mobile robot applications. Performance of the data registration algorithm is aimed for on-line processing. The iterative closest point (ICP) approach is chosen as a registration method. Computations are based on the parallel fast nearest neighbour search. This procedure decomposes 3D space into cubic buckets and, therefore, the time of the matching is deterministic. First technique of the data segmentation uses accele-rometers integrated with a RGB-D sensor to obtain rotation compensation and image processing method for defining pre-requisites of the known categories. The second technique uses the adapted nearest neighbour search procedure for obtaining normal vectors for each range point.
Parallel iterative solution for h and p approximations of the shallow water equations
Barragy, E.J.; Walters, R.A.
1998-01-01
A p finite element scheme and parallel iterative solver are introduced for a modified form of the shallow water equations. The governing equations are the three-dimensional shallow water equations. After a harmonic decomposition in time and rearrangement, the resulting equations are a complex Helmholz problem for surface elevation, and a complex momentum equation for the horizontal velocity. Both equations are nonlinear and the resulting system is solved using the Picard iteration combined with a preconditioned biconjugate gradient (PBCG) method for the linearized subproblems. A subdomain-based parallel preconditioner is developed which uses incomplete LU factorization with thresholding (ILUT) methods within subdomains, overlapping ILUT factorizations for subdomain boundaries and under-relaxed iteration for the resulting block system. The method builds on techniques successfully applied to linear elements by introducing ordering and condensation techniques to handle uniform p refinement. The combined methods show good performance for a range of p (element order), h (element size), and N (number of processors). Performance and scalability results are presented for a field scale problem where up to 512 processors are used. ?? 1998 Elsevier Science Ltd. All rights reserved.
MR Image Reconstruction Using Block Matching and Adaptive Kernel Methods.
Schmidt, Johannes F M; Santelli, Claudio; Kozerke, Sebastian
2016-01-01
An approach to Magnetic Resonance (MR) image reconstruction from undersampled data is proposed. Undersampling artifacts are removed using an iterative thresholding algorithm applied to nonlinearly transformed image block arrays. Each block array is transformed using kernel principal component analysis where the contribution of each image block to the transform depends in a nonlinear fashion on the distance to other image blocks. Elimination of undersampling artifacts is achieved by conventional principal component analysis in the nonlinear transform domain, projection onto the main components and back-mapping into the image domain. Iterative image reconstruction is performed by interleaving the proposed undersampling artifact removal step and gradient updates enforcing consistency with acquired k-space data. The algorithm is evaluated using retrospectively undersampled MR cardiac cine data and compared to k-t SPARSE-SENSE, block matching with spatial Fourier filtering and k-t ℓ1-SPIRiT reconstruction. Evaluation of image quality and root-mean-squared-error (RMSE) reveal improved image reconstruction for up to 8-fold undersampled data with the proposed approach relative to k-t SPARSE-SENSE, block matching with spatial Fourier filtering and k-t ℓ1-SPIRiT. In conclusion, block matching and kernel methods can be used for effective removal of undersampling artifacts in MR image reconstruction and outperform methods using standard compressed sensing and ℓ1-regularized parallel imaging methods.
Digital tomosynthesis mammography using a parallel maximum-likelihood reconstruction method
NASA Astrophysics Data System (ADS)
Wu, Tao; Zhang, Juemin; Moore, Richard; Rafferty, Elizabeth; Kopans, Daniel; Meleis, Waleed; Kaeli, David
2004-05-01
A parallel reconstruction method, based on an iterative maximum likelihood (ML) algorithm, is developed to provide fast reconstruction for digital tomosynthesis mammography. Tomosynthesis mammography acquires 11 low-dose projections of a breast by moving an x-ray tube over a 50° angular range. In parallel reconstruction, each projection is divided into multiple segments along the chest-to-nipple direction. Using the 11 projections, segments located at the same distance from the chest wall are combined to compute a partial reconstruction of the total breast volume. The shape of the partial reconstruction forms a thin slab, angled toward the x-ray source at a projection angle 0°. The reconstruction of the total breast volume is obtained by merging the partial reconstructions. The overlap region between neighboring partial reconstructions and neighboring projection segments is utilized to compensate for the incomplete data at the boundary locations present in the partial reconstructions. A serial execution of the reconstruction is compared to a parallel implementation, using clinical data. The serial code was run on a PC with a single PentiumIV 2.2GHz CPU. The parallel implementation was developed using MPI and run on a 64-node Linux cluster using 800MHz Itanium CPUs. The serial reconstruction for a medium-sized breast (5cm thickness, 11cm chest-to-nipple distance) takes 115 minutes, while a parallel implementation takes only 3.5 minutes. The reconstruction time for a larger breast using a serial implementation takes 187 minutes, while a parallel implementation takes 6.5 minutes. No significant differences were observed between the reconstructions produced by the serial and parallel implementations.
A depth-first search algorithm to compute elementary flux modes by linear programming.
Quek, Lake-Ee; Nielsen, Lars K
2014-07-30
The decomposition of complex metabolic networks into elementary flux modes (EFMs) provides a useful framework for exploring reaction interactions systematically. Generating a complete set of EFMs for large-scale models, however, is near impossible. Even for moderately-sized models (<400 reactions), existing approaches based on the Double Description method must iterate through a large number of combinatorial candidates, thus imposing an immense processor and memory demand. Based on an alternative elementarity test, we developed a depth-first search algorithm using linear programming (LP) to enumerate EFMs in an exhaustive fashion. Constraints can be introduced to directly generate a subset of EFMs satisfying the set of constraints. The depth-first search algorithm has a constant memory overhead. Using flux constraints, a large LP problem can be massively divided and parallelized into independent sub-jobs for deployment into computing clusters. Since the sub-jobs do not overlap, the approach scales to utilize all available computing nodes with minimal coordination overhead or memory limitations. The speed of the algorithm was comparable to efmtool, a mainstream Double Description method, when enumerating all EFMs; the attrition power gained from performing flux feasibility tests offsets the increased computational demand of running an LP solver. Unlike the Double Description method, the algorithm enables accelerated enumeration of all EFMs satisfying a set of constraints.
Robust non-rigid registration algorithm based on local affine registration
NASA Astrophysics Data System (ADS)
Wu, Liyang; Xiong, Lei; Du, Shaoyi; Bi, Duyan; Fang, Ting; Liu, Kun; Wu, Dongpeng
2018-04-01
Aiming at the problem that the traditional point set non-rigid registration algorithm has low precision and slow convergence speed for complex local deformation data, this paper proposes a robust non-rigid registration algorithm based on local affine registration. The algorithm uses a hierarchical iterative method to complete the point set non-rigid registration from coarse to fine. In each iteration, the sub data point sets and sub model point sets are divided and the shape control points of each sub point set are updated. Then we use the control point guided affine ICP algorithm to solve the local affine transformation between the corresponding sub point sets. Next, the local affine transformation obtained by the previous step is used to update the sub data point sets and their shape control point sets. When the algorithm reaches the maximum iteration layer K, the loop ends and outputs the updated sub data point sets. Experimental results demonstrate that the accuracy and convergence of our algorithm are greatly improved compared with the traditional point set non-rigid registration algorithms.
Chen, Tinggui; Xiao, Renbin
2014-01-01
Due to fierce market competition, how to improve product quality and reduce development cost determines the core competitiveness of enterprises. However, design iteration generally causes increases of product cost and delays of development time as well, so how to identify and model couplings among tasks in product design and development has become an important issue for enterprises to settle. In this paper, the shortcomings existing in WTM model are discussed and tearing approach as well as inner iteration method is used to complement the classic WTM model. In addition, the ABC algorithm is also introduced to find out the optimal decoupling schemes. In this paper, firstly, tearing approach and inner iteration method are analyzed for solving coupled sets. Secondly, a hybrid iteration model combining these two technologies is set up. Thirdly, a high-performance swarm intelligence algorithm, artificial bee colony, is adopted to realize problem-solving. Finally, an engineering design of a chemical processing system is given in order to verify its reasonability and effectiveness.
Adaptive Dynamic Programming for Discrete-Time Zero-Sum Games.
Wei, Qinglai; Liu, Derong; Lin, Qiao; Song, Ruizhuo
2018-04-01
In this paper, a novel adaptive dynamic programming (ADP) algorithm, called "iterative zero-sum ADP algorithm," is developed to solve infinite-horizon discrete-time two-player zero-sum games of nonlinear systems. The present iterative zero-sum ADP algorithm permits arbitrary positive semidefinite functions to initialize the upper and lower iterations. A novel convergence analysis is developed to guarantee the upper and lower iterative value functions to converge to the upper and lower optimums, respectively. When the saddle-point equilibrium exists, it is emphasized that both the upper and lower iterative value functions are proved to converge to the optimal solution of the zero-sum game, where the existence criteria of the saddle-point equilibrium are not required. If the saddle-point equilibrium does not exist, the upper and lower optimal performance index functions are obtained, respectively, where the upper and lower performance index functions are proved to be not equivalent. Finally, simulation results and comparisons are shown to illustrate the performance of the present method.
Convergence of Proximal Iteratively Reweighted Nuclear Norm Algorithm for Image Processing.
Sun, Tao; Jiang, Hao; Cheng, Lizhi
2017-08-25
The nonsmooth and nonconvex regularization has many applications in imaging science and machine learning research due to its excellent recovery performance. A proximal iteratively reweighted nuclear norm algorithm has been proposed for the nonsmooth and nonconvex matrix minimizations. In this paper, we aim to investigate the convergence of the algorithm. With the Kurdyka-Łojasiewicz property, we prove the algorithm globally converges to a critical point of the objective function. The numerical results presented in this paper coincide with our theoretical findings.
A Fast and Accurate Algorithm for l1 Minimization Problems in Compressive Sampling (Preprint)
2013-01-22
However, updating uk+1 via the formulation of Step 2 in Algorithm 1 can be implemented through the use of the component-wise Gauss - Seidel iteration which...may accelerate the rate of convergence of the algorithm and therefore reduce the total CPU-time consumed. The efficiency of component-wise Gauss - Seidel ...Micchelli, L. Shen, and Y. Xu, A proximity algorithm accelerated by Gauss - Seidel iterations for L1/TV denoising models, Inverse Problems, 28 (2012), p
NASA Astrophysics Data System (ADS)
Koldan, Jelena; Puzyrev, Vladimir; de la Puente, Josep; Houzeaux, Guillaume; Cela, José María
2014-06-01
We present an elaborate preconditioning scheme for Krylov subspace methods which has been developed to improve the performance and reduce the execution time of parallel node-based finite-element (FE) solvers for 3-D electromagnetic (EM) numerical modelling in exploration geophysics. This new preconditioner is based on algebraic multigrid (AMG) that uses different basic relaxation methods, such as Jacobi, symmetric successive over-relaxation (SSOR) and Gauss-Seidel, as smoothers and the wave front algorithm to create groups, which are used for a coarse-level generation. We have implemented and tested this new preconditioner within our parallel nodal FE solver for 3-D forward problems in EM induction geophysics. We have performed series of experiments for several models with different conductivity structures and characteristics to test the performance of our AMG preconditioning technique when combined with biconjugate gradient stabilized method. The results have shown that, the more challenging the problem is in terms of conductivity contrasts, ratio between the sizes of grid elements and/or frequency, the more benefit is obtained by using this preconditioner. Compared to other preconditioning schemes, such as diagonal, SSOR and truncated approximate inverse, the AMG preconditioner greatly improves the convergence of the iterative solver for all tested models. Also, when it comes to cases in which other preconditioners succeed to converge to a desired precision, AMG is able to considerably reduce the total execution time of the forward-problem code-up to an order of magnitude. Furthermore, the tests have confirmed that our AMG scheme ensures grid-independent rate of convergence, as well as improvement in convergence regardless of how big local mesh refinements are. In addition, AMG is designed to be a black-box preconditioner, which makes it easy to use and combine with different iterative methods. Finally, it has proved to be very practical and efficient in the parallel context.
A multiresolution approach to iterative reconstruction algorithms in X-ray computed tomography.
De Witte, Yoni; Vlassenbroeck, Jelle; Van Hoorebeke, Luc
2010-09-01
In computed tomography, the application of iterative reconstruction methods in practical situations is impeded by their high computational demands. Especially in high resolution X-ray computed tomography, where reconstruction volumes contain a high number of volume elements (several giga voxels), this computational burden prevents their actual breakthrough. Besides the large amount of calculations, iterative algorithms require the entire volume to be kept in memory during reconstruction, which quickly becomes cumbersome for large data sets. To overcome this obstacle, we present a novel multiresolution reconstruction, which greatly reduces the required amount of memory without significantly affecting the reconstructed image quality. It is shown that, combined with an efficient implementation on a graphical processing unit, the multiresolution approach enables the application of iterative algorithms in the reconstruction of large volumes at an acceptable speed using only limited resources.
A class of parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Mai, Xiaofeng; Liu, Jie; Wu, Xiong; Zhang, Qun; Guo, Changjian; Yang, Yanfu; Li, Zhaohui
2017-02-06
A Stokes-space modulation format classification (MFC) technique is proposed for coherent optical receivers by using a non-iterative clustering algorithm. In the clustering algorithm, two simple parameters are calculated to help find the density peaks of the data points in Stokes space and no iteration is required. Correct MFC can be realized in numerical simulations among PM-QPSK, PM-8QAM, PM-16QAM, PM-32QAM and PM-64QAM signals within practical optical signal-to-noise ratio (OSNR) ranges. The performance of the proposed MFC algorithm is also compared with those of other schemes based on clustering algorithms. The simulation results show that good classification performance can be achieved using the proposed MFC scheme with moderate time complexity. Proof-of-concept experiments are finally implemented to demonstrate MFC among PM-QPSK/16QAM/64QAM signals, which confirm the feasibility of our proposed MFC scheme.
Self-adaptive predictor-corrector algorithm for static nonlinear structural analysis
NASA Technical Reports Server (NTRS)
Padovan, J.
1981-01-01
A multiphase selfadaptive predictor corrector type algorithm was developed. This algorithm enables the solution of highly nonlinear structural responses including kinematic, kinetic and material effects as well as pro/post buckling behavior. The strategy involves three main phases: (1) the use of a warpable hyperelliptic constraint surface which serves to upperbound dependent iterate excursions during successive incremental Newton Ramphson (INR) type iterations; (20 uses an energy constraint to scale the generation of successive iterates so as to maintain the appropriate form of local convergence behavior; (3) the use of quality of convergence checks which enable various self adaptive modifications of the algorithmic structure when necessary. The restructuring is achieved by tightening various conditioning parameters as well as switch to different algorithmic levels to improve the convergence process. The capabilities of the procedure to handle various types of static nonlinear structural behavior are illustrated.
Run-time parallelization and scheduling of loops
NASA Technical Reports Server (NTRS)
Saltz, Joel H.; Mirchandaney, Ravi; Crowley, Kay
1990-01-01
Run time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases, where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run time, wave fronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
Multi-limit unsymmetrical MLIBD image restoration algorithm
NASA Astrophysics Data System (ADS)
Yang, Yang; Cheng, Yiping; Chen, Zai-wang; Bo, Chen
2012-11-01
A novel multi-limit unsymmetrical iterative blind deconvolution(MLIBD) algorithm was presented to enhance the performance of adaptive optics image restoration.The algorithm enhances the reliability of iterative blind deconvolution by introducing the bandwidth limit into the frequency domain of point spread(PSF),and adopts the PSF dynamic support region estimation to improve the convergence speed.The unsymmetrical factor is automatically computed to advance its adaptivity.Image deconvolution comparing experiments between Richardson-Lucy IBD and MLIBD were done,and the result indicates that the iteration number is reduced by 22.4% and the peak signal-to-noise ratio is improved by 10.18dB with MLIBD method. The performance of MLIBD algorithm is outstanding in the images restoration the FK5-857 adaptive optics and the double-star adaptive optics.
Introducing parallelism to histogramming functions for GEM systems
NASA Astrophysics Data System (ADS)
Krawczyk, Rafał D.; Czarski, Tomasz; Kolasinski, Piotr; Pozniak, Krzysztof T.; Linczuk, Maciej; Byszuk, Adrian; Chernyshova, Maryna; Juszczyk, Bartlomiej; Kasprowicz, Grzegorz; Wojenski, Andrzej; Zabolotny, Wojciech
2015-09-01
This article is an assessment of potential parallelization of histogramming algorithms in GEM detector system. Histogramming and preprocessing algorithms in MATLAB were analyzed with regard to adding parallelism. Preliminary implementation of parallel strip histogramming resulted in speedup. Analysis of algorithms parallelizability is presented. Overview of potential hardware and software support to implement parallel algorithm is discussed.
Holmes, T J; Liu, Y H
1989-11-15
A maximum likelihood based iterative algorithm adapted from nuclear medicine imaging for noncoherent optical imaging was presented in a previous publication with some initial computer-simulation testing. This algorithm is identical in form to that previously derived in a different way by W. H. Richardson "Bayesian-Based Iterative Method of Image Restoration," J. Opt. Soc. Am. 62, 55-59 (1972) and L. B. Lucy "An Iterative Technique for the Rectification of Observed Distributions," Astron. J. 79, 745-765 (1974). Foreseen applications include superresolution and 3-D fluorescence microscopy. This paper presents further simulation testing of this algorithm and a preliminary experiment with a defocused camera. The simulations show quantified resolution improvement as a function of iteration number, and they show qualitatively the trend in limitations on restored resolution when noise is present in the data. Also shown are results of a simulation in restoring missing-cone information for 3-D imaging. Conclusions are in support of the feasibility of using these methods with real systems, while computational cost and timing estimates indicate that it should be realistic to implement these methods. Itis suggested in the Appendix that future extensions to the maximum likelihood based derivation of this algorithm will address some of the limitations that are experienced with the nonextended form of the algorithm presented here.
Kazemi, Mahdi; Arefi, Mohammad Mehdi
2017-03-01
In this paper, an online identification algorithm is presented for nonlinear systems in the presence of output colored noise. The proposed method is based on extended recursive least squares (ERLS) algorithm, where the identified system is in polynomial Wiener form. To this end, an unknown intermediate signal is estimated by using an inner iterative algorithm. The iterative recursive algorithm adaptively modifies the vector of parameters of the presented Wiener model when the system parameters vary. In addition, to increase the robustness of the proposed method against variations, a robust RLS algorithm is applied to the model. Simulation results are provided to show the effectiveness of the proposed approach. Results confirm that the proposed method has fast convergence rate with robust characteristics, which increases the efficiency of the proposed model and identification approach. For instance, the FIT criterion will be achieved 92% in CSTR process where about 400 data is used. Copyright © 2016 ISA. Published by Elsevier Ltd. All rights reserved.
A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Songhua; Krauthammer, Prof. Michael
2010-01-01
There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manuallymore » labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use.« less
Zhang, Jie; Fan, Shangang; Xiong, Jian; Cheng, Xiefeng; Sari, Hikmet; Adachi, Fumiyuki
2017-01-01
Both L1/2 and L2/3 are two typical non-convex regularizations of Lp (0
Li, Yunyi; Zhang, Jie; Fan, Shangang; Yang, Jie; Xiong, Jian; Cheng, Xiefeng; Sari, Hikmet; Adachi, Fumiyuki; Gui, Guan
2017-12-15
Both L 1/2 and L 2/3 are two typical non-convex regularizations of L p (0
Performance Models for the Spike Banded Linear System Solver
Manguoglu, Murat; Saied, Faisal; Sameh, Ahmed; ...
2011-01-01
With availability of large-scale parallel platforms comprised of tens-of-thousands of processors and beyond, there is significant impetus for the development of scalable parallel sparse linear system solvers and preconditioners. An integral part of this design process is the development of performance models capable of predicting performance and providing accurate cost models for the solvers and preconditioners. There has been some work in the past on characterizing performance of the iterative solvers themselves. In this paper, we investigate the problem of characterizing performance and scalability of banded preconditioners. Recent work has demonstrated the superior convergence properties and robustness of banded preconditioners,more » compared to state-of-the-art ILU family of preconditioners as well as algebraic multigrid preconditioners. Furthermore, when used in conjunction with efficient banded solvers, banded preconditioners are capable of significantly faster time-to-solution. Our banded solver, the Truncated Spike algorithm is specifically designed for parallel performance and tolerance to deep memory hierarchies. Its regular structure is also highly amenable to accurate performance characterization. Using these characteristics, we derive the following results in this paper: (i) we develop parallel formulations of the Truncated Spike solver, (ii) we develop a highly accurate pseudo-analytical parallel performance model for our solver, (iii) we show excellent predication capabilities of our model – based on which we argue the high scalability of our solver. Our pseudo-analytical performance model is based on analytical performance characterization of each phase of our solver. These analytical models are then parameterized using actual runtime information on target platforms. An important consequence of our performance models is that they reveal underlying performance bottlenecks in both serial and parallel formulations. All of our results are validated on diverse heterogeneous multiclusters – platforms for which performance prediction is particularly challenging. Finally, we provide predict the scalability of the Spike algorithm using up to 65,536 cores with our model. In this paper we extend the results presented in the Ninth International Symposium on Parallel and Distributed Computing.« less
NASA Technical Reports Server (NTRS)
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
Domain decomposition methods for the parallel computation of reacting flows
NASA Technical Reports Server (NTRS)
Keyes, David E.
1988-01-01
Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation which generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and also preconditioned iterative methods of Conjugate Gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate for it approximately 10-fold speedup on 16 processors.
Space shuttle propulsion parameter estimation using optimal estimation techniques
NASA Technical Reports Server (NTRS)
1983-01-01
The first twelve system state variables are presented with the necessary mathematical developments for incorporating them into the filter/smoother algorithm. Other state variables, i.e., aerodynamic coefficients can be easily incorporated into the estimation algorithm, representing uncertain parameters, but for initial checkout purposes are treated as known quantities. An approach for incorporating the NASA propulsion predictive model results into the optimal estimation algorithm was identified. This approach utilizes numerical derivatives and nominal predictions within the algorithm with global iterations of the algorithm. The iterative process is terminated when the quality of the estimates provided no longer significantly improves.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yeung, Yu-Hong; Pothen, Alex; Halappanavar, Mahantesh
We present an augmented matrix approach to update the solution to a linear system of equations when the coefficient matrix is modified by a few elements within a principal submatrix. This problem arises in the dynamic security analysis of a power grid, where operators need to performmore » $N-x$ contingency analysis, i.e., determine the state of the system when up to $x$ links from $N$ fail. Our algorithms augment the coefficient matrix to account for the changes in it, and then compute the solution to the augmented system without refactoring the modified matrix. We provide two algorithms, a direct method, and a hybrid direct-iterative method for solving the augmented system. We also exploit the sparsity of the matrices and vectors to accelerate the overall computation. Our algorithms are compared on three power grids with PARDISO, a parallel direct solver, and CHOLMOD, a direct solver with the ability to modify the Cholesky factors of the coefficient matrix. We show that our augmented algorithms outperform PARDISO (by two orders of magnitude), and CHOLMOD (by a factor of up to 5). Further, our algorithms scale better than CHOLMOD as the number of elements updated increases. The solutions are computed with high accuracy. Our algorithms are capable of computing $N-x$ contingency analysis on a $778K$ bus grid, updating a solution with $x=20$ elements in $$1.6 \\times 10^{-2}$$ seconds on an Intel Xeon processor.« less
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm
NASA Astrophysics Data System (ADS)
Backes, Werner; Wetzel, Susanne
In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Minimizing inner product data dependencies in conjugate gradient iteration
NASA Technical Reports Server (NTRS)
Vanrosendale, J.
1983-01-01
The amount of concurrency available in conjugate gradient iteration is limited by the summations required in the inner product computations. The inner product of two vectors of length N requires time c log(N), if N or more processors are available. This paper describes an algebraic restructuring of the conjugate gradient algorithm which minimizes data dependencies due to inner product calculations. After an initial start up, the new algorithm can perform a conjugate gradient iteration in time c*log(log(N)).
NASA Technical Reports Server (NTRS)
Fijany, Amir
1993-01-01
In this paper, parallel O(log n) algorithms for computation of rigid multibody dynamics are developed. These parallel algorithms are derived by parallelization of new O(n) algorithms for the problem. The underlying feature of these O(n) algorithms is a drastically different strategy for decomposition of interbody force which leads to a new factorization of the mass matrix (M). Specifically, it is shown that a factorization of the inverse of the mass matrix in the form of the Schur Complement is derived as M(exp -1) = C - B(exp *)A(exp -1)B, wherein matrices C, A, and B are block tridiagonal matrices. The new O(n) algorithm is then derived as a recursive implementation of this factorization of M(exp -1). For the closed-chain systems, similar factorizations and O(n) algorithms for computation of Operational Space Mass Matrix lambda and its inverse lambda(exp -1) are also derived. It is shown that these O(n) algorithms are strictly parallel, that is, they are less efficient than other algorithms for serial computation of the problem. But, to our knowledge, they are the only known algorithms that can be parallelized and that lead to both time- and processor-optimal parallel algorithms for the problem, i.e., parallel O(log n) algorithms with O(n) processors. The developed parallel algorithms, in addition to their theoretical significance, are also practical from an implementation point of view due to their simple architectural requirements.
NASA Astrophysics Data System (ADS)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Iterative inversion of deformation vector fields with feedback control.
Dubey, Abhishek; Iliopoulos, Alexandros-Stavros; Sun, Xiaobai; Yin, Fang-Fang; Ren, Lei
2018-05-14
Often, the inverse deformation vector field (DVF) is needed together with the corresponding forward DVF in four-dimesional (4D) reconstruction and dose calculation, adaptive radiation therapy, and simultaneous deformable registration. This study aims at improving both accuracy and efficiency of iterative algorithms for DVF inversion, and advancing our understanding of divergence and latency conditions. We introduce a framework of fixed-point iteration algorithms with active feedback control for DVF inversion. Based on rigorous convergence analysis, we design control mechanisms for modulating the inverse consistency (IC) residual of the current iterate, to be used as feedback into the next iterate. The control is designed adaptively to the input DVF with the objective to enlarge the convergence area and expedite convergence. Three particular settings of feedback control are introduced: constant value over the domain throughout the iteration; alternating values between iteration steps; and spatially variant values. We also introduce three spectral measures of the displacement Jacobian for characterizing a DVF. These measures reveal the critical role of what we term the nontranslational displacement component (NTDC) of the DVF. We carry out inversion experiments with an analytical DVF pair, and with DVFs associated with thoracic CT images of six patients at end of expiration and end of inspiration. The NTDC-adaptive iterations are shown to attain a larger convergence region at a faster pace compared to previous nonadaptive DVF inversion iteration algorithms. By our numerical experiments, alternating control yields smaller IC residuals and inversion errors than constant control. Spatially variant control renders smaller residuals and errors by at least an order of magnitude, compared to other schemes, in no more than 10 steps. Inversion results also show remarkable quantitative agreement with analysis-based predictions. Our analysis captures properties of DVF data associated with clinical CT images, and provides new understanding of iterative DVF inversion algorithms with a simple residual feedback control. Adaptive control is necessary and highly effective in the presence of nonsmall NTDCs. The adaptive iterations or the spectral measures, or both, may potentially be incorporated into deformable image registration methods. © 2018 American Association of Physicists in Medicine.
ERIC Educational Resources Information Center
von Davier, Matthias
2016-01-01
This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response…
ProperCAD: A portable object-oriented parallel environment for VLSI CAD
NASA Technical Reports Server (NTRS)
Ramkumar, Balkrishna; Banerjee, Prithviraj
1993-01-01
Most parallel algorithms for VLSI CAD proposed to date have one important drawback: they work efficiently only on machines that they were designed for. As a result, algorithms designed to date are dependent on the architecture for which they are developed and do not port easily to other parallel architectures. A new project under way to address this problem is described. A Portable object-oriented parallel environment for CAD algorithms (ProperCAD) is being developed. The objectives of this research are (1) to develop new parallel algorithms that run in a portable object-oriented environment (CAD algorithms using a general purpose platform for portable parallel programming called CARM is being developed and a C++ environment that is truly object-oriented and specialized for CAD applications is also being developed); and (2) to design the parallel algorithms around a good sequential algorithm with a well-defined parallel-sequential interface (permitting the parallel algorithm to benefit from future developments in sequential algorithms). One CAD application that has been implemented as part of the ProperCAD project, flat VLSI circuit extraction, is described. The algorithm, its implementation, and its performance on a range of parallel machines are discussed in detail. It currently runs on an Encore Multimax, a Sequent Symmetry, Intel iPSC/2 and i860 hypercubes, a NCUBE 2 hypercube, and a network of Sun Sparc workstations. Performance data for other applications that were developed are provided: namely test pattern generation for sequential circuits, parallel logic synthesis, and standard cell placement.
An efficient iteration strategy for the solution of the Euler equations
NASA Technical Reports Server (NTRS)
Walters, R. W.; Dwoyer, D. L.
1985-01-01
A line Gauss-Seidel (LGS) relaxation algorithm in conjunction with a one-parameter family of upwind discretizations of the Euler equations in two-dimensions is described. The basic algorithm has the property that convergence to the steady-state is quadratic for fully supersonic flows and linear otherwise. This is in contrast to the block ADI methods (either central or upwind differenced) and the upwind biased relaxation schemes, all of which converge linearly, independent of the flow regime. Moreover, the algorithm presented here is easily enhanced to detect regions of subsonic flow embedded in supersonic flow. This allows marching by lines in the supersonic regions, converging each line quadratically, and iterating in the subsonic regions, thus yielding a very efficient iteration strategy. Numerical results are presented for two-dimensional supersonic and transonic flows containing both oblique and normal shock waves which confirm the efficiency of the iteration strategy.
Further investigation on "A multiplicative regularization for force reconstruction"
NASA Astrophysics Data System (ADS)
Aucejo, M.; De Smet, O.
2018-05-01
We have recently proposed a multiplicative regularization to reconstruct mechanical forces acting on a structure from vibration measurements. This method does not require any selection procedure for choosing the regularization parameter, since the amount of regularization is automatically adjusted throughout an iterative resolution process. The proposed iterative algorithm has been developed with performance and efficiency in mind, but it is actually a simplified version of a full iterative procedure not described in the original paper. The present paper aims at introducing the full resolution algorithm and comparing it with its simplified version in terms of computational efficiency and solution accuracy. In particular, it is shown that both algorithms lead to very similar identified solutions.
DAKOTA Design Analysis Kit for Optimization and Terascale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Brian M.; Dalbey, Keith R.; Eldred, Michael S.
2010-02-24
The DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a flexible and extensible interface between simulation codes (computational models) and iterative analysis methods. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the DAKOTA toolkit provides a flexible and extensible problem-solving environment for design and analysis of computational models on high performance computers.A user provides a set of DAKOTA commands in an input file and launches DAKOTA. DAKOTA invokes instances of the computational models, collects their results, and performs systems analyses. DAKOTA contains algorithms for optimization with gradient and nongradient-basedmore » methods; uncertainty quantification with sampling, reliability, polynomial chaos, stochastic collocation, and epistemic methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as hybrid optimization, surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. Services for parallel computing, simulation interfacing, approximation modeling, fault tolerance, restart, and graphics are also included.« less
Three-dimensional focus of attention for iterative cone-beam micro-CT reconstruction
NASA Astrophysics Data System (ADS)
Benson, T. M.; Gregor, J.
2006-09-01
Three-dimensional iterative reconstruction of high-resolution, circular orbit cone-beam x-ray CT data is often considered impractical due to the demand for vast amounts of computer cycles and associated memory. In this paper, we show that the computational burden can be reduced by limiting the reconstruction to a small, well-defined portion of the image volume. We first discuss using the support region defined by the set of voxels covered by all of the projection views. We then present a data-driven preprocessing technique called focus of attention that heuristically separates both image and projection data into object and background before reconstruction, thereby further reducing the reconstruction region of interest. We present experimental results for both methods based on mouse data and a parallelized implementation of the SIRT algorithm. The computational savings associated with the support region are substantial. However, the results for focus of attention are even more impressive in that only about one quarter of the computer cycles and memory are needed compared with reconstruction of the entire image volume. The image quality is not compromised by either method.
NASA Astrophysics Data System (ADS)
Baránek, M.; Běhal, J.; Bouchal, Z.
2018-01-01
In the phase retrieval applications, the Gerchberg-Saxton (GS) algorithm is widely used for the simplicity of implementation. This iterative process can advantageously be deployed in the combination with a spatial light modulator (SLM) enabling simultaneous correction of optical aberrations. As recently demonstrated, the accuracy and efficiency of the aberration correction using the GS algorithm can be significantly enhanced by a vortex image spot used as the target intensity pattern in the iterative process. Here we present an optimization of the spiral phase modulation incorporated into the GS algorithm.
Multitasking TORT under UNICOS: Parallel performance models and measurements
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barnett, A.; Azmy, Y.Y.
1999-09-27
The existing parallel algorithms in the TORT discrete ordinates code were updated to function in a UNICOS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Multitasking TORT Under UNICOS: Parallel Performance Models and Measurements
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azmy, Y.Y.; Barnett, D.A.
1999-09-27
The existing parallel algorithms in the TORT discrete ordinates were updated to function in a UNI-COS environment. A performance model for the parallel overhead was derived for the existing algorithms. The largest contributors to the parallel overhead were identified and a new algorithm was developed. A parallel overhead model was also derived for the new algorithm. The results of the comparison of parallel performance models were compared to applications of the code to two TORT standard test problems and a large production problem. The parallel performance models agree well with the measured parallel overhead.
Iterative updating of model error for Bayesian inversion
NASA Astrophysics Data System (ADS)
Calvetti, Daniela; Dunlop, Matthew; Somersalo, Erkki; Stuart, Andrew
2018-02-01
In computational inverse problems, it is common that a detailed and accurate forward model is approximated by a computationally less challenging substitute. The model reduction may be necessary to meet constraints in computing time when optimization algorithms are used to find a single estimate, or to speed up Markov chain Monte Carlo (MCMC) calculations in the Bayesian framework. The use of an approximate model introduces a discrepancy, or modeling error, that may have a detrimental effect on the solution of the ill-posed inverse problem, or it may severely distort the estimate of the posterior distribution. In the Bayesian paradigm, the modeling error can be considered as a random variable, and by using an estimate of the probability distribution of the unknown, one may estimate the probability distribution of the modeling error and incorporate it into the inversion. We introduce an algorithm which iterates this idea to update the distribution of the model error, leading to a sequence of posterior distributions that are demonstrated empirically to capture the underlying truth with increasing accuracy. Since the algorithm is not based on rejections, it requires only limited full model evaluations. We show analytically that, in the linear Gaussian case, the algorithm converges geometrically fast with respect to the number of iterations when the data is finite dimensional. For more general models, we introduce particle approximations of the iteratively generated sequence of distributions; we also prove that each element of the sequence converges in the large particle limit under a simplifying assumption. We show numerically that, as in the linear case, rapid convergence occurs with respect to the number of iterations. Additionally, we show through computed examples that point estimates obtained from this iterative algorithm are superior to those obtained by neglecting the model error.
Deconvolution of interferometric data using interior point iterative algorithms
NASA Astrophysics Data System (ADS)
Theys, C.; Lantéri, H.; Aime, C.
2016-09-01
We address the problem of deconvolution of astronomical images that could be obtained with future large interferometers in space. The presentation is made in two complementary parts. The first part gives an introduction to the image deconvolution with linear and nonlinear algorithms. The emphasis is made on nonlinear iterative algorithms that verify the constraints of non-negativity and constant flux. The Richardson-Lucy algorithm appears there as a special case for photon counting conditions. More generally, the algorithm published recently by Lanteri et al. (2015) is based on scale invariant divergences without assumption on the statistic model of the data. The two proposed algorithms are interior-point algorithms, the latter being more efficient in terms of speed of calculation. These algorithms are applied to the deconvolution of simulated images corresponding to an interferometric system of 16 diluted telescopes in space. Two non-redundant configurations, one disposed around a circle and the other on an hexagonal lattice, are compared for their effectiveness on a simple astronomical object. The comparison is made in the direct and Fourier spaces. Raw "dirty" images have many artifacts due to replicas of the original object. Linear methods cannot remove these replicas while iterative methods clearly show their efficacy in these examples.
NASA Astrophysics Data System (ADS)
Chun, Tae Yoon; Lee, Jae Young; Park, Jin Bae; Choi, Yoon Ho
2018-06-01
In this paper, we propose two multirate generalised policy iteration (GPI) algorithms applied to discrete-time linear quadratic regulation problems. The proposed algorithms are extensions of the existing GPI algorithm that consists of the approximate policy evaluation and policy improvement steps. The two proposed schemes, named heuristic dynamic programming (HDP) and dual HDP (DHP), based on multirate GPI, use multi-step estimation (M-step Bellman equation) at the approximate policy evaluation step for estimating the value function and its gradient called costate, respectively. Then, we show that these two methods with the same update horizon can be considered equivalent in the iteration domain. Furthermore, monotonically increasing and decreasing convergences, so called value iteration (VI)-mode and policy iteration (PI)-mode convergences, are proved to hold for the proposed multirate GPIs. Further, general convergence properties in terms of eigenvalues are also studied. The data-driven online implementation methods for the proposed HDP and DHP are demonstrated and finally, we present the results of numerical simulations performed to verify the effectiveness of the proposed methods.
2014-01-01
Due to fierce market competition, how to improve product quality and reduce development cost determines the core competitiveness of enterprises. However, design iteration generally causes increases of product cost and delays of development time as well, so how to identify and model couplings among tasks in product design and development has become an important issue for enterprises to settle. In this paper, the shortcomings existing in WTM model are discussed and tearing approach as well as inner iteration method is used to complement the classic WTM model. In addition, the ABC algorithm is also introduced to find out the optimal decoupling schemes. In this paper, firstly, tearing approach and inner iteration method are analyzed for solving coupled sets. Secondly, a hybrid iteration model combining these two technologies is set up. Thirdly, a high-performance swarm intelligence algorithm, artificial bee colony, is adopted to realize problem-solving. Finally, an engineering design of a chemical processing system is given in order to verify its reasonability and effectiveness. PMID:25431584
Novel aspects of plasma control in ITER
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humphreys, D.; Jackson, G.; Walker, M.
2015-02-15
ITER plasma control design solutions and performance requirements are strongly driven by its nuclear mission, aggressive commissioning constraints, and limited number of operational discharges. In addition, high plasma energy content, heat fluxes, neutron fluxes, and very long pulse operation place novel demands on control performance in many areas ranging from plasma boundary and divertor regulation to plasma kinetics and stability control. Both commissioning and experimental operations schedules provide limited time for tuning of control algorithms relative to operating devices. Although many aspects of the control solutions required by ITER have been well-demonstrated in present devices and even designed satisfactorily formore » ITER application, many elements unique to ITER including various crucial integration issues are presently under development. We describe selected novel aspects of plasma control in ITER, identifying unique parts of the control problem and highlighting some key areas of research remaining. Novel control areas described include control physics understanding (e.g., current profile regulation, tearing mode (TM) suppression), control mathematics (e.g., algorithmic and simulation approaches to high confidence robust performance), and integration solutions (e.g., methods for management of highly subscribed control resources). We identify unique aspects of the ITER TM suppression scheme, which will pulse gyrotrons to drive current within a magnetic island, and turn the drive off following suppression in order to minimize use of auxiliary power and maximize fusion gain. The potential role of active current profile control and approaches to design in ITER are discussed. Issues and approaches to fault handling algorithms are described, along with novel aspects of actuator sharing in ITER.« less
Novel aspects of plasma control in ITER
Humphreys, David; Ambrosino, G.; de Vries, Peter; ...
2015-02-12
ITER plasma control design solutions and performance requirements are strongly driven by its nuclear mission, aggressive commissioning constraints, and limited number of operational discharges. In addition, high plasma energy content, heat fluxes, neutron fluxes, and very long pulse operation place novel demands on control performance in many areas ranging from plasma boundary and divertor regulation to plasma kinetics and stability control. Both commissioning and experimental operations schedules provide limited time for tuning of control algorithms relative to operating devices. Although many aspects of the control solutions required by ITER have been well-demonstrated in present devices and even designed satisfactorily formore » ITER application, many elements unique to ITER including various crucial integration issues are presently under development. We describe selected novel aspects of plasma control in ITER, identifying unique parts of the control problem and highlighting some key areas of research remaining. Novel control areas described include control physics understanding (e.g. current profile regulation, tearing mode suppression (TM)), control mathematics (e.g. algorithmic and simulation approaches to high confidence robust performance), and integration solutions (e.g. methods for management of highly-subscribed control resources). We identify unique aspects of the ITER TM suppression scheme, which will pulse gyrotrons to drive current within a magnetic island, and turn the drive off following suppression in order to minimize use of auxiliary power and maximize fusion gain. The potential role of active current profile control and approaches to design in ITER are discussed. Finally, issues and approaches to fault handling algorithms are described, along with novel aspects of actuator sharing in ITER.« less
Efficient generation of discontinuity-preserving adaptive triangulations from range images.
Garcia, Miguel Angel; Sappa, Angel Domingo
2004-10-01
This paper presents an efficient technique for generating adaptive triangular meshes from range images. The algorithm consists of two stages. First, a user-defined number of points is adaptively sampled from the given range image. Those points are chosen by taking into account the surface shapes represented in the range image in such a way that points tend to group in areas of high curvature and to disperse in low-variation regions. This selection process is done through a noniterative, inherently parallel algorithm in order to gain efficiency. Once the image has been subsampled, the second stage applies a two and one half-dimensional Delaunay triangulation to obtain an initial triangular mesh. To favor the preservation of surface and orientation discontinuities (jump and crease edges) present in the original range image, the aforementioned triangular mesh is iteratively modified by applying an efficient edge flipping technique. Results with real range images show accurate triangular approximations of the given range images with low processing times.
Memory sparing, fast scattering formalism for rigorous diffraction modeling
NASA Astrophysics Data System (ADS)
Iff, W.; Kämpfe, T.; Jourlin, Y.; Tishchenko, A. V.
2017-07-01
The basics and algorithmic steps of a novel scattering formalism suited for memory sparing and fast electromagnetic calculations are presented. The formalism, called ‘S-vector algorithm’ (by analogy with the known scattering-matrix algorithm), allows the calculation of the collective scattering spectra of individual layered micro-structured scattering objects. A rigorous method of linear complexity is applied to model the scattering at individual layers; here the generalized source method (GSM) resorting to Fourier harmonics as basis functions is used as one possible method of linear complexity. The concatenation of the individual scattering events can be achieved sequentially or in parallel, both having pros and cons. The present development will largely concentrate on a consecutive approach based on the multiple reflection series. The latter will be reformulated into an implicit formalism which will be associated with an iterative solver, resulting in improved convergence. The examples will first refer to 1D grating diffraction for the sake of simplicity and intelligibility, with a final 2D application example.
Incorporation of a two metre long PET scanner in STIR
NASA Astrophysics Data System (ADS)
Tsoumpas, C.; Brain, C.; Dyke, T.; Gold, D.
2015-09-01
The Explorer project aims to investigate the potential benefits of a total-body 2 metre long PET scanner. The following investigation incorporates this scanner in STIR library and demonstrates the capabilities and weaknesses of existing reconstruction (FBP and OSEM) and single scatter simulation algorithms. It was found that sensible images are reconstructed but at the expense of high memory and processing time demands. FBP requires 4 hours on a core; OSEM: 2 hours per iteration if ran in parallel on 15-cores of a high performance computer. The single scatter simulation algorithm shows that on a short scale, up to a fifth of the scanner length, the assumption that the scatter between direct rings is similar to the scatter between the oblique rings is approximately valid. However, for more extreme cases this assumption is not longer valid, which illustrates that consideration of the oblique rings within the single scatter simulation will be necessary, if this scatter correction is the method of choice.
Multidisciplinary systems optimization by linear decomposition
NASA Technical Reports Server (NTRS)
Sobieski, J.
1984-01-01
In a typical design process major decisions are made sequentially. An illustrated example is given for an aircraft design in which the aerodynamic shape is usually decided first, then the airframe is sized for strength and so forth. An analogous sequence could be laid out for any other major industrial product, for instance, a ship. The loops in the discipline boxes symbolize iterative design improvements carried out within the confines of a single engineering discipline, or subsystem. The loops spanning several boxes depict multidisciplinary design improvement iterations. Omitted for graphical simplicity is parallelism of the disciplinary subtasks. The parallelism is important in order to develop a broad workfront necessary to shorten the design time. If all the intradisciplinary and interdisciplinary iterations were carried out to convergence, the process could yield a numerically optimal design. However, it usually stops short of that because of time and money limitations. This is especially true for the interdisciplinary iterations.
AMLSA Algorithm for Hybrid Precoding in Millimeter Wave MIMO Systems
NASA Astrophysics Data System (ADS)
Liu, Fulai; Sun, Zhenxing; Du, Ruiyan; Bai, Xiaoyu
2017-10-01
In this paper, an effective algorithm will be proposed for hybrid precoding in mmWave MIMO systems, referred to as alternating minimization algorithm with the least squares amendment (AMLSA algorithm). To be specific, for the fully-connected structure, the presented algorithm is exploited to minimize the classical objective function and obtain the hybrid precoding matrix. It introduces an orthogonal constraint to the digital precoding matrix which is amended subsequently by the least squares after obtaining its alternating minimization iterative result. Simulation results confirm that the achievable spectral efficiency of our proposed algorithm is better to some extent than that of the existing algorithm without the least squares amendment. Furthermore, the number of iterations is reduced slightly via improving the initialization procedure.
Block Iterative Methods for Elliptic and Parabolic Difference Equations.
1981-09-01
S V PARTER, M STEUERWALT N0OO14-7A-C-0341 UNCLASSIFIED CSTR -447 NL ENN.EEEEEN LLf SCOMPUTER SCIENCES c~DEPARTMENT SUniversity of Wisconsin- SMadison...suggests that iterative algorithms that solve for several points at once will converge more rapidly than point algorithms . The Gaussian elimination... algorithm is seen in this light to converge in one step. Frankel [14], Young [34], Arms, Gates, and Zondek [1], and Varga [32], using the algebraic structure
1990-02-01
noise. Tobias B. Orloff Work began on developing a high quality rendering algorithm based on the radiosity method. The algorithm is similar to...previous progressive radiosity algorithms except for the following improvements: 1. At each iteration vertex radiosities are computed using a modified scan...line approach, thus eliminating the quadratic cost associated with a ray tracing computation of vortex radiosities . 2. At each iteration the scene is
A real-time n/γ digital pulse shape discriminator based on FPGA.
Li, Shiping; Xu, Xiufeng; Cao, Hongrui; Yuan, Guoliang; Yang, Qingwei; Yin, Zejie
2013-02-01
A FPGA-based real-time digital pulse shape discriminator has been employed to distinguish between neutrons (n) and gammas (γ) in the Neutron Flux Monitor (NFM) for International Thermonuclear Experimental Reactor (ITER). The discriminator takes advantages of the Field Programmable Gate Array (FPGA) parallel and pipeline process capabilities to carry out the real-time sifting of neutrons in n/γ mixed radiation fields, and uses the rise time and amplitude inspection techniques simultaneously as the discrimination algorithm to observe good n/γ separation. Some experimental results have been presented which show that this discriminator can realize the anticipated goals of NFM perfectly with its excellent discrimination quality and zero dead time. Copyright © 2012 Elsevier Ltd. All rights reserved.
A multi-level solution algorithm for steady-state Markov chains
NASA Technical Reports Server (NTRS)
Horton, Graham; Leutenegger, Scott T.
1993-01-01
A new iterative algorithm, the multi-level algorithm, for the numerical solution of steady state Markov chains is presented. The method utilizes a set of recursively coarsened representations of the original system to achieve accelerated convergence. It is motivated by multigrid methods, which are widely used for fast solution of partial differential equations. Initial results of numerical experiments are reported, showing significant reductions in computation time, often an order of magnitude or more, relative to the Gauss-Seidel and optimal SOR algorithms for a variety of test problems. The multi-level method is compared and contrasted with the iterative aggregation-disaggregation algorithm of Takahashi.
Effective Iterated Greedy Algorithm for Flow-Shop Scheduling Problems with Time lags
NASA Astrophysics Data System (ADS)
ZHAO, Ning; YE, Song; LI, Kaidian; CHEN, Siyu
2017-05-01
Flow shop scheduling problem with time lags is a practical scheduling problem and attracts many studies. Permutation problem(PFSP with time lags) is concentrated but non-permutation problem(non-PFSP with time lags) seems to be neglected. With the aim to minimize the makespan and satisfy time lag constraints, efficient algorithms corresponding to PFSP and non-PFSP problems are proposed, which consist of iterated greedy algorithm for permutation(IGTLP) and iterated greedy algorithm for non-permutation (IGTLNP). The proposed algorithms are verified using well-known simple and complex instances of permutation and non-permutation problems with various time lag ranges. The permutation results indicate that the proposed IGTLP can reach near optimal solution within nearly 11% computational time of traditional GA approach. The non-permutation results indicate that the proposed IG can reach nearly same solution within less than 1% computational time compared with traditional GA approach. The proposed research combines PFSP and non-PFSP together with minimal and maximal time lag consideration, which provides an interesting viewpoint for industrial implementation.
The optimal algorithm for Multi-source RS image fusion.
Fu, Wei; Huang, Shui-Guang; Li, Zeng-Shun; Shen, Hao; Li, Jun-Shuai; Wang, Peng-Yuan
2016-01-01
In order to solve the issue which the fusion rules cannot be self-adaptively adjusted by using available fusion methods according to the subsequent processing requirements of Remote Sensing (RS) image, this paper puts forward GSDA (genetic-iterative self-organizing data analysis algorithm) by integrating the merit of genetic arithmetic together with the advantage of iterative self-organizing data analysis algorithm for multi-source RS image fusion. The proposed algorithm considers the wavelet transform of the translation invariance as the model operator, also regards the contrast pyramid conversion as the observed operator. The algorithm then designs the objective function by taking use of the weighted sum of evaluation indices, and optimizes the objective function by employing GSDA so as to get a higher resolution of RS image. As discussed above, the bullet points of the text are summarized as follows.•The contribution proposes the iterative self-organizing data analysis algorithm for multi-source RS image fusion.•This article presents GSDA algorithm for the self-adaptively adjustment of the fusion rules.•This text comes up with the model operator and the observed operator as the fusion scheme of RS image based on GSDA. The proposed algorithm opens up a novel algorithmic pathway for multi-source RS image fusion by means of GSDA.
A new pivoting and iterative text detection algorithm for biomedical images.
Xu, Songhua; Krauthammer, Michael
2010-12-01
There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use. Copyright © 2010 Elsevier Inc. All rights reserved.
Du, Shaoyi; Xu, Yiting; Wan, Teng; Hu, Huaizhong; Zhang, Sirui; Xu, Guanglin; Zhang, Xuetao
2017-01-01
The iterative closest point (ICP) algorithm is efficient and accurate for rigid registration but it needs the good initial parameters. It is easily failed when the rotation angle between two point sets is large. To deal with this problem, a new objective function is proposed by introducing a rotation invariant feature based on the Euclidean distance between each point and a global reference point, where the global reference point is a rotation invariant. After that, this optimization problem is solved by a variant of ICP algorithm, which is an iterative method. Firstly, the accurate correspondence is established by using the weighted rotation invariant feature distance and position distance together. Secondly, the rigid transformation is solved by the singular value decomposition method. Thirdly, the weight is adjusted to control the relative contribution of the positions and features. Finally this new algorithm accomplishes the registration by a coarse-to-fine way whatever the initial rotation angle is, which is demonstrated to converge monotonically. The experimental results validate that the proposed algorithm is more accurate and robust compared with the original ICP algorithm.
Du, Shaoyi; Xu, Yiting; Wan, Teng; Zhang, Sirui; Xu, Guanglin; Zhang, Xuetao
2017-01-01
The iterative closest point (ICP) algorithm is efficient and accurate for rigid registration but it needs the good initial parameters. It is easily failed when the rotation angle between two point sets is large. To deal with this problem, a new objective function is proposed by introducing a rotation invariant feature based on the Euclidean distance between each point and a global reference point, where the global reference point is a rotation invariant. After that, this optimization problem is solved by a variant of ICP algorithm, which is an iterative method. Firstly, the accurate correspondence is established by using the weighted rotation invariant feature distance and position distance together. Secondly, the rigid transformation is solved by the singular value decomposition method. Thirdly, the weight is adjusted to control the relative contribution of the positions and features. Finally this new algorithm accomplishes the registration by a coarse-to-fine way whatever the initial rotation angle is, which is demonstrated to converge monotonically. The experimental results validate that the proposed algorithm is more accurate and robust compared with the original ICP algorithm. PMID:29176780
NASA Astrophysics Data System (ADS)
Zhang, B.; Sang, Jun; Alam, Mohammad S.
2013-03-01
An image hiding method based on cascaded iterative Fourier transform and public-key encryption algorithm was proposed. Firstly, the original secret image was encrypted into two phase-only masks M1 and M2 via cascaded iterative Fourier transform (CIFT) algorithm. Then, the public-key encryption algorithm RSA was adopted to encrypt M2 into M2' . Finally, a host image was enlarged by extending one pixel into 2×2 pixels and each element in M1 and M2' was multiplied with a superimposition coefficient and added to or subtracted from two different elements in the 2×2 pixels of the enlarged host image. To recover the secret image from the stego-image, the two masks were extracted from the stego-image without the original host image. By applying public-key encryption algorithm, the key distribution was facilitated, and also compared with the image hiding method based on optical interference, the proposed method may reach higher robustness by employing the characteristics of the CIFT algorithm. Computer simulations show that this method has good robustness against image processing.
NASA Astrophysics Data System (ADS)
Qiu, Mo; Yu, Simin; Wen, Yuqiong; Lü, Jinhu; He, Jianbin; Lin, Zhuosheng
In this paper, a novel design methodology and its FPGA hardware implementation for a universal chaotic signal generator is proposed via the Verilog HDL fixed-point algorithm and state machine control. According to continuous-time or discrete-time chaotic equations, a Verilog HDL fixed-point algorithm and its corresponding digital system are first designed. In the FPGA hardware platform, each operation step of Verilog HDL fixed-point algorithm is then controlled by a state machine. The generality of this method is that, for any given chaotic equation, it can be decomposed into four basic operation procedures, i.e. nonlinear function calculation, iterative sequence operation, iterative values right shifting and ceiling, and chaotic iterative sequences output, each of which corresponds to only a state via state machine control. Compared with the Verilog HDL floating-point algorithm, the Verilog HDL fixed-point algorithm can save the FPGA hardware resources and improve the operation efficiency. FPGA-based hardware experimental results validate the feasibility and reliability of the proposed approach.
Scalable Parallel Density-based Clustering and Applications
NASA Astrophysics Data System (ADS)
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Li, Xia; Guo, Meifang; Su, Yongfu
2016-01-01
In this article, a new multidirectional monotone hybrid iteration algorithm for finding a solution to the split common fixed point problem is presented for two countable families of quasi-nonexpansive mappings in Banach spaces. Strong convergence theorems are proved. The application of the result is to consider the split common null point problem of maximal monotone operators in Banach spaces. Strong convergence theorems for finding a solution of the split common null point problem are derived. This iteration algorithm can accelerate the convergence speed of iterative sequence. The results of this paper improve and extend the recent results of Takahashi and Yao (Fixed Point Theory Appl 2015:87, 2015) and many others .
NASA Technical Reports Server (NTRS)
Winget, J. M.; Hughes, T. J. R.
1985-01-01
The particular problems investigated in the present study arise from nonlinear transient heat conduction. One of two types of nonlinearities considered is related to a material temperature dependence which is frequently needed to accurately model behavior over the range of temperature of engineering interest. The second nonlinearity is introduced by radiation boundary conditions. The finite element equations arising from the solution of nonlinear transient heat conduction problems are formulated. The finite element matrix equations are temporally discretized, and a nonlinear iterative solution algorithm is proposed. Algorithms for solving the linear problem are discussed, taking into account the form of the matrix equations, Gaussian elimination, cost, and iterative techniques. Attention is also given to approximate factorization, implementational aspects, and numerical results.
NASA Technical Reports Server (NTRS)
Desideri, J. A.; Steger, J. L.; Tannehill, J. C.
1978-01-01
The iterative convergence properties of an approximate-factorization implicit finite-difference algorithm are analyzed both theoretically and numerically. Modifications to the base algorithm were made to remove the inconsistency in the original implementation of artificial dissipation. In this way, the steady-state solution became independent of the time-step, and much larger time-steps can be used stably. To accelerate the iterative convergence, large time-steps and a cyclic sequence of time-steps were used. For a model transonic flow problem governed by the Euler equations, convergence was achieved with 10 times fewer time-steps using the modified differencing scheme. A particular form of instability due to variable coefficients is also analyzed.
Parallel consistent labeling algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Samal, A.; Henderson, T.
Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms. Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms. In this paper, they give parallel algorithms for solving node and arc consistency. They show that any parallel algorithm for enforcing arc consistency in the worst case must have O(na) sequential steps, where n is number of nodes, and a is the number of labels per node.more » They give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.« less
NASA Astrophysics Data System (ADS)
Elgohary, T.; Kim, D.; Turner, J.; Junkins, J.
2014-09-01
Several methods exist for integrating the motion in high order gravity fields. Some recent methods use an approximate starting orbit, and an efficient method is needed for generating warm starts that account for specific low order gravity approximations. By introducing two scalar Lagrange-like invariants and employing Leibniz product rule, the perturbed motion is integrated by a novel recursive formulation. The Lagrange-like invariants allow exact arbitrary order time derivatives. Restricting attention to the perturbations due to the zonal harmonics J2 through J6, we illustrate an idea. The recursively generated vector-valued time derivatives for the trajectory are used to develop a continuation series-based solution for propagating position and velocity. Numerical comparisons indicate performance improvements of ~ 70X over existing explicit Runge-Kutta methods while maintaining mm accuracy for the orbit predictions. The Modified Chebyshev Picard Iteration (MCPI) is an iterative path approximation method to solve nonlinear ordinary differential equations. The MCPI utilizes Picard iteration with orthogonal Chebyshev polynomial basis functions to recursively update the states. The key advantages of the MCPI are as follows: 1) Large segments of a trajectory can be approximated by evaluating the forcing function at multiple nodes along the current approximation during each iteration. 2) It can readily handle general gravity perturbations as well as non-conservative forces. 3) Parallel applications are possible. The Picard sequence converges to the solution over large time intervals when the forces are continuous and differentiable. According to the accuracy of the starting solutions, however, the MCPI may require significant number of iterations and function evaluations compared to other integrators. In this work, we provide an efficient methodology to establish good starting solutions from the continuation series method; this warm start improves the performance of the MCPI significantly and will likely be useful for other applications where efficiently computed approximate orbit solutions are needed.
NASA Astrophysics Data System (ADS)
Noble, J. H.; Lubasch, M.; Stevens, J.; Jentschura, U. D.
2017-12-01
We describe a matrix diagonalization algorithm for complex symmetric (not Hermitian) matrices, A ̲ =A̲T, which is based on a two-step algorithm involving generalized Householder reflections based on the indefinite inner product 〈 u ̲ , v ̲ 〉 ∗ =∑iuivi. This inner product is linear in both arguments and avoids complex conjugation. The complex symmetric input matrix is transformed to tridiagonal form using generalized Householder transformations (first step). An iterative, generalized QL decomposition of the tridiagonal matrix employing an implicit shift converges toward diagonal form (second step). The QL algorithm employs iterative deflation techniques when a machine-precision zero is encountered "prematurely" on the super-/sub-diagonal. The algorithm allows for a reliable and computationally efficient computation of resonance and antiresonance energies which emerge from complex-scaled Hamiltonians, and for the numerical determination of the real energy eigenvalues of pseudo-Hermitian and PT-symmetric Hamilton matrices. Numerical reference values are provided.
Bayesian block-diagonal variable selection and model averaging
Papaspiliopoulos, O.; Rossell, D.
2018-01-01
Summary We propose a scalable algorithmic framework for exact Bayesian variable selection and model averaging in linear models under the assumption that the Gram matrix is block-diagonal, and as a heuristic for exploring the model space for general designs. In block-diagonal designs our approach returns the most probable model of any given size without resorting to numerical integration. The algorithm also provides a novel and efficient solution to the frequentist best subset selection problem for block-diagonal designs. Posterior probabilities for any number of models are obtained by evaluating a single one-dimensional integral, and other quantities of interest such as variable inclusion probabilities and model-averaged regression estimates are obtained by an adaptive, deterministic one-dimensional numerical integration. The overall computational cost scales linearly with the number of blocks, which can be processed in parallel, and exponentially with the block size, rendering it most adequate in situations where predictors are organized in many moderately-sized blocks. For general designs, we approximate the Gram matrix by a block-diagonal matrix using spectral clustering and propose an iterative algorithm that capitalizes on the block-diagonal algorithms to explore efficiently the model space. All methods proposed in this paper are implemented in the R library mombf. PMID:29861501
Continuous analog of multiplicative algebraic reconstruction technique for computed tomography
NASA Astrophysics Data System (ADS)
Tateishi, Kiyoko; Yamaguchi, Yusaku; Abou Al-Ola, Omar M.; Kojima, Takeshi; Yoshinaga, Tetsuya
2016-03-01
We propose a hybrid dynamical system as a continuous analog to the block-iterative multiplicative algebraic reconstruction technique (BI-MART), which is a well-known iterative image reconstruction algorithm for computed tomography. The hybrid system is described by a switched nonlinear system with a piecewise smooth vector field or differential equation and, for consistent inverse problems, the convergence of non-negatively constrained solutions to a globally stable equilibrium is guaranteed by the Lyapunov theorem. Namely, we can prove theoretically that a weighted Kullback-Leibler divergence measure can be a common Lyapunov function for the switched system. We show that discretizing the differential equation by using the first-order approximation (Euler's method) based on the geometric multiplicative calculus leads to the same iterative formula of the BI-MART with the scaling parameter as a time-step of numerical discretization. The present paper is the first to reveal that a kind of iterative image reconstruction algorithm is constructed by the discretization of a continuous-time dynamical system for solving tomographic inverse problems. Iterative algorithms with not only the Euler method but also the Runge-Kutta methods of lower-orders applied for discretizing the continuous-time system can be used for image reconstruction. A numerical example showing the characteristics of the discretized iterative methods is presented.
Analysis and Design of ITER 1 MV Core Snubber
NASA Astrophysics Data System (ADS)
Wang, Haitian; Li, Ge
2012-11-01
The core snubber, as a passive protection device, can suppress arc current and absorb stored energy in stray capacitance during the electrical breakdown in accelerating electrodes of ITER NBI. In order to design the core snubber of ITER, the control parameters of the arc peak current have been firstly analyzed by the Fink-Baker-Owren (FBO) method, which are used for designing the DIIID 100 kV snubber. The B-H curve can be derived from the measured voltage and current waveforms, and the hysteresis loss of the core snubber can be derived using the revised parallelogram method. The core snubber can be a simplified representation as an equivalent parallel resistance and inductance, which has been neglected by the FBO method. A simulation code including the parallel equivalent resistance and inductance has been set up. The simulation and experiments result in dramatically large arc shorting currents due to the parallel inductance effect. The case shows that the core snubber utilizing the FBO method gives more compact design.
Parallel CE/SE Computations via Domain Decomposition
NASA Technical Reports Server (NTRS)
Himansu, Ananda; Jorgenson, Philip C. E.; Wang, Xiao-Yen; Chang, Sin-Chung
2000-01-01
This paper describes the parallelization strategy and achieved parallel efficiency of an explicit time-marching algorithm for solving conservation laws. The Space-Time Conservation Element and Solution Element (CE/SE) algorithm for solving the 2D and 3D Euler equations is parallelized with the aid of domain decomposition. The parallel efficiency of the resultant algorithm on a Silicon Graphics Origin 2000 parallel computer is checked.
Algorithms for the optimization of RBE-weighted dose in particle therapy.
Horcicka, M; Meyer, C; Buschbacher, A; Durante, M; Krämer, M
2013-01-21
We report on various algorithms used for the nonlinear optimization of RBE-weighted dose in particle therapy. Concerning the dose calculation carbon ions are considered and biological effects are calculated by the Local Effect Model. Taking biological effects fully into account requires iterative methods to solve the optimization problem. We implemented several additional algorithms into GSI's treatment planning system TRiP98, like the BFGS-algorithm and the method of conjugated gradients, in order to investigate their computational performance. We modified textbook iteration procedures to improve the convergence speed. The performance of the algorithms is presented by convergence in terms of iterations and computation time. We found that the Fletcher-Reeves variant of the method of conjugated gradients is the algorithm with the best computational performance. With this algorithm we could speed up computation times by a factor of 4 compared to the method of steepest descent, which was used before. With our new methods it is possible to optimize complex treatment plans in a few minutes leading to good dose distributions. At the end we discuss future goals concerning dose optimization issues in particle therapy which might benefit from fast optimization solvers.
Algorithms for the optimization of RBE-weighted dose in particle therapy
NASA Astrophysics Data System (ADS)
Horcicka, M.; Meyer, C.; Buschbacher, A.; Durante, M.; Krämer, M.
2013-01-01
We report on various algorithms used for the nonlinear optimization of RBE-weighted dose in particle therapy. Concerning the dose calculation carbon ions are considered and biological effects are calculated by the Local Effect Model. Taking biological effects fully into account requires iterative methods to solve the optimization problem. We implemented several additional algorithms into GSI's treatment planning system TRiP98, like the BFGS-algorithm and the method of conjugated gradients, in order to investigate their computational performance. We modified textbook iteration procedures to improve the convergence speed. The performance of the algorithms is presented by convergence in terms of iterations and computation time. We found that the Fletcher-Reeves variant of the method of conjugated gradients is the algorithm with the best computational performance. With this algorithm we could speed up computation times by a factor of 4 compared to the method of steepest descent, which was used before. With our new methods it is possible to optimize complex treatment plans in a few minutes leading to good dose distributions. At the end we discuss future goals concerning dose optimization issues in particle therapy which might benefit from fast optimization solvers.
Seismic waveform tomography with shot-encoding using a restarted L-BFGS algorithm.
Rao, Ying; Wang, Yanghua
2017-08-17
In seismic waveform tomography, or full-waveform inversion (FWI), one effective strategy used to reduce the computational cost is shot-encoding, which encodes all shots randomly and sums them into one super shot to significantly reduce the number of wavefield simulations in the inversion. However, this process will induce instability in the iterative inversion regardless of whether it uses a robust limited-memory BFGS (L-BFGS) algorithm. The restarted L-BFGS algorithm proposed here is both stable and efficient. This breakthrough ensures, for the first time, the applicability of advanced FWI methods to three-dimensional seismic field data. In a standard L-BFGS algorithm, if the shot-encoding remains unchanged, it will generate a crosstalk effect between different shots. This crosstalk effect can only be suppressed by employing sufficient randomness in the shot-encoding. Therefore, the implementation of the L-BFGS algorithm is restarted at every segment. Each segment consists of a number of iterations; the first few iterations use an invariant encoding, while the remainder use random re-coding. This restarted L-BFGS algorithm balances the computational efficiency of shot-encoding, the convergence stability of the L-BFGS algorithm, and the inversion quality characteristic of random encoding in FWI.
Compressed sensing with gradient total variation for low-dose CBCT reconstruction
NASA Astrophysics Data System (ADS)
Seo, Chang-Woo; Cha, Bo Kyung; Jeon, Seongchae; Huh, Young; Park, Justin C.; Lee, Byeonghun; Baek, Junghee; Kim, Eunyoung
2015-06-01
This paper describes the improvement of convergence speed with gradient total variation (GTV) in compressed sensing (CS) for low-dose cone-beam computed tomography (CBCT) reconstruction. We derive a fast algorithm for the constrained total variation (TV)-based a minimum number of noisy projections. To achieve this task we combine the GTV with a TV-norm regularization term to promote an accelerated sparsity in the X-ray attenuation characteristics of the human body. The GTV is derived from a TV and enforces more efficient computationally and faster in convergence until a desired solution is achieved. The numerical algorithm is simple and derives relatively fast convergence. We apply a gradient projection algorithm that seeks a solution iteratively in the direction of the projected gradient while enforcing a non-negatively of the found solution. In comparison with the Feldkamp, Davis, and Kress (FDK) and conventional TV algorithms, the proposed GTV algorithm showed convergence in ≤18 iterations, whereas the original TV algorithm needs at least 34 iterations in reducing 50% of the projections compared with the FDK algorithm in order to reconstruct the chest phantom images. Future investigation includes improving imaging quality, particularly regarding X-ray cone-beam scatter, and motion artifacts of CBCT reconstruction.
Glenn-ht/bem Conjugate Heat Transfer Solver for Large-scale Turbomachinery Models
NASA Technical Reports Server (NTRS)
Divo, E.; Steinthorsson, E.; Rodriquez, F.; Kassab, A. J.; Kapat, J. S.; Heidmann, James D. (Technical Monitor)
2003-01-01
A coupled Boundary Element/Finite Volume Method temperature-forward/flux-hack algorithm is developed for conjugate heat transfer (CHT) applications. A loosely coupled strategy is adopted with each field solution providing boundary conditions for the other in an iteration seeking continuity of temperature and heat flux at the fluid-solid interface. The NASA Glenn Navier-Stokes code Glenn-HT is coupled to a 3-D BEM steady state heat conduction code developed at the University of Central Florida. Results from CHT simulation of a 3-D film-cooled blade section are presented and compared with those computed by a two-temperature approach. Also presented are current developments of an iterative domain decomposition strategy accommodating large numbers of unknowns in the BEM. The blade is artificially sub-sectioned in the span-wise direction, 3-D BEM solutions are obtained in the subdomains, and interface temperatures are averaged symmetrically when the flux is updated while the fluxes are averaged anti-symmetrically to maintain continuity of heat flux when the temperatures are updated. An initial guess for interface temperatures uses a physically-based 1-D conduction argument to provide an effective starting point and significantly reduce iteration. 2-D and 3-D results show the process converges efficiently and offers substantial computational and storage savings. Future developments include a parallel multi-grid implementation of the approach under MPI for computation on PC clusters.
Towards a Better Distributed Framework for Learning Big Data
2017-06-14
UNLIMITED: PB Public Release 13. SUPPLEMENTARY NOTES 14. ABSTRACT This work aimed at solving issues in distributed machine learning. The PI’s team proposed...communication load. Finally, the team proposed the parallel least-squares policy iteration (parallel LSPI) to parallelize a reinforcement policy learning. 15
Improved pressure-velocity coupling algorithm based on minimization of global residual norm
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chatwani, A.U.; Turan, A.
1991-01-01
In this paper an improved pressure velocity coupling algorithm is proposed based on the minimization of the global residual norm. The procedure is applied to SIMPLE and SIMPLEC algorithms to automatically select the pressure underrelaxation factor to minimize the global residual norm at each iteration level. Test computations for three-dimensional turbulent, isothermal flow is a toroidal vortex combustor indicate that velocity underrelaxation factors as high as 0.7 can be used to obtain a converged solution in 300 iterations.
An Evaluation of an Algorithm for Linear Inequalities and Its Applications
NASA Technical Reports Server (NTRS)
Jurgensen, J.
1973-01-01
An algorithm is presented for obtaining a solution alpha to a set of inequalities (A alpha) 0 where A is an N x m-matrix and alpha is an m-vector. If the set of inequalities is consistant, then the algorithm is guaranteed to arrive at a solution in a finite number of steps. Also, if in the iteration, a negative vector is obtained, then the initial set of inequalities is inconsistant, and the iteration is terminated.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Demmel, James W.
This project addresses both communication-avoiding algorithms, and reproducible floating-point computation. Communication, i.e. moving data, either between levels of memory or processors over a network, is much more expensive per operation than arithmetic (measured in time or energy), so we seek algorithms that greatly reduce communication. We developed many new algorithms for both dense and sparse, and both direct and iterative linear algebra, attaining new communication lower bounds, and getting large speedups in many cases. We also extended this work in several ways: (1) We minimize writes separately from reads, since writes may be much more expensive than reads on emergingmore » memory technologies, like Flash, sometimes doing asymptotically fewer writes than reads. (2) We extend the lower bounds and optimal algorithms to arbitrary algorithms that may be expressed as perfectly nested loops accessing arrays, where the array subscripts may be arbitrary affine functions of the loop indices (eg A(i), B(i,j+k, k+3*m-7, …) etc.). (3) We extend our communication-avoiding approach to some machine learning algorithms, such as support vector machines. This work has won a number of awards. We also address reproducible floating-point computation. We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point addition, makes attaining reproducibility a challenge even for simple operations like summing a vector of numbers, or more complicated operations like the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible sum of floating point numbers, independent of the order of summation. The algorithm depends only on a subset of the IEEE Floating Point Standard 754-2008, uses just 6 words to represent a “reproducible accumulator,” and requires just one read-only pass over the data, or one reduction in parallel. New instructions based on this work are being considered for inclusion in the future IEEE 754-2018 floating-point standard, and new reproducible BLAS are being considered for the next version of the BLAS standard.« less
NASA Astrophysics Data System (ADS)
Verstraete, Hans R. G. W.; Heisler, Morgan; Ju, Myeong Jin; Wahl, Daniel J.; Bliek, Laurens; Kalkman, Jeroen; Bonora, Stefano; Sarunic, Marinko V.; Verhaegen, Michel; Jian, Yifan
2017-02-01
Optical Coherence Tomography (OCT) has revolutionized modern ophthalmology, providing depth resolved images of the retinal layers in a system that is suited to a clinical environment. A limitation of the performance and utilization of the OCT systems has been the lateral resolution. Through the combination of wavefront sensorless adaptive optics with dual variable optical elements, we present a compact lens based OCT system that is capable of imaging the photoreceptor mosaic. We utilized a commercially available variable focal length lens to correct for a wide range of defocus commonly found in patient eyes, and a multi-actuator adaptive lens after linearization of the hysteresis in the piezoelectric actuators for aberration correction to obtain near diffraction limited imaging at the retina. A parallel processing computational platform permitted real-time image acquisition and display. The Data-based Online Nonlinear Extremum seeker (DONE) algorithm was used for real time optimization of the wavefront sensorless adaptive optics OCT, and the performance was compared with a coordinate search algorithm. Cross sectional images of the retinal layers and en face images of the cone photoreceptor mosaic acquired in vivo from research volunteers before and after WSAO optimization are presented. Applying the DONE algorithm in vivo for wavefront sensorless AO-OCT demonstrates that the DONE algorithm succeeds in drastically improving the signal while achieving a computational time of 1 ms per iteration, making it applicable for high speed real time applications.
An algorithm for the kinetics of tire pyrolysis under different heating rates.
Quek, Augustine; Balasubramanian, Rajashekhar
2009-07-15
Tires exhibit different kinetic behaviors when pyrolyzed under different heating rates. A new algorithm has been developed to investigate pyrolysis behavior of scrap tires. The algorithm includes heat and mass transfer equations to account for the different extents of thermal lag as the tire is heated at different heating rates. The algorithm uses an iterative approach to fit model equations to experimental data to obtain quantitative values of kinetic parameters. These parameters describe the pyrolysis process well, with good agreement (r(2)>0.96) between the model and experimental data when the model is applied to three different brands of automobile tires heated under five different heating rates in a pure nitrogen atmosphere. The model agrees with other researchers' results that frequencies factors increased and time constants decreased with increasing heating rates. The model also shows the change in the behavior of individual tire components when the heating rates are increased above 30 K min(-1). This result indicates that heating rates, rather than temperature, can significantly affect pyrolysis reactions. This algorithm is simple in structure and yet accurate in describing tire pyrolysis under a wide range of heating rates (10-50 K min(-1)). It improves our understanding of the tire pyrolysis process by showing the relationship between the heating rate and the many components in a tire that depolymerize as parallel reactions.
A depth-first search algorithm to compute elementary flux modes by linear programming
2014-01-01
Background The decomposition of complex metabolic networks into elementary flux modes (EFMs) provides a useful framework for exploring reaction interactions systematically. Generating a complete set of EFMs for large-scale models, however, is near impossible. Even for moderately-sized models (<400 reactions), existing approaches based on the Double Description method must iterate through a large number of combinatorial candidates, thus imposing an immense processor and memory demand. Results Based on an alternative elementarity test, we developed a depth-first search algorithm using linear programming (LP) to enumerate EFMs in an exhaustive fashion. Constraints can be introduced to directly generate a subset of EFMs satisfying the set of constraints. The depth-first search algorithm has a constant memory overhead. Using flux constraints, a large LP problem can be massively divided and parallelized into independent sub-jobs for deployment into computing clusters. Since the sub-jobs do not overlap, the approach scales to utilize all available computing nodes with minimal coordination overhead or memory limitations. Conclusions The speed of the algorithm was comparable to efmtool, a mainstream Double Description method, when enumerating all EFMs; the attrition power gained from performing flux feasibility tests offsets the increased computational demand of running an LP solver. Unlike the Double Description method, the algorithm enables accelerated enumeration of all EFMs satisfying a set of constraints. PMID:25074068
Choi, Kihwan; Li, Ruijiang; Nam, Haewon; Xing, Lei
2014-06-21
As a solution to iterative CT image reconstruction, first-order methods are prominent for the large-scale capability and the fast convergence rate [Formula: see text]. In practice, the CT system matrix with a large condition number may lead to slow convergence speed despite the theoretically promising upper bound. The aim of this study is to develop a Fourier-based scaling technique to enhance the convergence speed of first-order methods applied to CT image reconstruction. Instead of working in the projection domain, we transform the projection data and construct a data fidelity model in Fourier space. Inspired by the filtered backprojection formalism, the data are appropriately weighted in Fourier space. We formulate an optimization problem based on weighted least-squares in the Fourier space and total-variation (TV) regularization in image space for parallel-beam, fan-beam and cone-beam CT geometry. To achieve the maximum computational speed, the optimization problem is solved using a fast iterative shrinkage-thresholding algorithm with backtracking line search and GPU implementation of projection/backprojection. The performance of the proposed algorithm is demonstrated through a series of digital simulation and experimental phantom studies. The results are compared with the existing TV regularized techniques based on statistics-based weighted least-squares as well as basic algebraic reconstruction technique. The proposed Fourier-based compressed sensing (CS) method significantly improves both the image quality and the convergence rate compared to the existing CS techniques.
Parallel computing techniques for rotorcraft aerodynamics
NASA Astrophysics Data System (ADS)
Ekici, Kivanc
The modification of unsteady three-dimensional Navier-Stokes codes for application on massively parallel and distributed computing environments is investigated. The Euler/Navier-Stokes code TURNS (Transonic Unsteady Rotor Navier-Stokes) was chosen as a test bed because of its wide use by universities and industry. For the efficient implementation of TURNS on parallel computing systems, two algorithmic changes are developed. First, main modifications to the implicit operator, Lower-Upper Symmetric Gauss Seidel (LU-SGS) originally used in TURNS, is performed. Second, application of an inexact Newton method, coupled with a Krylov subspace iterative method (Newton-Krylov method) is carried out. Both techniques have been tried previously for the Euler equations mode of the code. In this work, we have extended the methods to the Navier-Stokes mode. Several new implicit operators were tried because of convergence problems of traditional operators with the high cell aspect ratio (CAR) grids needed for viscous calculations on structured grids. Promising results for both Euler and Navier-Stokes cases are presented for these operators. For the efficient implementation of Newton-Krylov methods to the Navier-Stokes mode of TURNS, efficient preconditioners must be used. The parallel implicit operators used in the previous step are employed as preconditioners and the results are compared. The Message Passing Interface (MPI) protocol has been used because of its portability to various parallel architectures. It should be noted that the proposed methodology is general and can be applied to several other CFD codes (e.g. OVERFLOW).
Parallel/Vector Integration Methods for Dynamical Astronomy
NASA Astrophysics Data System (ADS)
Fukushima, Toshio
1999-01-01
This paper reviews three recent works on the numerical methods to integrate ordinary differential equations (ODE), which are specially designed for parallel, vector, and/or multi-processor-unit(PU) computers. The first is the Picard-Chebyshev method (Fukushima, 1997a). It obtains a global solution of ODE in the form of Chebyshev polynomial of large (> 1000) degree by applying the Picard iteration repeatedly. The iteration converges for smooth problems and/or perturbed dynamics. The method runs around 100-1000 times faster in the vector mode than in the scalar mode of a certain computer with vector processors (Fukushima, 1997b). The second is a parallelization of a symplectic integrator (Saha et al., 1997). It regards the implicit midpoint rules covering thousands of timesteps as large-scale nonlinear equations and solves them by the fixed-point iteration. The method is applicable to Hamiltonian systems and is expected to lead an acceleration factor of around 50 in parallel computers with more than 1000 PUs. The last is a parallelization of the extrapolation method (Ito and Fukushima, 1997). It performs trial integrations in parallel. Also the trial integrations are further accelerated by balancing computational load among PUs by the technique of folding. The method is all-purpose and achieves an acceleration factor of around 3.5 by using several PUs. Finally, we give a perspective on the parallelization of some implicit integrators which require multiple corrections in solving implicit formulas like the implicit Hermitian integrators (Makino and Aarseth, 1992), (Hut et al., 1995) or the implicit symmetric multistep methods (Fukushima, 1998), (Fukushima, 1999).
Volumetric quantification of lung nodules in CT with iterative reconstruction (ASiR and MBIR).
Chen, Baiyu; Barnhart, Huiman; Richard, Samuel; Robins, Marthony; Colsher, James; Samei, Ehsan
2013-11-01
Volume quantifications of lung nodules with multidetector computed tomography (CT) images provide useful information for monitoring nodule developments. The accuracy and precision of the volume quantification, however, can be impacted by imaging and reconstruction parameters. This study aimed to investigate the impact of iterative reconstruction algorithms on the accuracy and precision of volume quantification with dose and slice thickness as additional variables. Repeated CT images were acquired from an anthropomorphic chest phantom with synthetic nodules (9.5 and 4.8 mm) at six dose levels, and reconstructed with three reconstruction algorithms [filtered backprojection (FBP), adaptive statistical iterative reconstruction (ASiR), and model based iterative reconstruction (MBIR)] into three slice thicknesses. The nodule volumes were measured with two clinical software (A: Lung VCAR, B: iNtuition), and analyzed for accuracy and precision. Precision was found to be generally comparable between FBP and iterative reconstruction with no statistically significant difference noted for different dose levels, slice thickness, and segmentation software. Accuracy was found to be more variable. For large nodules, the accuracy was significantly different between ASiR and FBP for all slice thicknesses with both software, and significantly different between MBIR and FBP for 0.625 mm slice thickness with Software A and for all slice thicknesses with Software B. For small nodules, the accuracy was more similar between FBP and iterative reconstruction, with the exception of ASIR vs FBP at 1.25 mm with Software A and MBIR vs FBP at 0.625 mm with Software A. The systematic difference between the accuracy of FBP and iterative reconstructions highlights the importance of extending current segmentation software to accommodate the image characteristics of iterative reconstructions. In addition, a calibration process may help reduce the dependency of accuracy on reconstruction algorithms, such that volumes quantified from scans of different reconstruction algorithms can be compared. The little difference found between the precision of FBP and iterative reconstructions could be a result of both iterative reconstruction's diminished noise reduction at the edge of the nodules as well as the loss of resolution at high noise levels with iterative reconstruction. The findings do not rule out potential advantage of IR that might be evident in a study that uses a larger number of nodules or repeated scans.
NASA Astrophysics Data System (ADS)
Lalush, D. S.; Tsui, B. M. W.
1998-06-01
We study the statistical convergence properties of two fast iterative reconstruction algorithms, the rescaled block-iterative (RBI) and ordered subset (OS) EM algorithms, in the context of cardiac SPECT with 3D detector response modeling. The Monte Carlo method was used to generate nearly noise-free projection data modeling the effects of attenuation, detector response, and scatter from the MCAT phantom. One thousand noise realizations were generated with an average count level approximating a typical T1-201 cardiac study. Each noise realization was reconstructed using the RBI and OS algorithms for cases with and without detector response modeling. For each iteration up to twenty, we generated mean and variance images, as well as covariance images for six specific locations. Both OS and RBI converged in the mean to results that were close to the noise-free ML-EM result using the same projection model. When detector response was not modeled in the reconstruction, RBI exhibited considerably lower noise variance than OS for the same resolution. When 3D detector response was modeled, the RBI-EM provided a small improvement in the tradeoff between noise level and resolution recovery, primarily in the axial direction, while OS required about half the number of iterations of RBI to reach the same resolution. We conclude that OS is faster than RBI, but may be sensitive to errors in the projection model. Both OS-EM and RBI-EM are effective alternatives to the EVIL-EM algorithm, but noise level and speed of convergence depend on the projection model used.
Parallel block schemes for large scale least squares computations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Golub, G.H.; Plemmons, R.J.; Sameh, A.
1986-04-01
Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment ofmore » the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.« less
Fast segmentation of satellite images using SLIC, WebGL and Google Earth Engine
NASA Astrophysics Data System (ADS)
Donchyts, Gennadii; Baart, Fedor; Gorelick, Noel; Eisemann, Elmar; van de Giesen, Nick
2017-04-01
Google Earth Engine (GEE) is a parallel geospatial processing platform, which harmonizes access to petabytes of freely available satellite images. It provides a very rich API, allowing development of dedicated algorithms to extract useful geospatial information from these images. At the same time, modern GPUs provide thousands of computing cores, which are mostly not utilized in this context. In the last years, WebGL became a popular and well-supported API, allowing fast image processing directly in web browsers. In this work, we will evaluate the applicability of WebGL to enable fast segmentation of satellite images. A new implementation of a Simple Linear Iterative Clustering (SLIC) algorithm using GPU shaders will be presented. SLIC is a simple and efficient method to decompose an image in visually homogeneous regions. It adapts a k-means clustering approach to generate superpixels efficiently. While this approach will be hard to scale, due to a significant amount of data to be transferred to the client, it should significantly improve exploratory possibilities and simplify development of dedicated algorithms for geoscience applications. Our prototype implementation will be used to improve surface water detection of the reservoirs using multispectral satellite imagery.
NASA Astrophysics Data System (ADS)
Kong, Fande; Cai, Xiao-Chuan
2017-07-01
Nonlinear fluid-structure interaction (FSI) problems on unstructured meshes in 3D appear in many applications in science and engineering, such as vibration analysis of aircrafts and patient-specific diagnosis of cardiovascular diseases. In this work, we develop a highly scalable, parallel algorithmic and software framework for FSI problems consisting of a nonlinear fluid system and a nonlinear solid system, that are coupled monolithically. The FSI system is discretized by a stabilized finite element method in space and a fully implicit backward difference scheme in time. To solve the large, sparse system of nonlinear algebraic equations at each time step, we propose an inexact Newton-Krylov method together with a multilevel, smoothed Schwarz preconditioner with isogeometric coarse meshes generated by a geometry preserving coarsening algorithm. Here "geometry" includes the boundary of the computational domain and the wet interface between the fluid and the solid. We show numerically that the proposed algorithm and implementation are highly scalable in terms of the number of linear and nonlinear iterations and the total compute time on a supercomputer with more than 10,000 processor cores for several problems with hundreds of millions of unknowns.
NASA Astrophysics Data System (ADS)
Grayver, Alexander V.; Kuvshinov, Alexey V.
2016-05-01
This paper presents a methodology to sample equivalence domain (ED) in nonlinear partial differential equation (PDE)-constrained inverse problems. For this purpose, we first applied state-of-the-art stochastic optimization algorithm called Covariance Matrix Adaptation Evolution Strategy (CMAES) to identify low-misfit regions of the model space. These regions were then randomly sampled to create an ensemble of equivalent models and quantify uncertainty. CMAES is aimed at exploring model space globally and is robust on very ill-conditioned problems. We show that the number of iterations required to converge grows at a moderate rate with respect to number of unknowns and the algorithm is embarrassingly parallel. We formulated the problem by using the generalized Gaussian distribution. This enabled us to seamlessly use arbitrary norms for residual and regularization terms. We show that various regularization norms facilitate studying different classes of equivalent solutions. We further show how performance of the standard Metropolis-Hastings Markov chain Monte Carlo algorithm can be substantially improved by using information CMAES provides. This methodology was tested by using individual and joint inversions of magneotelluric, controlled-source electromagnetic (EM) and global EM induction data.
Kong, Fande; Cai, Xiao-Chuan
2017-03-24
Nonlinear fluid-structure interaction (FSI) problems on unstructured meshes in 3D appear many applications in science and engineering, such as vibration analysis of aircrafts and patient-specific diagnosis of cardiovascular diseases. In this work, we develop a highly scalable, parallel algorithmic and software framework for FSI problems consisting of a nonlinear fluid system and a nonlinear solid system, that are coupled monolithically. The FSI system is discretized by a stabilized finite element method in space and a fully implicit backward difference scheme in time. To solve the large, sparse system of nonlinear algebraic equations at each time step, we propose an inexactmore » Newton-Krylov method together with a multilevel, smoothed Schwarz preconditioner with isogeometric coarse meshes generated by a geometry preserving coarsening algorithm. Here ''geometry'' includes the boundary of the computational domain and the wet interface between the fluid and the solid. We show numerically that the proposed algorithm and implementation are highly scalable in terms of the number of linear and nonlinear iterations and the total compute time on a supercomputer with more than 10,000 processor cores for several problems with hundreds of millions of unknowns.« less
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
NASA Astrophysics Data System (ADS)
Leclerc, Arnaud; Thomas, Phillip S.; Carrington, Tucker
2017-08-01
Vibrational spectra and wavefunctions of polyatomic molecules can be calculated at low memory cost using low-rank sum-of-product (SOP) decompositions to represent basis functions generated using an iterative eigensolver. Using a SOP tensor format does not determine the iterative eigensolver. The choice of the interative eigensolver is limited by the need to restrict the rank of the SOP basis functions at every stage of the calculation. We have adapted, implemented and compared different reduced-rank algorithms based on standard iterative methods (block-Davidson algorithm, Chebyshev iteration) to calculate vibrational energy levels and wavefunctions of the 12-dimensional acetonitrile molecule. The effect of using low-rank SOP basis functions on the different methods is analysed and the numerical results are compared with those obtained with the reduced rank block power method. Relative merits of the different algorithms are presented, showing that the advantage of using a more sophisticated method, although mitigated by the use of reduced-rank SOP functions, is noticeable in terms of CPU time.
Iterative Nonlinear Tikhonov Algorithm with Constraints for Electromagnetic Tomography
NASA Technical Reports Server (NTRS)
Xu, Feng; Deshpande, Manohar
2012-01-01
Low frequency electromagnetic tomography such as the capacitance tomography (ECT) has been proposed for monitoring and mass-gauging of gas-liquid two-phase system under microgravity condition in NASA's future long-term space missions. Due to the ill-posed inverse problem of ECT, images reconstructed using conventional linear algorithms often suffer from limitations such as low resolution and blurred edges. Hence, new efficient high resolution nonlinear imaging algorithms are needed for accurate two-phase imaging. The proposed Iterative Nonlinear Tikhonov Regularized Algorithm with Constraints (INTAC) is based on an efficient finite element method (FEM) forward model of quasi-static electromagnetic problem. It iteratively minimizes the discrepancy between FEM simulated and actual measured capacitances by adjusting the reconstructed image using the Tikhonov regularized method. More importantly, it enforces the known permittivity of two phases to the unknown pixels which exceed the reasonable range of permittivity in each iteration. This strategy does not only stabilize the converging process, but also produces sharper images. Simulations show that resolution improvement of over 2 times can be achieved by INTAC with respect to conventional approaches. Strategies to further improve spatial imaging resolution are suggested, as well as techniques to accelerate nonlinear forward model and thus increase the temporal resolution.
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
NASA Technical Reports Server (NTRS)
Sun, Xian-He
1997-01-01
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
Q-Learning-Based Adjustable Fixed-Phase Quantum Grover Search Algorithm
NASA Astrophysics Data System (ADS)
Guo, Ying; Shi, Wensha; Wang, Yijun; Hu, Jiankun
2017-02-01
We demonstrate that the rotation phase can be suitably chosen to increase the efficiency of the phase-based quantum search algorithm, leading to a dynamic balance between iterations and success probabilities of the fixed-phase quantum Grover search algorithm with Q-learning for a given number of solutions. In this search algorithm, the proposed Q-learning algorithm, which is a model-free reinforcement learning strategy in essence, is used for performing a matching algorithm based on the fraction of marked items λ and the rotation phase α. After establishing the policy function α = π(λ), we complete the fixed-phase Grover algorithm, where the phase parameter is selected via the learned policy. Simulation results show that the Q-learning-based Grover search algorithm (QLGA) enables fewer iterations and gives birth to higher success probabilities. Compared with the conventional Grover algorithms, it avoids the optimal local situations, thereby enabling success probabilities to approach one.
Zheng, Jingjing; Frisch, Michael J
2017-12-12
An efficient geometry optimization algorithm based on interpolated potential energy surfaces with iteratively updated Hessians is presented in this work. At each step of geometry optimization (including both minimization and transition structure search), an interpolated potential energy surface is properly constructed by using the previously calculated information (energies, gradients, and Hessians/updated Hessians), and Hessians of the two latest geometries are updated in an iterative manner. The optimized minimum or transition structure on the interpolated surface is used for the starting geometry of the next geometry optimization step. The cost of searching the minimum or transition structure on the interpolated surface and iteratively updating Hessians is usually negligible compared with most electronic structure single gradient calculations. These interpolated potential energy surfaces are often better representations of the true potential energy surface in a broader range than a local quadratic approximation that is usually used in most geometry optimization algorithms. Tests on a series of large and floppy molecules and transition structures both in gas phase and in solutions show that the new algorithm can significantly improve the optimization efficiency by using the iteratively updated Hessians and optimizations on interpolated surfaces.
NASA Technical Reports Server (NTRS)
Korkin, S.; Lyapustin, A.
2012-01-01
The Levenberg-Marquardt algorithm [1, 2] provides a numerical iterative solution to the problem of minimization of a function over a space of its parameters. In our work, the Levenberg-Marquardt algorithm retrieves optical parameters of a thin (single scattering) plane parallel atmosphere irradiated by collimated infinitely wide monochromatic beam of light. Black ground surface is assumed. Computational accuracy, sensitivity to the initial guess and the presence of noise in the signal, and other properties of the algorithm are investigated in scalar (using intensity only) and vector (including polarization) modes. We consider an atmosphere that contains a mixture of coarse and fine fractions. Following [3], the fractions are simulated using Henyey-Greenstein model. Though not realistic, this assumption is very convenient for tests [4, p.354]. In our case it yields analytical evaluation of Jacobian matrix. Assuming the MISR geometry of observation [5] as an example, the average scattering cosines and the ratio of coarse and fine fractions, the atmosphere optical depth, and the single scattering albedo, are the five parameters to be determined numerically. In our implementation of the algorithm, the system of five linear equations is solved using the fast Cramer s rule [6]. A simple subroutine developed by the authors, makes the algorithm independent from external libraries. All Fortran 90/95 codes discussed in the presentation will be available immediately after the meeting from sergey.v.korkin@nasa.gov by request.
New matrix bounds and iterative algorithms for the discrete coupled algebraic Riccati equation
NASA Astrophysics Data System (ADS)
Liu, Jianzhou; Wang, Li; Zhang, Juan
2017-11-01
The discrete coupled algebraic Riccati equation (DCARE) has wide applications in control theory and linear system. In general, for the DCARE, one discusses every term of the coupled term, respectively. In this paper, we consider the coupled term as a whole, which is different from the recent results. When applying eigenvalue inequalities to discuss the coupled term, our method has less error. In terms of the properties of special matrices and eigenvalue inequalities, we propose several upper and lower matrix bounds for the solution of DCARE. Further, we discuss the iterative algorithms for the solution of the DCARE. In the fixed point iterative algorithms, the scope of Lipschitz factor is wider than the recent results. Finally, we offer corresponding numerical examples to illustrate the effectiveness of the derived results.
A fast reconstruction algorithm for fluorescence optical diffusion tomography based on preiteration.
Song, Xiaolei; Xiong, Xiaoyun; Bai, Jing
2007-01-01
Fluorescence optical diffusion tomography in the near-infrared (NIR) bandwidth is considered to be one of the most promising ways for noninvasive molecular-based imaging. Many reconstructive approaches to it utilize iterative methods for data inversion. However, they are time-consuming and they are far from meeting the real-time imaging demands. In this work, a fast preiteration algorithm based on the generalized inverse matrix is proposed. This method needs only one step of matrix-vector multiplication online, by pushing the iteration process to be executed offline. In the preiteration process, the second-order iterative format is employed to exponentially accelerate the convergence. Simulations based on an analytical diffusion model show that the distribution of fluorescent yield can be well estimated by this algorithm and the reconstructed speed is remarkably increased.
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Hongxing; Fang, Hengrui; Miller, Mitchell D.
2016-07-15
An iterative transform algorithm is proposed to improve the conventional molecular-replacement method for solving the phase problem in X-ray crystallography. Several examples of successful trial calculations carried out with real diffraction data are presented. An iterative transform method proposed previously for direct phasing of high-solvent-content protein crystals is employed for enhancing the molecular-replacement (MR) algorithm in protein crystallography. Target structures that are resistant to conventional MR due to insufficient similarity between the template and target structures might be tractable with this modified phasing method. Trial calculations involving three different structures are described to test and illustrate the methodology. The relationshipmore » of the approach to PHENIX Phaser-MR and MR-Rosetta is discussed.« less
Parallel Algorithms for Least Squares and Related Computations.
1991-03-22
for dense computations in linear algebra . The work has recently been published in a general reference book on parallel algorithms by SIAM. AFO SR...written his Ph.D. dissertation with the principal investigator. (See publication 6.) • Parallel Algorithms for Dense Linear Algebra Computations. Our...and describe and to put into perspective a selection of the more important parallel algorithms for numerical linear algebra . We give a major new
Genetic algorithms using SISAL parallel programming language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tejada, S.
1994-05-06
Genetic algorithms are a mathematical optimization technique developed by John Holland at the University of Michigan [1]. The SISAL programming language possesses many of the characteristics desired to implement genetic algorithms. SISAL is a deterministic, functional programming language which is inherently parallel. Because SISAL is functional and based on mathematical concepts, genetic algorithms can be efficiently translated into the language. Several of the steps involved in genetic algorithms, such as mutation, crossover, and fitness evaluation, can be parallelized using SISAL. In this paper I will l discuss the implementation and performance of parallel genetic algorithms in SISAL.
A Rapid Convergent Low Complexity Interference Alignment Algorithm for Wireless Sensor Networks.
Jiang, Lihui; Wu, Zhilu; Ren, Guanghui; Wang, Gangyi; Zhao, Nan
2015-07-29
Interference alignment (IA) is a novel technique that can effectively eliminate the interference and approach the sum capacity of wireless sensor networks (WSNs) when the signal-to-noise ratio (SNR) is high, by casting the desired signal and interference into different signal subspaces. The traditional alternating minimization interference leakage (AMIL) algorithm for IA shows good performance in high SNR regimes, however, the complexity of the AMIL algorithm increases dramatically as the number of users and antennas increases, posing limits to its applications in the practical systems. In this paper, a novel IA algorithm, called directional quartic optimal (DQO) algorithm, is proposed to minimize the interference leakage with rapid convergence and low complexity. The properties of the AMIL algorithm are investigated, and it is discovered that the difference between the two consecutive iteration results of the AMIL algorithm will approximately point to the convergence solution when the precoding and decoding matrices obtained from the intermediate iterations are sufficiently close to their convergence values. Based on this important property, the proposed DQO algorithm employs the line search procedure so that it can converge to the destination directly. In addition, the optimal step size can be determined analytically by optimizing a quartic function. Numerical results show that the proposed DQO algorithm can suppress the interference leakage more rapidly than the traditional AMIL algorithm, and can achieve the same level of sum rate as that of AMIL algorithm with far less iterations and execution time.
On the suitability of the connection machine for direct particle simulation
NASA Technical Reports Server (NTRS)
Dagum, Leonard
1990-01-01
The algorithmic structure was examined of the vectorizable Stanford particle simulation (SPS) method and the structure is reformulated in data parallel form. Some of the SPS algorithms can be directly translated to data parallel, but several of the vectorizable algorithms have no direct data parallel equivalent. This requires the development of new, strictly data parallel algorithms. In particular, a new sorting algorithm is developed to identify collision candidates in the simulation and a master/slave algorithm is developed to minimize communication cost in large table look up. Validation of the method is undertaken through test calculations for thermal relaxation of a gas, shock wave profiles, and shock reflection from a stationary wall. A qualitative measure is provided of the performance of the Connection Machine for direct particle simulation. The massively parallel architecture of the Connection Machine is found quite suitable for this type of calculation. However, there are difficulties in taking full advantage of this architecture because of lack of a broad based tradition of data parallel programming. An important outcome of this work has been new data parallel algorithms specifically of use for direct particle simulation but which also expand the data parallel diction.
Simultaneous deblurring and iterative reconstruction of CBCT for image guided brain radiosurgery.
Hashemi, SayedMasoud; Song, William Y; Sahgal, Arjun; Lee, Young; Huynh, Christopher; Grouza, Vladimir; Nordström, Håkan; Eriksson, Markus; Dorenlot, Antoine; Régis, Jean Marie; Mainprize, James G; Ruschin, Mark
2017-04-07
One of the limiting factors in cone-beam CT (CBCT) image quality is system blur, caused by detector response, x-ray source focal spot size, azimuthal blurring, and reconstruction algorithm. In this work, we develop a novel iterative reconstruction algorithm that improves spatial resolution by explicitly accounting for image unsharpness caused by different factors in the reconstruction formulation. While the model-based iterative reconstruction techniques use prior information about the detector response and x-ray source, our proposed technique uses a simple measurable blurring model. In our reconstruction algorithm, denoted as simultaneous deblurring and iterative reconstruction (SDIR), the blur kernel can be estimated using the modulation transfer function (MTF) slice of the CatPhan phantom or any other MTF phantom, such as wire phantoms. The proposed image reconstruction formulation includes two regularization terms: (1) total variation (TV) and (2) nonlocal regularization, solved with a split Bregman augmented Lagrangian iterative method. The SDIR formulation preserves edges, eases the parameter adjustments to achieve both high spatial resolution and low noise variances, and reduces the staircase effect caused by regular TV-penalized iterative algorithms. The proposed algorithm is optimized for a point-of-care head CBCT unit for image-guided radiosurgery and is tested with CatPhan phantom, an anthropomorphic head phantom, and 6 clinical brain stereotactic radiosurgery cases. Our experiments indicate that SDIR outperforms the conventional filtered back projection and TV penalized simultaneous algebraic reconstruction technique methods (represented by adaptive steepest-descent POCS algorithm, ASD-POCS) in terms of MTF and line pair resolution, and retains the favorable properties of the standard TV-based iterative reconstruction algorithms in improving the contrast and reducing the reconstruction artifacts. It improves the visibility of the high contrast details in bony areas and the brain soft-tissue. For example, the results show the ventricles and some brain folds become visible in SDIR reconstructed images and the contrast of the visible lesions is effectively improved. The line-pair resolution was improved from 12 line-pair/cm in FBP to 14 line-pair/cm in SDIR. Adjusting the parameters of the ASD-POCS to achieve 14 line-pair/cm caused the noise variance to be higher than the SDIR. Using these parameters for ASD-POCS, the MTF of FBP and ASD-POCS were very close and equal to 0.7 mm -1 which was increased to 1.2 mm -1 by SDIR, at half maximum.
Simultaneous deblurring and iterative reconstruction of CBCT for image guided brain radiosurgery
NASA Astrophysics Data System (ADS)
Hashemi, SayedMasoud; Song, William Y.; Sahgal, Arjun; Lee, Young; Huynh, Christopher; Grouza, Vladimir; Nordström, Håkan; Eriksson, Markus; Dorenlot, Antoine; Régis, Jean Marie; Mainprize, James G.; Ruschin, Mark
2017-04-01
One of the limiting factors in cone-beam CT (CBCT) image quality is system blur, caused by detector response, x-ray source focal spot size, azimuthal blurring, and reconstruction algorithm. In this work, we develop a novel iterative reconstruction algorithm that improves spatial resolution by explicitly accounting for image unsharpness caused by different factors in the reconstruction formulation. While the model-based iterative reconstruction techniques use prior information about the detector response and x-ray source, our proposed technique uses a simple measurable blurring model. In our reconstruction algorithm, denoted as simultaneous deblurring and iterative reconstruction (SDIR), the blur kernel can be estimated using the modulation transfer function (MTF) slice of the CatPhan phantom or any other MTF phantom, such as wire phantoms. The proposed image reconstruction formulation includes two regularization terms: (1) total variation (TV) and (2) nonlocal regularization, solved with a split Bregman augmented Lagrangian iterative method. The SDIR formulation preserves edges, eases the parameter adjustments to achieve both high spatial resolution and low noise variances, and reduces the staircase effect caused by regular TV-penalized iterative algorithms. The proposed algorithm is optimized for a point-of-care head CBCT unit for image-guided radiosurgery and is tested with CatPhan phantom, an anthropomorphic head phantom, and 6 clinical brain stereotactic radiosurgery cases. Our experiments indicate that SDIR outperforms the conventional filtered back projection and TV penalized simultaneous algebraic reconstruction technique methods (represented by adaptive steepest-descent POCS algorithm, ASD-POCS) in terms of MTF and line pair resolution, and retains the favorable properties of the standard TV-based iterative reconstruction algorithms in improving the contrast and reducing the reconstruction artifacts. It improves the visibility of the high contrast details in bony areas and the brain soft-tissue. For example, the results show the ventricles and some brain folds become visible in SDIR reconstructed images and the contrast of the visible lesions is effectively improved. The line-pair resolution was improved from 12 line-pair/cm in FBP to 14 line-pair/cm in SDIR. Adjusting the parameters of the ASD-POCS to achieve 14 line-pair/cm caused the noise variance to be higher than the SDIR. Using these parameters for ASD-POCS, the MTF of FBP and ASD-POCS were very close and equal to 0.7 mm-1 which was increased to 1.2 mm-1 by SDIR, at half maximum.
A biological phantom for evaluation of CT image reconstruction algorithms
NASA Astrophysics Data System (ADS)
Cammin, J.; Fung, G. S. K.; Fishman, E. K.; Siewerdsen, J. H.; Stayman, J. W.; Taguchi, K.
2014-03-01
In recent years, iterative algorithms have become popular in diagnostic CT imaging to reduce noise or radiation dose to the patient. The non-linear nature of these algorithms leads to non-linearities in the imaging chain. However, the methods to assess the performance of CT imaging systems were developed assuming the linear process of filtered backprojection (FBP). Those methods may not be suitable any longer when applied to non-linear systems. In order to evaluate the imaging performance, a phantom is typically scanned and the image quality is measured using various indices. For reasons of practicality, cost, and durability, those phantoms often consist of simple water containers with uniform cylinder inserts. However, these phantoms do not represent the rich structure and patterns of real tissue accurately. As a result, the measured image quality or detectability performance for lesions may not reflect the performance on clinical images. The discrepancy between estimated and real performance may be even larger for iterative methods which sometimes produce "plastic-like", patchy images with homogeneous patterns. Consequently, more realistic phantoms should be used to assess the performance of iterative algorithms. We designed and constructed a biological phantom consisting of porcine organs and tissue that models a human abdomen, including liver lesions. We scanned the phantom on a clinical CT scanner and compared basic image quality indices between filtered backprojection and an iterative reconstruction algorithm.
Iterative Transform Phase Diversity: An Image-Based Object and Wavefront Recovery
NASA Technical Reports Server (NTRS)
Smith, Jeffrey
2012-01-01
The Iterative Transform Phase Diversity algorithm is designed to solve the problem of recovering the wavefront in the exit pupil of an optical system and the object being imaged. This algorithm builds upon the robust convergence capability of Variable Sampling Mapping (VSM), in combination with the known success of various deconvolution algorithms. VSM is an alternative method for enforcing the amplitude constraints of a Misell-Gerchberg-Saxton (MGS) algorithm. When provided the object and additional optical parameters, VSM can accurately recover the exit pupil wavefront. By combining VSM and deconvolution, one is able to simultaneously recover the wavefront and the object.
Optimal Decentralized Protocol for Electric Vehicle Charging
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gan, LW; Topcu, U; Low, SH
We propose a decentralized algorithm to optimally schedule electric vehicle (EV) charging. The algorithm exploits the elasticity of electric vehicle loads to fill the valleys in electric load profiles. We first formulate the EV charging scheduling problem as an optimal control problem, whose objective is to impose a generalized notion of valley-filling, and study properties of optimal charging profiles. We then give a decentralized algorithm to iteratively solve the optimal control problem. In each iteration, EVs update their charging profiles according to the control signal broadcast by the utility company, and the utility company alters the control signal to guidemore » their updates. The algorithm converges to optimal charging profiles (that are as "flat" as they can possibly be) irrespective of the specifications (e.g., maximum charging rate and deadline) of EVs, even if EVs do not necessarily update their charging profiles in every iteration, and use potentially outdated control signal when they update. Moreover, the algorithm only requires each EV solving its local problem, hence its implementation requires low computation capability. We also extend the algorithm to track a given load profile and to real-time implementation.« less
A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images
Xu, Songhua; Krauthammer, Michael
2010-01-01
There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper’s key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. In this paper, we demonstrate that a projection histogram-based text detection approach is well suited for text detection in biomedical images, with a performance of F score of .60. The approach performs better than comparable approaches for text detection. Further, we show that the iterative application of the algorithm is boosting overall detection performance. A C++ implementation of our algorithm is freely available through email request for academic use. PMID:20887803
Marginal Consistency: Upper-Bounding Partition Functions over Commutative Semirings.
Werner, Tomás
2015-07-01
Many inference tasks in pattern recognition and artificial intelligence lead to partition functions in which addition and multiplication are abstract binary operations forming a commutative semiring. By generalizing max-sum diffusion (one of convergent message passing algorithms for approximate MAP inference in graphical models), we propose an iterative algorithm to upper bound such partition functions over commutative semirings. The iteration of the algorithm is remarkably simple: change any two factors of the partition function such that their product remains the same and their overlapping marginals become equal. In many commutative semirings, repeating this iteration for different pairs of factors converges to a fixed point when the overlapping marginals of every pair of factors coincide. We call this state marginal consistency. During that, an upper bound on the partition function monotonically decreases. This abstract algorithm unifies several existing algorithms, including max-sum diffusion and basic constraint propagation (or local consistency) algorithms in constraint programming. We further construct a hierarchy of marginal consistencies of increasingly higher levels and show than any such level can be enforced by adding identity factors of higher arity (order). Finally, we discuss instances of the framework for several semirings, including the distributive lattice and the max-sum and sum-product semirings.
Runtime support for parallelizing data mining algorithms
NASA Astrophysics Data System (ADS)
Jin, Ruoming; Agrawal, Gagan
2002-03-01
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the technique we have developed starting from a common specification of the algorithm.
NASA Technical Reports Server (NTRS)
Zhang, Jun; Ge, Lixin; Kouatchou, Jules
2000-01-01
A new fourth order compact difference scheme for the three dimensional convection diffusion equation with variable coefficients is presented. The novelty of this new difference scheme is that it Only requires 15 grid points and that it can be decoupled with two colors. The entire computational grid can be updated in two parallel subsweeps with the Gauss-Seidel type iterative method. This is compared with the known 19 point fourth order compact differenCe scheme which requires four colors to decouple the computational grid. Numerical results, with multigrid methods implemented on a shared memory parallel computer, are presented to compare the 15 point and the 19 point fourth order compact schemes.
Parallel Reconstruction Using Null Operations (PRUNO)
Zhang, Jian; Liu, Chunlei; Moseley, Michael E.
2011-01-01
A novel iterative k-space data-driven technique, namely Parallel Reconstruction Using Null Operations (PRUNO), is presented for parallel imaging reconstruction. In PRUNO, both data calibration and image reconstruction are formulated into linear algebra problems based on a generalized system model. An optimal data calibration strategy is demonstrated by using Singular Value Decomposition (SVD). And an iterative conjugate- gradient approach is proposed to efficiently solve missing k-space samples during reconstruction. With its generalized formulation and precise mathematical model, PRUNO reconstruction yields good accuracy, flexibility, stability. Both computer simulation and in vivo studies have shown that PRUNO produces much better reconstruction quality than autocalibrating partially parallel acquisition (GRAPPA), especially under high accelerating rates. With the aid of PRUO reconstruction, ultra high accelerating parallel imaging can be performed with decent image quality. For example, we have done successful PRUNO reconstruction at a reduction factor of 6 (effective factor of 4.44) with 8 coils and only a few autocalibration signal (ACS) lines. PMID:21604290
Efficient convex-elastic net algorithm to solve the Euclidean traveling salesman problem.
Al-Mulhem, M; Al-Maghrabi, T
1998-01-01
This paper describes a hybrid algorithm that combines an adaptive-type neural network algorithm and a nondeterministic iterative algorithm to solve the Euclidean traveling salesman problem (E-TSP). It begins with a brief introduction to the TSP and the E-TSP. Then, it presents the proposed algorithm with its two major components: the convex-elastic net (CEN) algorithm and the nondeterministic iterative improvement (NII) algorithm. These two algorithms are combined into the efficient convex-elastic net (ECEN) algorithm. The CEN algorithm integrates the convex-hull property and elastic net algorithm to generate an initial tour for the E-TSP. The NII algorithm uses two rearrangement operators to improve the initial tour given by the CEN algorithm. The paper presents simulation results for two instances of E-TSP: randomly generated tours and tours for well-known problems in the literature. Experimental results are given to show that the proposed algorithm ran find the nearly optimal solution for the E-TSP that outperform many similar algorithms reported in the literature. The paper concludes with the advantages of the new algorithm and possible extensions.
NASA Astrophysics Data System (ADS)
Tolson, B.; Matott, L. S.; Gaffoor, T. A.; Asadzadeh, M.; Shafii, M.; Pomorski, P.; Xu, X.; Jahanpour, M.; Razavi, S.; Haghnegahdar, A.; Craig, J. R.
2015-12-01
We introduce asynchronous parallel implementations of the Dynamically Dimensioned Search (DDS) family of algorithms including DDS, discrete DDS, PA-DDS and DDS-AU. These parallel algorithms are unique from most existing parallel optimization algorithms in the water resources field in that parallel DDS is asynchronous and does not require an entire population (set of candidate solutions) to be evaluated before generating and then sending a new candidate solution for evaluation. One key advance in this study is developing the first parallel PA-DDS multi-objective optimization algorithm. The other key advance is enhancing the computational efficiency of solving optimization problems (such as model calibration) by combining a parallel optimization algorithm with the deterministic model pre-emption concept. These two efficiency techniques can only be combined because of the asynchronous nature of parallel DDS. Model pre-emption functions to terminate simulation model runs early, prior to completely simulating the model calibration period for example, when intermediate results indicate the candidate solution is so poor that it will definitely have no influence on the generation of further candidate solutions. The computational savings of deterministic model preemption available in serial implementations of population-based algorithms (e.g., PSO) disappear in synchronous parallel implementations as these algorithms. In addition to the key advances above, we implement the algorithms across a range of computation platforms (Windows and Unix-based operating systems from multi-core desktops to a supercomputer system) and package these for future modellers within a model-independent calibration software package called Ostrich as well as MATLAB versions. Results across multiple platforms and multiple case studies (from 4 to 64 processors) demonstrate the vast improvement over serial DDS-based algorithms and highlight the important role model pre-emption plays in the performance of parallel, pre-emptable DDS algorithms. Case studies include single- and multiple-objective optimization problems in water resources model calibration and in many cases linear or near linear speedups are observed.
A Novel Particle Swarm Optimization Algorithm for Global Optimization
Wang, Chun-Feng; Liu, Kui
2016-01-01
Particle Swarm Optimization (PSO) is a recently developed optimization method, which has attracted interest of researchers in various areas due to its simplicity and effectiveness, and many variants have been proposed. In this paper, a novel Particle Swarm Optimization algorithm is presented, in which the information of the best neighbor of each particle and the best particle of the entire population in the current iteration is considered. Meanwhile, to avoid premature, an abandoned mechanism is used. Furthermore, for improving the global convergence speed of our algorithm, a chaotic search is adopted in the best solution of the current iteration. To verify the performance of our algorithm, standard test functions have been employed. The experimental results show that the algorithm is much more robust and efficient than some existing Particle Swarm Optimization algorithms. PMID:26955387
Parallel transformation of K-SVD solar image denoising algorithm
NASA Astrophysics Data System (ADS)
Liang, Youwen; Tian, Yu; Li, Mei
2017-02-01
The images obtained by observing the sun through a large telescope always suffered with noise due to the low SNR. K-SVD denoising algorithm can effectively remove Gauss white noise. Training dictionaries for sparse representations is a time consuming task, due to the large size of the data involved and to the complexity of the training algorithms. In this paper, an OpenMP parallel programming language is proposed to transform the serial algorithm to the parallel version. Data parallelism model is used to transform the algorithm. Not one atom but multiple atoms updated simultaneously is the biggest change. The denoising effect and acceleration performance are tested after completion of the parallel algorithm. Speedup of the program is 13.563 in condition of using 16 cores. This parallel version can fully utilize the multi-core CPU hardware resources, greatly reduce running time and easily to transplant in multi-core platform.
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
NASA Technical Reports Server (NTRS)
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Guided particle swarm optimization method to solve general nonlinear optimization problems
NASA Astrophysics Data System (ADS)
Abdelhalim, Alyaa; Nakata, Kazuhide; El-Alem, Mahmoud; Eltawil, Amr
2018-04-01
The development of hybrid algorithms is becoming an important topic in the global optimization research area. This article proposes a new technique in hybridizing the particle swarm optimization (PSO) algorithm and the Nelder-Mead (NM) simplex search algorithm to solve general nonlinear unconstrained optimization problems. Unlike traditional hybrid methods, the proposed method hybridizes the NM algorithm inside the PSO to improve the velocities and positions of the particles iteratively. The new hybridization considers the PSO algorithm and NM algorithm as one heuristic, not in a sequential or hierarchical manner. The NM algorithm is applied to improve the initial random solution of the PSO algorithm and iteratively in every step to improve the overall performance of the method. The performance of the proposed method was tested over 20 optimization test functions with varying dimensions. Comprehensive comparisons with other methods in the literature indicate that the proposed solution method is promising and competitive.
A proximity algorithm accelerated by Gauss-Seidel iterations for L1/TV denoising models
NASA Astrophysics Data System (ADS)
Li, Qia; Micchelli, Charles A.; Shen, Lixin; Xu, Yuesheng
2012-09-01
Our goal in this paper is to improve the computational performance of the proximity algorithms for the L1/TV denoising model. This leads us to a new characterization of all solutions to the L1/TV model via fixed-point equations expressed in terms of the proximity operators. Based upon this observation we develop an algorithm for solving the model and establish its convergence. Furthermore, we demonstrate that the proposed algorithm can be accelerated through the use of the componentwise Gauss-Seidel iteration so that the CPU time consumed is significantly reduced. Numerical experiments using the proposed algorithm for impulsive noise removal are included, with a comparison to three recently developed algorithms. The numerical results show that while the proposed algorithm enjoys a high quality of the restored images, as the other three known algorithms do, it performs significantly better in terms of computational efficiency measured in the CPU time consumed.
Accuracy Improvement for Light-Emitting-Diode-Based Colorimeter by Iterative Algorithm
NASA Astrophysics Data System (ADS)
Yang, Pao-Keng
2011-09-01
We present a simple algorithm, combining an interpolating method with an iterative calculation, to enhance the resolution of spectral reflectance by removing the spectral broadening effect due to the finite bandwidth of the light-emitting diode (LED) from it. The proposed algorithm can be used to improve the accuracy of a reflective colorimeter using multicolor LEDs as probing light sources and is also applicable to the case when the probing LEDs have different bandwidths in different spectral ranges, to which the powerful deconvolution method cannot be applied.