Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To address this issue, we use a parallel GPU to implement a broadly used compressed sensing algorithm, the linear Bregman algorithm. The linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only vector and matrix multiplication and a thresholding operation, and is therefore simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with a traditional CPU implementation of the Bregman algorithm, and also with other CS reconstruction algorithms such as OMP and TwIST. The results show that the parallel Bregman algorithm needs less time than either of these algorithms, and is thus better suited to real-time object reconstruction, which matters for the fast-growing demands of information technology.
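Because the method reduces to matrix-vector products plus an elementwise thresholding, the GPU mapping is straightforward. Below is a minimal CUDA sketch of the shrink (soft-thresholding) step at the core of the linearized Bregman iteration, u = delta * shrink(v, mu); the kernel and parameter names are illustrative assumptions, and the matrix-vector products v += A^T(b - A u) would typically be left to a BLAS library.

// Elementwise soft thresholding: u[i] = delta * sign(v[i]) * max(|v[i]| - mu, 0).
__global__ void shrink_kernel(const float *v, float *u, float mu, float delta, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = v[i];
        float mag = fabsf(x) - mu;                        // distance past the threshold
        u[i] = delta * copysignf(fmaxf(mag, 0.0f), x);    // keep the sign of v[i]
    }
}

// Launch example: shrink_kernel<<<(n + 255) / 256, 256>>>(d_v, d_u, mu, delta, n);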
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture, implemented with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program with a speedup of 20 times and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB at the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is strongly related to memory bandwidth, which gives an insight for new architecture design. PMID:24757432
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs), which feature a many-core, fine-grained parallel architecture, provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism: we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different texture dimensionalities. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
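The pixel-level decomposition described here maps naturally onto one CUDA thread per texel. A minimal sketch of one such step, computing the error image between the shape-normalized input texture and the synthesized model texture, is shown below; the kernel and buffer names are illustrative, not from the paper.

// One thread per texture element of the AAM error image.
__global__ void aam_error_image(const float *warped_input,   // input sampled at model texels
                                const float *model_texture,  // synthesized AAM texture
                                float *error, int n_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pixels)
        error[i] = warped_input[i] - model_texture[i];
}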
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component of model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of a parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB on a single computer equipped with dual Xeon 6-Core E5690 CPUs and an NVIDIA Tesla C2070 GPU parallel computing card containing 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing parameter estimates and model computation times. The speedup factor was used to assess the relative benefit of the parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation times than the MCPEMCPU and can offer more than a 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise as the core of the next generation of modeling software for population PK/PD analysis.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In photoelectrical tracking systems, Bayer images are traditionally decompressed with a CPU-based method. However, this is too slow when the images become large, for example 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's graphics processing unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part including inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce execution time, the task-parallel part is optimized with OpenMP techniques, while the data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared-memory access optimization, coalesced global-memory access optimization, and texture-memory optimization. In particular, the IDWT is significantly sped up by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
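The 1D-parallel IDWT rewrite mentioned above pairs each output sample pair with one thread. Purely as an illustration (the paper does not state which wavelet it uses), here is a minimal CUDA sketch of a single Haar synthesis pass; rows would be processed independently and the same pass then applied down the columns to realize the 2D transform.

// One inverse Haar step: rebuild 2*half_n samples from approximation
// and detail coefficients. One thread per output pair.
__global__ void haar_idwt_1d(const float *approx, const float *detail,
                             float *out, int half_n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half_n) {
        const float s = 0.70710678f;               // 1/sqrt(2)
        out[2 * i]     = s * (approx[i] + detail[i]);
        out[2 * i + 1] = s * (approx[i] - detail[i]);
    }
}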
Comparison of GPU- and CPU-implementations of mean-firing rate neural networks on parallel hardware.
Dinkelbach, Helge Ülo; Vitay, Julien; Beuth, Frederik; Hamker, Fred H
2012-01-01
Modern parallel hardware such as multi-core processors (CPUs) and graphics processing units (GPUs) has high computational power which can be greatly beneficial to the simulation of large-scale neural networks. Over the past years, a number of efforts have focused on developing parallel algorithms and simulators best suited for the simulation of spiking neural models. In this article, we aim at investigating the advantages and drawbacks of the CPU and GPU parallelization of mean-firing rate neurons, widely used in systems-level computational neuroscience. By comparing OpenMP, CUDA and OpenCL implementations against a serial CPU implementation, we show that GPUs are better suited than CPUs for the simulation of very large networks, but that smaller networks benefit more from an OpenMP implementation. As this performance strongly depends on data organization, we analyze the impact of various factors such as data structure, memory alignment and floating-point precision. We then discuss the suitability of the different hardware depending on the networks' size and connectivity, as random or sparse connectivity in mean-firing rate networks tends to break parallel performance on GPUs by violating memory coalescence.
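To make the data-organization point concrete, a minimal CUDA sketch of a mean-firing rate update is given below; the names and the rectified transfer function are illustrative assumptions. Note that the naive row-major weight access w[i*n+j] is exactly the kind of layout decision the authors show can make or break coalescing and hence GPU performance.

// Leaky rate dynamics: tau * dr/dt = -r + f(W r + input), explicit Euler step.
__global__ void rate_update(const float *r, const float *w, const float *input,
                            float *r_new, int n, float dt, float tau)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)            // weighted sum of presynaptic rates
            sum += w[i * n + j] * r[j];        // row-major: uncoalesced across threads
        float drive = fmaxf(sum + input[i], 0.0f);   // rectified transfer function
        r_new[i] = r[i] + dt / tau * (drive - r[i]);
    }
}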
Gpu Implementation of Preconditioning Method for Low-Speed Flows
NASA Astrophysics Data System (ADS)
Zhang, Jiale; Chen, Hongquan
2016-06-01
An improved preconditioning method for low-Mach-number flows is implemented on a GPU platform. The improved method employs the fluctuation of the fluid variables to weaken the influence of truncation error on accuracy. A GPU parallel computing platform is used to accelerate the calculations. Details of both the improved preconditioning method and the GPU implementation technology are described in this paper. A set of typical low-speed flow cases is then simulated for both validation and performance analysis of the resulting GPU solver. Numerical results show that speedups of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that the GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate CFD simulations of low-speed flows.
Parallel hyperspectral compressive sensing method on GPU
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. Since the bandwidth of the connection between the satellite/airborne platform and the ground station is limited, an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation among spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, the GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of an Intel i7-2600 CPU (3.4 GHz) with 16 GB of memory.
Parallelization and checkpointing of GPU applications through program transformation
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating such applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized to run on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with large memory footprints. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, these impose execution overhead and are not portable. On traditional CPU systems, by contrast, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and avoids the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine today, reliability is also a concern for GPUs, which are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This poses a major challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available due to their high cost. The GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole-heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, the multicellular tissue model was split into two components: the single-cell model (a system of ordinary differential equations) and the diffusion term of the monodomain model (a partial differential equation). This decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole-heart simulations. PMID:26581957
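The ODE/PDE split described above typically runs as two kernels per time step: one advancing the ionic cell model at each node, and one applying the diffusion operator. Below is a minimal CUDA sketch of the PDE half as an explicit 7-point finite-difference update on a regular grid; the grid layout, the isotropic diffusion coefficient D, and all names are illustrative assumptions, not the authors' code.

// Explicit update of the monodomain diffusion term on an nx*ny*nz grid
// (7-point Laplacian, grid spacing h). The ionic-current (ODE) update
// would run as a separate one-thread-per-cell kernel.
__global__ void diffusion_step(const float *V, float *Vn,
                               int nx, int ny, int nz, float dt, float D, float h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;                                    // skip boundary nodes
    int i = (z * ny + y) * nx + x;
    float lap = (V[i - 1] + V[i + 1] + V[i - nx] + V[i + nx]
               + V[i - nx * ny] + V[i + nx * ny] - 6.0f * V[i]) / (h * h);
    Vn[i] = V[i] + dt * D * lap;
}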
GPU-based Parallel Application Design for Emerging Mobile Devices
NASA Astrophysics Data System (ADS)
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
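The Persistent Threads style mentioned above inverts the usual launch model: instead of one short-lived thread per data element, a fixed number of long-lived threads repeatedly pull work from a global queue. A minimal CUDA sketch of the pattern follows; the queue layout and the placeholder work are assumptions for illustration.

// Launch with only as many blocks as the device can keep resident;
// each thread loops, claiming work items until the queue is drained.
__global__ void persistent_worker(float *data, int n_items, int *next_item)
{
    for (;;) {
        int item = atomicAdd(next_item, 1);   // claim the next work item
        if (item >= n_items)
            return;                           // queue empty: thread retires
        data[item] *= 2.0f;                   // placeholder for real work
    }
}

Because the worker count is decoupled from the problem size, the same resident threads can also poll for host-issued commands, which is the direction the 'applications processor' model described above points toward.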
Narayanaswamy, Arunachalam; Dwarakapuram, Saritha; Bjornsson, Christopher S.; Cutler, Barbara M.; Shain, William
2010-01-01
This paper presents robust 3-D algorithms to segment vasculature that is imaged by labeling laminae, rather than the lumenal volume. The signal is weak, sparse, noisy, nonuniform, low-contrast, and exhibits gaps and spectral artifacts, so adaptive thresholding and Hessian filtering based methods are not effective. The structure also deviates from a tubular geometry, so tracing algorithms are not effective. We propose a four-step approach. The first step detects candidate voxels using a robust hypothesis test based on a model that assumes Poisson noise and locally planar geometry. The second step performs an adaptive region growth to extract weakly labeled and fine vessels while rejecting spectral artifacts. The third step constructs an accurate mesh representation using marching tetrahedra, volume-preserving smoothing, and adaptive decimation algorithms, enabling interactive visualization and estimation of features such as statistical confidence, local curvature, local thickness, and local normal. The final step estimates vessel centerlines using a ray-casting and vote-accumulation algorithm, enabling topological analysis and efficient validation. Our algorithm lends itself to parallel processing, and yielded an 8× speedup on a graphics processor (GPU). On synthetic data, our meshes had average error per face (EPF) values of 0.1–1.6 voxels per mesh face for peak signal-to-noise ratios from 110 dB down to 28 dB. Separately, when the mesh was decimated to less than 1% of its original size, the EPF was less than 1 voxel per face. When validated on real datasets, the average recall and precision values were found to be 94.66% and 94.84%, respectively. PMID:20199906
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is a branch of fluid mechanics which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic shape optimization (e.g. planes, cars, ships, sport helmets, skis), analysis of natural phenomena, numerical simulation for weather forecasting, and realistic visualizations. CFD problems are very complex and need a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free-surface approximation (the MAC method) on new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment, due to its simplicity of programming and good computational performance. We used a finite-difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of the local, surrounding cells, so it is well suited to GPU-based architectures. In this paper we demonstrate how to use the computing power of GPUs efficiently for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges around multi-GPU implementation using the example of matrix multiplication.
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs), along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to use GASPRNG the same way as SPRNG on traditional serial or parallel computers, as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications.
Catalogue identifier: AEOI_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: UTK license.
No. of lines in distributed program, including test data, etc.: 167900
No. of bytes in distributed program, including test data, etc.: 1422058
Distribution format: tar.gz
Programming language: C and CUDA.
Computer: Any PC or
GPU-based parallel group ICA for functional magnetic resonance data.
Jing, Yanshan; Zeng, Weiming; Wang, Nizhuan; Ren, Tianlong; Shi, Yingchao; Yin, Jun; Xu, Qi
2015-04-01
The goal of our study is to develop a fast parallel implementation of group independent component analysis (ICA) for functional magnetic resonance imaging (fMRI) data using graphics processing units (GPU). Though ICA has become a standard method for identifying brain functional connectivity in fMRI data, it is computationally intensive, and group data analysis is especially costly. GPUs, with their high parallel computation power and low cost, are widely used for general-purpose computing and can contribute significantly to fMRI data analysis. In this study, a parallel group ICA (PGICA) on GPU, mainly consisting of GPU-based PCA using SVD and Infomax-ICA, is presented. In comparison to serial group ICA, the proposed method demonstrated a significant speedup of 6-11 times and comparable accuracy of functional networks in our experiments. The proposed method is expected to enable real-time post-processing for fMRI data analysis.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularity in scientific computing over the past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculation, so a recent GPU card, whose peak computing performance and memory bandwidth far exceed those of a contemporary high-end CPU, is preferable. We herein focus on the detailed implementation of our GPU-targeted Reynolds-averaged Navier-Stokes equations solver based on the finite-volume method. The solver employs a vertex-centered scheme on unstructured grids so as to be capable of handling complex topologies. Multiple optimizations are carried out to improve the memory access performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using an explicit Runge-Kutta scheme. The GPU-accelerated solver in this paper is demonstrated to have competitive advantages over the CPU-targeted one.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and OpenMP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss the design of the code in detail and illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows efficient calculation of non-bonded interactions that include water molecules and results in a speedup factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up fourfold from the factor of 10 reported in our initial GPU implementation, which did not include water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints on all bonds, runs in parallel on multiple OpenMP cores or entirely on the GPU. It is based on a Conjugate Gradient solution for the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME). PMID:23264758
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems.
Teodoro, George; Kurc, Tahsin M; Pan, Tony; Cooper, Lee A D; Kong, Jun; Widener, Patrick; Saltz, Joel H
2012-05-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
NASA Astrophysics Data System (ADS)
Kargaran, Hamed; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
2016-04-01
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudorandom number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory used, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map, along with the Xorshift PRNG, has been employed. Implementation of our developed GPPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of the developed GPPRNG, its performance was compared to that of other available PRNGs, such as those of MATLAB and FORTRAN and the Park-Miller algorithm, using standard statistical tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
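Although the paper's code is CUDA Fortran, the Xorshift core it builds on is a few lines in any language. Below is a hedged CUDA C sketch of a per-thread Xorshift32 generator with independent state per thread; the seeding scheme, names, and the batch-fill kernel are illustrative assumptions.

// Marsaglia xorshift32: state must be seeded nonzero, one state per thread.
__device__ unsigned int xorshift32(unsigned int *state)
{
    unsigned int x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

// Each thread fills its own stretch of the output with uniform floats in [0,1).
__global__ void fill_uniform(float *out, unsigned int *states, int n_per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s = states[tid];
    for (int k = 0; k < n_per_thread; ++k)
        out[tid * n_per_thread + k] = xorshift32(&s) * 2.3283064e-10f;  // * 2^-32
    states[tid] = s;   // persist state across kernel launches
}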
Implementation of 2D computational models for NDE on GPU
NASA Astrophysics Data System (ADS)
Bardel, Charles; Lei, Naiguang; Udpa, Lalita
2012-05-01
This paper presents an attempt to implement a simulation model for electromagnetic NDE on a GPU. A sample electromagnetic NDE problem is examined and the solution is computed on both the CPU and the GPU. Different matrix storage formats and matrix-vector computational strategies are investigated. An analysis of the storage requirements for the matrix on the GPU is tabulated, and a full timing breakdown of the process is presented and discussed.
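As a concrete instance of the storage-format trade-off studied here, the sketch below shows the textbook one-thread-per-row CUDA kernel for sparse matrix-vector multiplication in compressed sparse row (CSR) format; it is a generic illustration, not the paper's implementation.

// y = A * x with A stored in CSR (row_ptr has length n_rows + 1).
__global__ void spmv_csr(const int *row_ptr, const int *col_idx, const float *val,
                         const float *x, float *y, int n_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}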
Rapid Parallel Calculation of shell Element Based On GPU
NASA Astrophysics Data System (ADS)
Wang, Jian Hua; Li, Guang Yao; Li, Sheng
2010-06-01
Long computing times have bottlenecked the application of the finite element method. In this paper, an effective method to speed up FEM calculations using a modern graphics processing unit and programmable shading tools is put forward. The method devises a representation of element information that matches the features of the GPU, converts all element calculations into rendering passes, carries out the internal-force calculation for all elements in this way, and overcomes the low degree of parallelism of earlier single-computer implementations. Studies show that this method can improve efficiency and greatly shorten calculation time. The results of simulations of an elasticity problem with a large number of cells in sheet metal prove that GPU parallel simulation is faster than the CPU equivalent. This approach is useful and efficient for solving engineering problems.
Bustamam, Alhadi; Burrage, Kevin; Hamilton, Nicholas A
2012-01-01
Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However, with the increasing amounts of data on biological networks, performance and scalability are becoming critical limiting factors in applications. Meanwhile, GPU computing, which uses CUDA to implement a massively parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option for achieving substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers latency, circumventing a major issue in other parallel computing environments such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform the parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations that are at the heart of MCL. We utilized the ELLPACK-R sparse matrix format to allow effective, fine-grained, massively parallel processing that copes with the sparse nature of the interaction-network data sets found in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on a CPU. Large-scale parallel computation on off-the-shelf desktop machines, previously only possible on supercomputing architectures, can thus significantly change the way bioinformaticians and biologists deal with their data.
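For reference, the ELLPACK-R layout pads each row to a common length and stores values column-major, with an auxiliary array of true row lengths, so consecutive threads read consecutive memory. A generic CUDA sketch of the resulting SpMV kernel is below; it illustrates the format, not the CUDA-MCL code itself.

// y = A * x with A in ELLPACK-R: val/col are n_rows x max_len, stored
// column-major; rl[row] holds the actual number of nonzeros in each row.
__global__ void spmv_ellr(const float *val, const int *col, const int *rl,
                          const float *x, float *y, int n_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        for (int k = 0; k < rl[row]; ++k) {
            int idx = k * n_rows + row;       // column-major: coalesced across threads
            sum += val[idx] * x[col[idx]];
        }
        y[row] = sum;
    }
}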
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels due to the low spatial resolution of such images, which means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient GPU implementation, using CUDA, of an unsupervised linear unmixing method, the simplex identification via split augmented Lagrangian (SISAL). The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems, using variable splitting to obtain a constrained formulation and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced memory accesses. The results presented here indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
Embedded GPU implementation of anomaly detection for hyperspectral images
NASA Astrophysics Data System (ADS)
Wu, Yuanfeng; Gao, Lianru; Zhang, Bing; Yang, Bin; Chen, Zhengchao
2015-10-01
Anomaly detection is one of the most important techniques for remotely sensed hyperspectral data interpretation. Developing fast processing techniques for anomaly detection has received considerable attention in recent years, especially in analysis scenarios with real-time constraints. In this paper, we develop an embedded graphics processing unit (GPU) based parallel implementation of a streaming background statistics anomaly detection algorithm. The streaming background statistics method supports real-time anomaly detection, meaning that processing can be performed while the data are being collected. The algorithm is implemented on an NVIDIA Jetson TK1 development kit. Experiments conducted with real hyperspectral data indicate the effectiveness of the proposed implementation. This work shows that embedded GPUs offer a promising solution for high-performance, low-power hyperspectral image applications.
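Streaming background statistics boil down to incremental updates of the background mean and covariance as each new pixel vector arrives. As a hedged sketch of the idea (the names and the one-pixel-at-a-time granularity are assumptions), the kernel below applies the standard running-mean update with one thread per band; the covariance update follows the same pattern.

// Running mean over n_seen previous pixels: mean += (x - mean) / (n_seen + 1).
__global__ void update_mean(const float *pixel,   // newly acquired pixel vector
                            float *mean, int n_bands, int n_seen)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < n_bands)
        mean[b] += (pixel[b] - mean[b]) / (float)(n_seen + 1);
}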
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core 2 Quad Q6600 CPU and a GeForce 8800GT GPU, with software support from OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus one core of the CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setup (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setup (a), 16.8 in setup (b), and 20.0 in setup (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. PMID:20674066
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, the resolution of hyperspectral images has grown enormously, so the data size of hyperspectral images is becoming larger. In order to reduce storage and transmission costs, lossless compression of hyperspectral images has become an important research topic. In recent years, a large number of algorithms have been proposed to reduce the redundancy between different spectra. Among them, the most classical and extensible algorithm is Clustered Differential Pulse Code Modulation (C-DPCM). This algorithm has three parts: it first clusters all spectral lines and trains linear predictors for each band; it then uses these predictors to predict pixels and obtains the residual image by subtracting the predicted image from the original image; finally, it encodes the residual image. However, the process of calculating the predictors is time-consuming. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) on the GPU. Recently, general-purpose computing on GPUs has developed greatly; the capability of GPUs improves rapidly with increasing numbers of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA that gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to compute the predictors in parallel. By adopting global memory, shared memory, and register memory respectively, we finally obtain a decent speedup.
Study of improved ray tracing parallel algorithm for CGH of 3D objects on GPU
NASA Astrophysics Data System (ADS)
Cong, Bin; Jiang, Xiaoyu; Yao, Jun; Zhao, Kai
2014-11-01
An improved parallel algorithm for computer-generated holograms (CGH) of three-dimensional objects is presented. Based on the physical characteristics and mathematical properties of the original ray tracing algorithm for CGH, and using transform approximation and numerical analysis methods, we extract the parts of the ray tracing algorithm that are amenable to parallelization and implement them on a graphics processing unit (GPU). Meanwhile, through proper design of the parallel numerical procedure, we parallelize the processing of the two-dimensional slices of the three-dimensional object with CUDA. Based on the experiments, an effective method of dealing with the occlusion problem in ray tracing is proposed, as well as a way of generating holograms of 3D objects using their additive property. Our results indicate that the improved algorithm can effectively shorten the computing time. Depending on the sizes of the spatial object points and hologram pixels, the speed is increased by 20 to 70 times compared with the original ray tracing algorithm.
NASA Astrophysics Data System (ADS)
Trigueros-Espinosa, Blas; Vélez-Reyes, Miguel; Santiago-Santiago, Nayda G.; Rosario-Torres, Samuel
2011-06-01
Hyperspectral sensors can collect hundreds of images taken at different narrow and contiguously spaced spectral bands. This high-resolution spectral information can be used to identify materials and objects within the field of view of the sensor by their spectral signature, but the process may be computationally intensive due to the large data sizes generated by hyperspectral sensors, typically hundreds of megabytes. This can be an important limitation for applications where the detection process must be performed in real time (surveillance, explosive detection, etc.). In this work, we developed parallel implementations of three state-of-the-art target detection algorithms (the RX algorithm, the matched filter, and the adaptive matched subspace detector) on a graphics processing unit (GPU) based on the NVIDIA CUDA architecture. In addition, a multi-core CPU-based implementation of each algorithm was developed to serve as a baseline for speedup estimation. We evaluated the performance of the GPU-based implementations using an NVIDIA Tesla C1060 GPU card, and the detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We achieved a maximum speedup in the GPU implementations of around 20x over the multi-core CPU-based implementation, which suggests that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.
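Among the three detectors, RX is the easiest to picture on a GPU: every pixel independently receives the Mahalanobis distance between its spectrum and the background statistics. The CUDA sketch below assumes the background mean and inverse covariance have already been computed; it is a generic illustration, not the authors' implementation.

// RX score per pixel: (x - mu)^T * Sigma^{-1} * (x - mu), one thread per pixel.
__global__ void rx_detector(const float *img,      // n_pixels x n_bands, row-major
                            const float *mean, const float *cov_inv,
                            float *score, int n_pixels, int n_bands)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pixels)
        return;
    float s = 0.0f;
    for (int i = 0; i < n_bands; ++i) {
        float di = img[p * n_bands + i] - mean[i];
        for (int j = 0; j < n_bands; ++j) {
            float dj = img[p * n_bands + j] - mean[j];
            s += di * cov_inv[i * n_bands + j] * dj;
        }
    }
    score[p] = s;
}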
NASA Astrophysics Data System (ADS)
Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain
2014-11-01
Multi-block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper elaborates on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid GPU/CPU cluster nodes. Discussion focuses on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion elaborates on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module.
A Survey on GPU-Based Implementation of Swarm Intelligence Algorithms.
Tan, Ying; Ding, Ke
2016-09-01
Inspired by the collective behavior of natural swarms, swarm intelligence algorithms (SIAs) have been developed and widely used for solving optimization problems. When applied to complex problems, a large number of fitness function evaluations are needed to obtain an acceptable solution. To tackle this vital issue, graphics processing units (GPUs) have been used to accelerate the optimization procedure of SIAs. Thanks to their inherent parallelism, SIAs are very suitable for parallel implementation on the GPU platform, which has achieved great success in recent years. This paper presents a comprehensive review of GPU-based parallel SIAs in accordance with a newly proposed taxonomy. Critical concerns for the efficient parallel implementation of SIAs are also described in detail. Moreover, novel criteria are proposed to evaluate and compare parallel implementations and algorithm performance universally. The rationality and practicability of the proposed optimization methodology and criteria are verified by careful case studies. Finally, our opinions and perspectives on the trends and prospects of this relatively new research domain are presented for future development. PMID:26571543
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
NASA Astrophysics Data System (ADS)
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes
NASA Astrophysics Data System (ADS)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
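The stream-processing design maps one GPU thread to one photon, each integrating its own geodesic. Purely as a hedged illustration of that pattern (not GRay's source code, and with a trivial flat-space right-hand side standing in for the Kerr geodesic equations), a classical RK4 step per photon looks like this:

#define NDIM 8   // 4 position + 4 momentum components per photon

// Placeholder right-hand side: flat spacetime, straight rays (dx/dl = p, dp/dl = 0).
// The real Kerr geodesic equations would replace this device function.
__device__ void geodesic_rhs(const float *s, float *ds)
{
    for (int i = 0; i < 4; ++i) {
        ds[i] = s[4 + i];
        ds[4 + i] = 0.0f;
    }
}

// One thread advances one photon by a single RK4 step of size h.
__global__ void integrate_photons(float *state, int n_photons, float h)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_photons) return;
    float *s = &state[p * NDIM];
    float k1[NDIM], k2[NDIM], k3[NDIM], k4[NDIM], tmp[NDIM];
    geodesic_rhs(s, k1);
    for (int i = 0; i < NDIM; ++i) tmp[i] = s[i] + 0.5f * h * k1[i];
    geodesic_rhs(tmp, k2);
    for (int i = 0; i < NDIM; ++i) tmp[i] = s[i] + 0.5f * h * k2[i];
    geodesic_rhs(tmp, k3);
    for (int i = 0; i < NDIM; ++i) tmp[i] = s[i] + h * k3[i];
    geodesic_rhs(tmp, k4);
    for (int i = 0; i < NDIM; ++i)
        s[i] += h / 6.0f * (k1[i] + 2.0f * (k2[i] + k3[i]) + k4[i]);
}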
Dr. Dale M. Snider
2011-02-28
This report gives the results of the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (general-purpose graphics processing units). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The test problem used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with the number of particles, giving greater than 12x speedup. Phase-1 work provided a path for data structure modifications that give good parallel performance while keeping a friendly environment for new physics development and code maintenance; the implementation of these data structure changes will be in Phase-2. Phase-1 laid the groundwork for the complete parallelization of Barracuda in Phase-2, with the caveat that the parallel programming practices implemented in Phase-1 already give an immediate speedup in the current serial Barracuda code. The Phase-1 tasks were completed successfully, laying the framework for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1, and because the advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are:
Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (see Section "Choosing a Function to Accelerate")
Task 2: Select a GPU consultant company and
Radial basis function networks GPU-based implementation.
Brandstetter, Andreas; Artusi, Alessandro
2008-12-01
Neural networks (NNs) have been used in several areas, showing their potential but also their limitations. One of the main limitations is the long time required for the training process; this is problematic when fast retraining is needed to respond to changes in the application domain. A possible way to accelerate the learning process of an NN is to implement it in hardware, but the high cost and reduced flexibility compared with the original central processing unit (CPU) implementation mean this solution is often not chosen. Recently, the power of commodity graphics processing units (GPUs) has increased, and they have started to be used in many applications. In particular, a kind of NN named the radial basis function network (RBFN) has been used extensively, proving its power. However, its limiting time performance reduces its application in many areas. In this brief paper, we describe a GPU implementation of the entire learning process of an RBFN, showing the ability to reduce the computational cost by about two orders of magnitude with respect to its CPU implementation.
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
NASA Astrophysics Data System (ADS)
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness describes the random or irregular micro-topography of a surface. The standard deviation of surface height and the surface correlation length describe the statistical variation of the random component of surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problems can be solved effectively using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to the data collected by a surface roughness tester based on the laser triangulation principle during a field experiment in April 2012. The total number of data points was 52,040. The computation took 47 seconds on a Fermi GTX 590 GPU, whereas the serial CPU version took 5422 seconds, a significant 115x speedup.
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters
NASA Astrophysics Data System (ADS)
Yang, Chao-Tung; Huang, Chih-Lin; Lin, Cheng-Fang
2011-01-01
Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions - a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.
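The loop-partitioning scheme is easy to picture in code. Below is a minimal sketch, under assumed names and sizes rather than the authors' actual code, of iterations split across MPI ranks with each rank launching a CUDA kernel over its chunk; the paper's approach additionally uses OpenMP on the CPU side.

    // Hybrid MPI + CUDA partitioning sketch (illustrative; SAXPY body,
    // sizes, and the one-GPU-per-rank mapping are assumptions).
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];      // one loop iteration per thread
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1 << 20;                  // total loop iterations
        int chunk = (N + size - 1) / size;      // iterations per rank
        int first = rank * chunk;
        int n = (first + chunk > N) ? N - first : chunk;  // this rank's share

        float *x, *y;                           // device buffers for this slice
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        // ... fill x and y for iterations [first, first + n) ...

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x); cudaFree(y);
        MPI_Finalize();
        return 0;
    }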
APES-based procedure for super-resolution SAR imagery with GPU parallel computing
NASA Astrophysics Data System (ADS)
Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao
2015-10-01
The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with the conventional fast Fourier transform (FFT), APES yields lower sidelobes and narrower spectral peaks. However, in large-scene synthetic aperture radar (SAR) imaging, the great amount of calculation makes it difficult to apply APES directly to super-resolution radar image processing without parallel computation. In this paper, a procedure is proposed to achieve target extraction and parallel computation of APES for super-resolution SAR imaging. Numerical experiments are carried out on a Tesla K40C with a 745 MHz GPU clock rate and 2880 CUDA cores. Results show that the GPU-parallel APES is remarkably more efficient than the CPU-based implementation at the same super-resolution.
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations using the spherical harmonic transform (SHT), in which a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters capable of computing at the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To optimize further, we investigate three different algorithms for the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU; thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth, and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU-enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
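The first of these strategies can be sketched compactly: with Legendre polynomials P_lm(θ_j) tabulated on the CPU and copied to the GPU once, the forward transform inside the time loop reduces to a quadrature sum per (l,m) mode. The kernel below is a hedged illustration under an assumed flat array layout, not Calypso's actual interface.

    // Forward Legendre transform sketch: one (l,m) mode per thread.
    __global__ void legendre_forward(int nth, int nmode,
                                     const double *P,    // P[lm * nth + j] = P_lm(theta_j)
                                     const double *g,    // grid values at theta points
                                     const double *w,    // Gauss-Legendre weights
                                     double *spec) {     // output spectral coefficients
        int lm = blockIdx.x * blockDim.x + threadIdx.x;
        if (lm >= nmode) return;
        double s = 0.0;
        for (int j = 0; j < nth; ++j)
            s += P[lm * nth + j] * g[j] * w[j];          // quadrature over colatitude
        spec[lm] = s;
    }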
Besozzi, Daniela; Pescini, Dario; Mauri, Giancarlo
2014-01-01
Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently of the others, a massive parallelization of tau-leaping can bring relevant reductions of the overall running time. The emerging field of general-purpose graphics processing unit (GPGPU) computing provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, fully exploiting Nvidia's Fermi GPU architecture. We show how a considerable computational speedup is achieved on the GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. PMID:24663957
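Because the runs are mutually independent, the natural GPU mapping is one simulation per thread. The sketch below illustrates that mapping for a single-species decay model; it is a toy example under assumed names, not the cuTauLeaping implementation, which further splits tau-leaping into separate phases and performs the leap-condition checks omitted here.

    #include <curand_kernel.h>

    // Toy tau-leaping kernel: thread id advances its own run of X -> 0 (rate c).
    __global__ void tau_leap(int nsim, int nsteps, float tau, float c,
                             unsigned long long seed, int *x) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id >= nsim) return;
        curandState st;
        curand_init(seed, id, 0, &st);        // independent stream per simulation
        int pop = x[id];
        for (int s = 0; s < nsteps; ++s) {
            float a = c * pop;                            // reaction propensity
            int k = (int)curand_poisson(&st, a * tau);    // firings in this leap
            pop = max(pop - k, 0);
            // a real simulator would verify the leap condition and fall back to SSA
        }
        x[id] = pop;
    }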
A new morphological anomaly detection algorithm for hyperspectral images and its GPU implementation
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2011-10-01
Anomaly detection is considered a very important task for hyperspectral data exploitation. It is now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we develop a new morphological algorithm for anomaly detection in hyperspectral images along with an efficient GPU implementation of the algorithm. The algorithm is implemented on latest-generation GPU architectures, and evaluated with regard to other anomaly detection algorithms using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex. The proposed GPU implementation achieves real-time performance in the considered case study.
GPU-based parallel method of temperature field analysis in a floor heater with a controller
NASA Astrophysics Data System (ADS)
Forenc, Jaroslaw
2016-06-01
A parallel method enabling acceleration of the numerical analysis of the transient temperature field in an air floor heating system is presented in this paper. An initial-boundary value problem of the heater regulated by an on/off controller is formulated. The analogue model is discretized using the implicit finite difference method. The BiCGStab method is used to solve the obtained system of equations. A computer program implementing simultaneous computations on CPU and GPU (GPGPU technology) was developed. The CUDA environment and linear algebra libraries (CUBLAS and CUSPARSE) are used by this program. The time of computations was reduced eight times in comparison with a program executed on the CPU only. Results of computations are presented in the form of time profiles and temperature field distributions. The influence of the model of the heat transfer coefficient on the simulation of the system operation was examined. The physical interpretation of the obtained results is also presented. Results of computations were verified by comparing them with solutions obtained with the commercial program COMSOL Multiphysics.
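As an indication of how such a solver maps onto the libraries named above, the fragment below sketches part of one BiCGStab iteration expressed with cuBLAS level-1 calls; spmv() is a hypothetical wrapper for a CUSPARSE sparse matrix-vector product, and the omega/t half of the iteration is omitted. It illustrates the general pattern, not the paper's program.

    #include <cublas_v2.h>

    void spmv(const double *x, double *y);   // assumed wrapper around a CUSPARSE SpMV

    // Part of one BiCGStab iteration; all vectors are device pointers of length n.
    void bicgstab_half_step(cublasHandle_t h, int n,
                            const double *r, const double *r0,
                            double *p, double *v, double *s, double *x,
                            double rho, double *alpha) {
        spmv(p, v);                                   // v = A * p
        double r0v;
        cublasDdot(h, n, r0, 1, v, 1, &r0v);          // (r0, v)
        *alpha = rho / r0v;
        cublasDcopy(h, n, r, 1, s, 1);                // s = r
        double na = -(*alpha);
        cublasDaxpy(h, n, &na, v, 1, s, 1);           // s = r - alpha * v
        cublasDaxpy(h, n, alpha, p, 1, x, 1);         // x = x + alpha * p
    }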
Synthetic transmit aperture technique in medical ultrasound imaging implemented on a GPU
NASA Astrophysics Data System (ADS)
Li, Ying; Chen, Xiaodong; Zhang, Chuang; Wang, Yi; Jiao, Zhihai; Yu, Daoyin
2014-11-01
In medical ultrasound imaging, the synthetic transmit aperture (STA) technique is very promising and has been a hot research topic. It is dynamically focused in both transmit and receive, yielding an improvement in resolution. But this imaging technique sets high demands on processing capabilities and makes implementation of a full STA system very challenging and costly. Many attempts have been made to reduce the demands on the system, making it a more realistic task to implement. In this paper we do not consider how to reduce the demands, but how to accelerate the processing speed of the system. The recent introduction of general-purpose graphics processing units (GPUs) seems quite promising in this view, especially for the affordable programming complexity. In this paper we explain the main computational features of the STA processing unit, trying to disclose the degree of parallelism in the operations. On the basis of the compute unified device architecture (CUDA) programming model and the extremely flexible structure of the Single Instruction Multiple Threads (SIMT) model, we show that the optimization of the STA processing unit can be performed more efficiently. The input data are read from Matlab, and the post-processing and display also use Matlab. Performance results show that, using a single NVIDIA GTX-650 GPU board, this amounts to a speedup of more than a factor of 30 compared to a highly optimized beamformer running on our test workstation with a 3.20-GHz Intel Core-i5 processor.
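The parallelism in STA beamforming is per image point: each point needs a transmit delay and a coherent sum over receive channels. The CUDA kernel below is an illustrative sketch of that delay-and-sum step under assumed geometry and array names (rf, pitch, fs, c0); it is not the authors' implementation.

    // Delay-and-sum for one transmit event: one image point per thread.
    __global__ void sta_das(const float *rf, int nchan, int nsamp,
                            float fs, float c0, float pitch,
                            const float2 *pts, int npts, float *img, float2 src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npts) return;
        float2 p = pts[i];
        float tx = hypotf(p.x - src.x, p.y - src.y) / c0;    // transmit delay
        float sum = 0.0f;
        for (int ch = 0; ch < nchan; ++ch) {
            float ex = (ch - 0.5f * (nchan - 1)) * pitch;    // element x position
            float rx = hypotf(p.x - ex, p.y) / c0;           // receive delay
            int s = (int)((tx + rx) * fs + 0.5f);
            if (s < nsamp) sum += rf[ch * nsamp + s];        // coherent sum
        }
        img[i] = sum;
    }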
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms
NASA Astrophysics Data System (ADS)
Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel
2016-04-01
Advances in graphics processing unit technology towards massively parallel architectures [1], comprising thousands of cores and multiples of parallel threads, provide the hardware foundation for the rapid processing of various parallel applications regarding seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are established and decades of data are compiled together [2]. Yet, many processes regarding seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using Cuda C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel comparatively, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, Cuda C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufman Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
GPU-accelerated Monte Carlo convolution∕superposition implementation for dose calculation
Zhou, Bo; Yu, Cedric X.; Chen, Danny Z.; Hu, X. Sharon
2010-01-01
Purpose: Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution∕superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution∕superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Methods: Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors’ GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. Results: A speedup in the range of 6.7–11.4× is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors’ GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. Conclusions: This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article. PMID:21158271
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2010-08-01
Automatic target and anomaly detection are considered very important tasks for hyperspectral data exploitation. These techniques are now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly and target detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we describe several new GPU-based implementations of target and anomaly detection algorithms for hyperspectral data exploitation. The parallel algorithms are implemented on latest-generation Tesla C1060 GPU architectures, and quantitatively evaluated using hyperspectral data collected by NASA's AVIRIS system over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex.
Development of magnetron sputtering simulator with GPU parallel computing
NASA Astrophysics Data System (ADS)
Sohn, Ilyoup; Kim, Jihun; Bae, Junkyeong; Lee, Jinpil
2014-12-01
Sputtering devices are widely used in the semiconductor and display panel manufacturing process. Currently, a number of surface treatment applications using magnetron sputtering techniques are being used to improve the efficiency of the sputtering process, through the installation of magnets outside the vacuum chamber. Within the internal space of the low pressure chamber, the plasma generated from the combination of a rarefied gas and an electric field evolves through mutually coupled interactions. Since the quality of the sputtering and the deposition rate on the substrate depend strongly on the multi-physical phenomena of the plasma regime, numerical simulations using PIC-MCC (Particle In Cell, Monte Carlo Collision) should be employed to develop an efficient sputtering device. In this paper, the development of a magnetron sputtering simulator based on the PIC-MCC method and the associated numerical techniques are discussed. To solve the electric field equations in the 2-D Cartesian domain, a Poisson equation solver based on the FDM (Finite Difference Method) is developed and coupled with the Monte Carlo Collision method to simulate the motion of gas particles influenced by an electric field. The magnetic field created by the permanent magnet installed outside the vacuum chamber is also numerically calculated using the Biot-Savart law. All numerical methods employed in the present PIC code are validated by comparison with analytical results and well-known commercial engineering software, with all of the results showing good agreement. Finally, the developed PIC-MCC code is parallelized for general-purpose computing on graphics processing unit (GPGPU) acceleration, so as to reduce the large computation time generally required for particle simulations. The efficiency and accuracy of the GPGPU-parallelized magnetron sputtering simulator are examined by comparison with the calculated results and computation times of the original serial code. It is found that
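The Biot-Savart step parallelizes naturally: each grid node accumulates field contributions from every discretized current element of the magnet. Below is a hedged sketch under an assumed current-element discretization (positions rs, elements I*dl); compiled with nvcc, float3 and make_float3 come from the CUDA headers.

    // dB = mu0/(4 pi) * (I dl x r) / |r|^3, summed over all source elements.
    __global__ void biot_savart(int nnode, int nelem,
                                const float3 *rn,       // field-point positions
                                const float3 *rs,       // source element positions
                                const float3 *Idl,      // current elements I*dl
                                float3 *B) {
        const float mu0_4pi = 1e-7f;                    // mu0 / (4 pi)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nnode) return;
        float3 b = make_float3(0.f, 0.f, 0.f);
        for (int k = 0; k < nelem; ++k) {
            float3 r = make_float3(rn[i].x - rs[k].x, rn[i].y - rs[k].y, rn[i].z - rs[k].z);
            float d2 = r.x*r.x + r.y*r.y + r.z*r.z + 1e-12f;   // avoid divide-by-zero
            float inv_d3 = rsqrtf(d2) / d2;                    // 1/|r|^3
            b.x += mu0_4pi * (Idl[k].y * r.z - Idl[k].z * r.y) * inv_d3;
            b.y += mu0_4pi * (Idl[k].z * r.x - Idl[k].x * r.z) * inv_d3;
            b.z += mu0_4pi * (Idl[k].x * r.y - Idl[k].y * r.x) * inv_d3;
        }
        B[i] = b;
    }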
Storage strategies of eddy-current FE-BI model for GPU implementation
NASA Astrophysics Data System (ADS)
Bardel, Charles; Lei, Naiguang; Udpa, Lalita
2013-01-01
In the past few years, graphics processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architectures. However, this comes at the cost of restructuring algorithms to match the strengths and drawbacks of the GPU architecture. A major drawback is the limited memory, and hence the storage of FE stiffness matrices on the GPU is important. In contrast to storage on the CPU, the GPU storage format has a significant influence on overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for eddy-current NDE applications on GPU architecture. Specifically, the high-dimensional matrices are manipulated by examining the matrix structure and optimally splitting it into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of a conventional CPU implementation to validate the method.
Prayudhatama, D.; Waris, A.; Kurniasih, N.; Kurniadi, R.
2010-06-22
In-core fuel management study is a crucial activity in nuclear power plant design and operation. Its central problem is to find an optimum arrangement of fuel assemblies inside the reactor core. The main objective of this activity is to reduce the cost of generating electricity, which can be done by altering several physical properties of the nuclear reactor without violating any of the constraints imposed by operational and safety considerations. This research addresses the nuclear fuel arrangement problem, which leads to a multi-objective optimization problem. However, the calculation of the reactor core physical properties is itself a heavy computation, which is an obstacle to solving the optimization problem with a genetic algorithm. This research addresses that problem by using the emerging General Purpose Computation on Graphics Processing Units (GPGPU) techniques, implemented in C for CUDA (Compute Unified Device Architecture) parallel programming. Using this parallel programming technique, we developed a parallelized nuclear reactor fitness calculation, which involves numerical finite difference computation. This paper describes the current prototype of the parallel algorithm code we have developed in CUDA, which performs one hundred finite difference calculations for nuclear reactor fitness evaluation in parallel on G9-series GPU hardware developed by NVIDIA.
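A minimal sketch of the parallel fitness evaluation idea: a single kernel launch sweeps a Jacobi-style finite difference update over many candidate loading patterns at once, here for an assumed one-group, 1-D diffusion model with zero-flux boundaries. The model and all names are illustrative assumptions, not the prototype's actual code.

    // Discretization of -D phi'' + Sigma_a phi = S with w = D/dx^2:
    // (2w + Sigma_a) phi_i = S_i + w (phi_{i-1} + phi_{i+1})
    __global__ void fd_sweep(int ncase, int nx, float dx,
                             const float *D, const float *siga, const float *src,
                             const float *phi, float *phi_new) {
        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index over all cases
        if (g >= ncase * nx) return;
        int c = g / nx, i = g % nx;                      // candidate core and node
        if (i == 0 || i == nx - 1) { phi_new[g] = 0.f; return; }  // boundary nodes
        int o = c * nx;
        float w = D[o + i] / (dx * dx);
        phi_new[g] = (src[o + i] + w * (phi[g - 1] + phi[g + 1]))
                     / (2.f * w + siga[o + i]);
    }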
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen E-mail: Xun.Jia@UTSouthwestern.edu Folkerts, Michael; Tan, Jun; Jia, Xun E-mail: Xun.Jia@UTSouthwestern.edu Jiang, Steve B. E-mail: Xun.Jia@UTSouthwestern.edu; Peng, Fei
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
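The data placement described above can be sketched in a few lines of host code. The helper below is an assumption-laden illustration, not the authors' implementation: it enables peer access among the four GPUs and converts one beam-angle group's row-sorted COO row indices into a CSR row-pointer array with the CUSPARSE routine cusparseXcoo2csr.

    #include <cuda_runtime.h>
    #include <cusparse.h>

    // Place one beam-angle submatrix on its GPU; d_cooRows must be row-sorted.
    // (In practice the CUSPARSE handle would be created after selecting the device.)
    void place_submatrix(cusparseHandle_t sp, int gpu, int nGpus,
                         const int *d_cooRows, int nnz, int nRows,
                         int *d_csrRowPtr) {
        cudaSetDevice(gpu);
        for (int peer = 0; peer < nGpus; ++peer)          // P2P for inter-GPU copies
            if (peer != gpu) cudaDeviceEnablePeerAccess(peer, 0);
        cusparseXcoo2csr(sp, d_cooRows, nnz, nRows,
                         d_csrRowPtr, CUSPARSE_INDEX_BASE_ZERO);
    }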
NASA Astrophysics Data System (ADS)
Kumar, B.; Dikshit, O.
2016-06-01
The extended morphological profile (EMP) is a good technique for extracting spectral-spatial information from images, but the large size of hyperspectral images is an important concern when creating EMPs. However, with the availability of modern multi-core processors and commodity parallel processing systems like graphics processing units (GPUs) at the desktop level, parallel computing provides a viable option to significantly accelerate such computations. In this paper, a parallel implementation of an EMP-based spectral-spatial classification method for hyperspectral imagery is presented. The parallel implementation is done both on a multi-core CPU and on a GPU. The impact of parallelization on speedup and classification accuracy is analyzed. For the GPU, the implementation is done in compute unified device architecture (CUDA) C. The experiments are carried out on two well-known hyperspectral images. It is observed from the experimental results that the GPU implementation provides a speedup of about 7 times, while the parallel implementation on the multi-core CPU results in a speedup of about 3 times. It is also observed that parallelization has no adverse impact on the classification accuracy.
Yang, Luobin; Chiu, Steve C; Liao, Wei-Keng; Thomas, Michael A
2014-10-01
Compared to Beowulf clusters and shared-memory machines, GPU and FPGA are emerging alternative architectures that provide massive parallelism and great computational capabilities. These architectures can be utilized to run compute-intensive algorithms to analyze ever-enlarging datasets and provide scalability. In this paper, we present four implementations of K-means data clustering algorithm for different high performance computing platforms. These four implementations include a CUDA implementation for GPUs, a Mitrion C implementation for FPGAs, an MPI implementation for Beowulf compute clusters, and an OpenMP implementation for shared-memory machines. The comparative analyses of the cost of each platform, difficulty level of programming for each platform, and the performance of each implementation are presented.
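For reference, the CUDA flavor of k-means centers on a label-assignment kernel like the minimal sketch below, with one data point per thread; the centroid update then follows as a reduction. This is a generic illustration, not the code of any of the four implementations.

    // Assign each point to its nearest centroid (squared Euclidean distance).
    __global__ void assign_labels(int n, int k, int dim,
                                  const float *pts, const float *cent, int *label) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int best = 0; float bestd = 3.4e38f;
        for (int c = 0; c < k; ++c) {
            float d = 0.f;
            for (int j = 0; j < dim; ++j) {
                float t = pts[i * dim + j] - cent[c * dim + j];
                d += t * t;
            }
            if (d < bestd) { bestd = d; best = c; }
        }
        label[i] = best;
    }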
SPILADY: A parallel CPU and GPU code for spin-lattice magnetic molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Ma, Pui-Wai; Dudarev, S. L.; Woo, C. H.
2016-10-01
Spin-lattice dynamics generalizes molecular dynamics to magnetic materials, where dynamic variables describing an evolving atomic system include not only coordinates and velocities of atoms but also directions and magnitudes of atomic magnetic moments (spins). Spin-lattice dynamics simulates the collective time evolution of spins and atoms, taking into account the effect of non-collinear magnetism on interatomic forces. Applications of the method include atomistic models for defects, dislocations and surfaces in magnetic materials, thermally activated diffusion of defects, magnetic phase transitions, and various magnetic and lattice relaxation phenomena. Spin-lattice dynamics retains all the capabilities of molecular dynamics, adding to them the treatment of non-collinear magnetic degrees of freedom. The spin-lattice dynamics time integration algorithm uses symplectic Suzuki-Trotter decomposition of atomic coordinate, velocity and spin evolution operators, and delivers highly accurate numerical solutions of dynamic evolution equations over extended intervals of time. The code is parallelized in coordinate and spin spaces, and is written in OpenMP C/C++ for CPU and in CUDA C/C++ for Nvidia GPU implementations. Temperatures of atoms and spins are controlled by Langevin thermostats. Conduction electrons are treated by coupling the discrete spin-lattice dynamics equations for atoms and spins to the heat transfer equation for the electrons. Worked examples include simulations of thermalization of ferromagnetic bcc iron, the dynamics of laser pulse demagnetization, and collision cascades. Catalogue identifier: AFAN_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AFAN_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Apache License, Version 2.0 No. of lines in distributed program, including test data, etc.: 1611165 No. of bytes in distributed program, including test data, etc.: 367246683
Pelletier, Mathew G.
2008-01-01
One of the main hurdles standing in the way of optimal cleaning of cotton lint is the lack of sensing systems that can react fast enough to provide the control system with real-time information on the level of trash contamination of the cotton lint. This research examines the use of programmable graphics processing units (GPUs) as an alternative to the PC's traditional use of the central processing unit (CPU). The use of the GPU as an alternative computation platform allowed the machine vision system to gain a significant improvement in processing time. By improving the processing time, this research seeks to address the lack of availability of rapid trash sensing systems and thus alleviate a situation in which the current systems view the cotton lint either well before, or after, the cotton is cleaned. This extended lag/lead time currently imposed on the cotton trash cleaning control systems is responsible for system operators utilizing a very large dead-band safety buffer to ensure that the cotton lint is not under-cleaned. Unfortunately, the utilization of a large dead-band buffer results in the majority of the cotton lint being over-cleaned, which in turn causes lint fiber damage as well as significant losses of the valuable lint due to the excessive use of cleaning machinery. This research estimates that upwards of a 30% reduction in lint loss could be gained through the use of a trash sensor tightly coupled to the cleaning machinery control systems. This research seeks to improve processing times through the development of a new algorithm for cotton trash sensing that allows for implementation on a highly parallel architecture. Additionally, by moving the new parallel algorithm onto an alternative computing platform, the graphics processing unit "GPU", for processing of the cotton trash images, a speedup of over 6.5 times over optimized code running on the PC's central processing unit "CPU" was gained. The new
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-01-01
Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686
NASA Astrophysics Data System (ADS)
Martin, Roland; Monteiller, Vadim; Komatitsch, Dimitri; Perrouty, Stéphane; Jessell, Mark; Bonvalot, Sylvain; Lindsay, Mark
2013-12-01
We solve the 3-D gravity inverse problem using a massively parallel voxel (or finite element) implementation on a hybrid multi-CPU/multi-GPU (graphics processing units/GPUs) cluster. This allows us to obtain information on density distributions in heterogeneous media with an efficient computational time. In a new software package called TOMOFAST3D, the inversion is solved with an iterative least-square or a gradient technique, which minimizes a hybrid L1-/L2-norm-based misfit function. It is drastically accelerated using either Haar or fourth-order Daubechies wavelet compression operators, which are applied to the sensitivity matrix kernels involved in the misfit minimization. The compression process behaves like a pre-conditioning of the huge linear system to be solved and a reduction of two or three orders of magnitude of the computational time can be obtained for a given number of CPU processor cores. The memory storage required is also significantly reduced by a similar factor. Finally, we show how this CPU parallel inversion code can be accelerated further by a factor between 3.5 and 10 using GPU computing. Performance levels are given for an application to Ghana, and physical information obtained after 3-D inversion using a sensitivity matrix with around 5.37 trillion elements is discussed. Using compression the whole inversion process can last from a few minutes to less than an hour for a given number of processor cores instead of tens of hours for a similar number of processor cores when compression is not used.
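One level of the Haar operator mentioned above is simple enough to sketch: each sensitivity-matrix row is folded into pairwise averages and details, and small details are then thresholded to sparsify the row. The kernel below is an illustrative sketch with an assumed layout (averages in the first half of the output row, details in the second), not the TOMOFAST3D code.

    // One Haar level over nrow rows of length len (len even); out != row.
    __global__ void haar_level(int nrow, int len, const float *row, float *out) {
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int half = len / 2;
        if (g >= nrow * half) return;
        int r = g / half, i = g % half;
        const float s = 0.70710678f;                    // 1/sqrt(2)
        float a = row[r * len + 2 * i];
        float b = row[r * len + 2 * i + 1];
        out[r * len + i]        = s * (a + b);          // average (coarse) part
        out[r * len + half + i] = s * (a - b);          // detail part, thresholded later
    }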
Performance analysis of parallel gravitational N-body codes on large GPU clusters
NASA Astrophysics Data System (ADS)
Huang, Si-Yi; Spurzem, Rainer; Berczik, Peter
2016-01-01
We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 – 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 – 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly.
GPU Developments for General Circulation Models
NASA Astrophysics Data System (ADS)
Appleyard, Jeremy; Posey, Stan; Ponder, Carl; Eaton, Joe
2014-05-01
Current trends in high performance computing (HPC) are moving towards the use of graphics processing units (GPUs) to achieve speedups through the extraction of fine-grain parallelism in application software. GPUs have been developed exclusively for computational tasks as massively-parallel co-processors to the CPU, and during 2013 an extensive set of new HPC architectural features was developed in a 4th generation of NVIDIA GPUs that provides further opportunities for GPU acceleration of the general circulation models used in climate science and numerical weather prediction. Today computational efficiency and simulation turnaround time continue to be important factors behind scientific decisions to develop models at higher resolutions and deploy increased use of ensembles. This presentation will examine the current state of GPU parallel developments for stencil-based numerical operations typical of dynamical cores, and introduce new GPU-based implicit iterative schemes with GPU-parallel preconditioning and linear solvers based on ILU, Krylov methods, and multigrid. Several GCMs show substantial gains in parallel efficiency from second-level fine-grain parallelism beneath first-level distributed-memory parallelism in a hybrid parallel implementation. Examples are provided relevant to science-scale HPC practice for CPU-GPU system configurations based on the model resolution requirements of a particular simulation. Performance results compare use of the latest conventional CPUs with and without GPU acceleration. Finally, a forward-looking discussion is provided on the roadmap of GPU hardware, software, tools, and programmability for GCM development.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
NASA Astrophysics Data System (ADS)
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used to improve computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, optimization of the shader design and load sharing between the vertex and fragment shaders was performed. The beamformed data were captured from a tissue-mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing the PSNR against a MATLAB gold standard with the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the MATLAB processing (i.e., PSNR<52.51 dB). Comparable CNR results were obtained from both processing methods (i.e., 11.31). With the mobile GPU implementation, a frame rate of 57.6 Hz was achieved. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Ronald Babich, Michael Clark, Balint Joo
2010-11-01
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
Park, Hyeong-Gyu; Shin, Yeong-Gil; Lee, Ho
2015-12-01
A ray-driven backprojector is based on ray-tracing, which computes the length of the intersection between the ray paths and each voxel to be reconstructed. To reduce the computational burden caused by these exhaustive intersection tests, we propose a fully graphics processing unit (GPU)-based ray-driven backprojector in conjunction with a ray-culling scheme that enables straightforward parallelization without compromising the high computing performance of a GPU. The purpose of the ray-culling scheme is to reduce the number of ray-voxel intersection tests by excluding rays irrelevant to a specific voxel computation. This rejection step is based on an axis-aligned bounding box (AABB) enclosing a region of voxel projection, where eight vertices of each voxel are projected onto the detector plane. The range of the rectangular-shaped AABB is determined by min/max operations on the coordinates in the region. Using the indices of pixels inside the AABB, the rays passing through the voxel can be identified and the voxel is weighted as the length of intersection between the voxel and the ray. This procedure makes it possible to reflect voxel-level parallelization, allowing an independent calculation at each voxel, which is feasible for a GPU implementation. To eliminate redundant calculations during ray-culling, a shared-memory optimization is applied to exploit the GPU memory hierarchy. In experimental results using real measurement data with phantoms, the proposed GPU-based ray-culling scheme reconstructed a volume of resolution 280×280×176 in 77 seconds from 680 projections of resolution 1024×768, which is 26 times and 7.5 times faster than standard CPU-based and GPU-based ray-driven backprojectors, respectively. Qualitative and quantitative analyses showed that the ray-driven backprojector provides high-quality reconstruction images when compared with those generated by the Feldkamp-Davis-Kress algorithm using a pixel-driven backprojector, with an average of 2.5 times
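The culling test itself reduces to projecting eight corners and taking min/max pixel coordinates. The device function below is a hedged sketch of that step; the 3x4 row-major projection matrix P and the detector parameterization are assumptions, not the paper's exact geometry.

    // AABB of a voxel's projection: rays outside [u0,u1]x[v0,v1] are culled.
    __device__ void voxel_aabb(const float *P,          // 3x4 projection matrix
                               float3 vmin, float3 vmax,
                               int *u0, int *u1, int *v0, int *v1) {
        float umin = 1e30f, umax = -1e30f, wmin = 1e30f, wmax = -1e30f;
        for (int c = 0; c < 8; ++c) {                   // 8 voxel corners
            float x = (c & 1) ? vmax.x : vmin.x;
            float y = (c & 2) ? vmax.y : vmin.y;
            float z = (c & 4) ? vmax.z : vmin.z;
            float d = P[8]*x + P[9]*y + P[10]*z + P[11];    // homogeneous divide
            float u = (P[0]*x + P[1]*y + P[2]*z + P[3]) / d;
            float w = (P[4]*x + P[5]*y + P[6]*z + P[7]) / d;
            umin = fminf(umin, u); umax = fmaxf(umax, u);
            wmin = fminf(wmin, w); wmax = fmaxf(wmax, w);
        }
        *u0 = (int)floorf(umin); *u1 = (int)ceilf(umax);    // pixel index range
        *v0 = (int)floorf(wmin); *v1 = (int)ceilf(wmax);
    }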
Cazzaniga, Paolo; Nobile, Marco S; Besozzi, Daniela; Bellini, Matteo; Mauri, Giancarlo
2014-01-01
The introduction of general-purpose Graphics Processing Units (GPUs) is boosting scientific applications in Bioinformatics, Systems Biology, and Computational Biology. In these fields, the use of high-performance computing solutions is motivated by the need to perform large numbers of in silico analyses to study the behavior of biological systems in different conditions, which requires a computing power that usually overtakes the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC), defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can increase the computational performance up to a 181× speedup compared to the corresponding sequential simulations.
Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; Kowalski, Karol
2011-05-10
The details of the graphical processing unit (GPU) implementation of the most computationally intensive (T) part of the recently introduced regularized CCSD(T) (Reg-CCSD(T)) method [ Kowalski , K. ; Valiev , M. J. Chem. Phys. 2009 , 131 , 234107 ] for calculating electronic energies of strongly correlated systems are discussed. Parallel tests performed for several molecular systems show very good scalability of the triples part of the Reg-CCSD(T) approach. We also discuss the performance of the Reg-CCSD(T) GPU implementation as a function of the parameters defining the partitioning of the spinorbital domain (tiling structure). The accuracy of the Reg-CCSD(T) method is illustrated on three examples: the methylfluoride molecule, dissociation of dodecane, and the open-shell Spiro cation (5,5'(4H,4H')-spirobi[cyclopenta[c]pyrrole] 2,2',6,6'-tetrahydro cation), which is a frequently used model to study electron transfer processes. It is demonstrated that a simple regularization of the cluster amplitudes used in the noniterative corrections accounting for the effect of triply excited configurations significantly improves the accuracy of ground-state energies in the presence of strong quasidegeneracy effects. For methylfluoride, we compare the Reg-CCSD(T) results with the CR-CC(2,3) and CCSDT energies, whereas for the Spiro cation we compare Reg-CCSD(T) results with the energies obtained with the completely renormalized CCSD(T) method. Performance tests for the Spiro, dodecane, and uracil molecules are also discussed. PMID:26610126
Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease
Shamonin, Denis P.; Bron, Esther E.; Lelieveldt, Boudewijn P. F.; Smits, Marion; Klein, Stefan; Staring, Marius
2013-01-01
Nonrigid image registration is an important, but time-consuming task in medical image analysis. In typical neuroimaging studies, multiple image registrations are performed, i.e., for atlas-based segmentation or template construction. Faster image registration routines would therefore be beneficial. In this paper we explore acceleration of the image registration package elastix by a combination of several techniques: (i) parallelization on the CPU, to speed up the cost function derivative calculation; (ii) parallelization on the GPU building on and extending the OpenCL framework from ITKv4, to speed up the Gaussian pyramid computation and the image resampling step; (iii) exploitation of certain properties of the B-spline transformation model; (iv) further software optimizations. The accelerated registration tool is employed in a study on diagnostic classification of Alzheimer's disease and cognitively normal controls based on T1-weighted MRI. We selected 299 participants from the publicly available Alzheimer's Disease Neuroimaging Initiative database. Classification is performed with a support vector machine based on gray matter volumes as a marker for atrophy. We evaluated two types of strategies (voxel-wise and region-wise) that heavily rely on nonrigid image registration. Parallelization and optimization resulted in an acceleration factor of 4–5x on an 8-core machine. Using OpenCL a speedup factor of 2 was realized for computation of the Gaussian pyramids, and 15–60 for the resampling step, for larger images. The voxel-wise and the region-wise classification methods had an area under the receiver operator characteristic curve of 88 and 90%, respectively, both for standard and accelerated registration. We conclude that the image registration package elastix was substantially accelerated, with nearly identical results to the non-optimized version. The new functionality will become available in the next release of elastix as open source under the BSD license
A 3D MPI-Parallel GPU-accelerated framework for simulating ocean wave energy converters
NASA Astrophysics Data System (ADS)
Pathak, Ashish; Raessi, Mehdi
2015-11-01
We present an MPI-parallel GPU-accelerated computational framework for studying the interaction between ocean waves and wave energy converters (WECs). The computational framework captures the viscous effects, nonlinear fluid-structure interaction (FSI), and breaking of waves around the structure, which cannot be captured in many potential flow solvers commonly used for WEC simulations. The full Navier-Stokes equations are solved using the two-step projection method, which is accelerated by porting the pressure Poisson equation to GPUs. The FSI is captured using the numerically stable fictitious domain method. A novel three-phase interface reconstruction algorithm is used to resolve three phases in a VOF-PLIC context. A consistent mass and momentum transport approach enables simulations at high density ratios. The accuracy of the overall framework is demonstrated via an array of test cases. Numerical simulations of the interaction between ocean waves and WECs are presented. Funding from the National Science Foundation CBET-1236462 grant is gratefully acknowledged.
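The GPU-ported pressure Poisson step can be illustrated with the simplest smoother. The kernel below sketches a 2-D Jacobi iteration for the pressure equation under assumed uniform grid spacing; the actual framework is 3-D, MPI-parallel, and uses a more capable solver.

    // One Jacobi sweep of the discrete Poisson equation, laplacian(p) = rhs:
    // p_new = (p_E + p_W + p_N + p_S - dx^2 * rhs) / 4 at interior nodes.
    __global__ void jacobi_poisson(int nx, int ny, float dx,
                                   const float *p, const float *rhs, float *pnew) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // skip boundary
        int id = j * nx + i;
        pnew[id] = 0.25f * (p[id + 1] + p[id - 1] + p[id + nx] + p[id - nx]
                            - dx * dx * rhs[id]);
    }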
A method of gravity and seismic sequential inversion and its GPU implementation
NASA Astrophysics Data System (ADS)
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method that inverts for density and velocity together. For the gravity inversion we use an iterative method based on a correlation imaging algorithm; for the seismic inversion we use full waveform inversion. The link between density and velocity is an empirical formula called the Gardner equation, and for large volumes of data we use the GPU to accelerate the computation. The gravity inversion method is iterative: we first calculate the correlation imaging of the observed gravity anomaly, which takes values between -1 and +1, and multiply it by a small density increment to form the initial density model. We then compute a forward result from this initial model, calculate the correlation imaging of the misfit between the observed and forward data, multiply it by a small density increment, and add it to the model; repeating this procedure yields the final inverted density model. For the seismic inversion we use a method based on the linearity of the acoustic wave equation written in the frequency domain; starting from an initial velocity model, we can obtain a good velocity result. In the sequential inversion of gravity and seismic data we need a formula to convert between density and velocity, and in our method we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphics Processing Unit (GPU) has been developed as a co-processor of the CPU for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of using traditional general-purpose GPUs while maintaining a low learning curve for programmers familiar with standard programming languages such as C. In our inversion processing
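The Gardner link is a one-line power law, rho = a·Vp^b; a common calibration is a = 0.31, b = 0.25 with Vp in m/s and rho in g/cm³ (the abstract does not state which constants are used). Under that assumption, Vp = 3000 m/s gives rho ≈ 0.31 × 3000^0.25 ≈ 2.29 g/cm³, and the conversion is trivially parallel:

    // Density from velocity via the Gardner relation; constants are assumed.
    __global__ void gardner_v2rho(int n, const float *vp, float *rho) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) rho[i] = 0.31f * powf(vp[i], 0.25f);   // rho = a * Vp^b
    }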
Redundancy computation analysis and implementation of phase diversity based on GPU
NASA Astrophysics Data System (ADS)
Zhang, Quan; Bao, Hua; Rao, Changhui; Peng, Zhenming
2015-10-01
The phase diversity method is used not only as an image restoration technique but also as a wavefront sensor. However, its computations have been perceived as too burdensome for real-time applications on a desktop computer platform. In this paper, an implementation of the phase diversity algorithm on the graphics processing unit (GPU) is presented. The redundant computations for the pupil function, point spread function, and optical transfer function are analyzed. Two GPU implementation methods are compared: a general method built on the GPU library CUFFT without precision loss (method-1), and one using our own custom FFT that exploits the redundant calculations at a small cost in precision (method-2). The results show that the cost and gradient functions are sped up by method-2 compared with method-1, and that the overhead of global memory access is reduced by kernel fusion. For an image of 256 × 256 with a sampling factor of 3, the results reveal that method-2 achieves a speedup of 1.83× compared with method-1 when the central 128 × 128 pixels of the point spread function are used.
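The kernel-fusion idea referenced above can be made concrete with a small, hypothetical example: computing the point spread function as the squared magnitude of a cuFFT output and applying the normalization in a single kernel, so the intermediate array never makes a round trip through global memory. The names and scale convention are ours, not the paper's.

    #include <cufft.h>

    // Fused elementwise pass (illustrative): PSF = |FFT(pupil)|^2 * scale.
    // Writing |.|^2 and the scaling as one kernel instead of two saves a
    // full read and write of the PSF array in global memory.
    __global__ void magsq_scale_fused(const cufftComplex *freq, float *psf,
                                      int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float re = freq[i].x, im = freq[i].y;
        psf[i] = (re * re + im * im) * scale;   // magnitude^2 and scaling fused
    }

This would be launched right after cufftExecC2C on the padded pupil function; with an n-point transform, scale = 1.0f/n recovers the usual FFT normalization.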
Implementing CLIPS on a parallel computer
NASA Technical Reports Server (NTRS)
Riley, Gary
1987-01-01
The C Language Integrated Production System (CLIPS) is a forward-chaining, rule-based language that provides training and delivery for expert systems. Conceptually, rule-based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large-grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.
GPU Accelerated Implementation of Density Functional Theory for Hybrid QM/MM Simulations.
Nitsche, Matías A; Ferreria, Manuel; Mocskos, Esteban E; González Lebrero, Mariano C
2014-03-11
Hybrid simulation tools (QM/MM) have evolved into a fundamental methodology for studying chemical reactivity in complex environments. This paper presents an implementation of electronic structure calculations based on density functional theory. This development is optimized for performing hybrid molecular dynamics simulations by making use of graphics processors (GPUs) for the most computationally demanding parts (exchange-correlation terms). The proposed implementation is able to take advantage of modern GPUs, achieving accelerations of 20 to 30 times over the CPU version in the relevant portions. The presented code was extensively tested, in terms of both numerical quality and performance, over systems of different size and composition. PMID:26580175
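As a toy illustration of why exchange-correlation terms map well to GPUs: in the simplest (LDA) approximation the exchange energy density, e_x = -(3/4)(3/π)^(1/3) ρ^(4/3), is evaluated independently at every point of the numerical integration grid. The sketch below is our simplified stand-in, not the functional or code used by the authors.

    // Illustrative kernel: LDA exchange energy density at each grid point,
    // weighted by the quadrature weight. Each thread handles one point,
    // so the work is embarrassingly parallel.
    __global__ void lda_exchange(const float *rho, const float *weight,
                                 float *ex, int npoints)
    {
        const float cx = -0.7385587663820224f; // -(3/4)*(3/pi)^(1/3)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npoints) return;
        ex[i] = weight[i] * cx * powf(rho[i], 4.0f / 3.0f);
    }

A host-side reduction over ex then gives the exchange contribution to the total energy; real GGA/hybrid functionals add density-gradient terms but keep the same point-parallel structure.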
Fast neuromimetic object recognition using FPGA outperforms GPU implementations.
Orchard, Garrick; Martin, Jacob G; Vogelstein, R Jacob; Etienne-Cummings, Ralph
2013-08-01
Recognition of objects in still images has traditionally been regarded as a difficult computational problem. Although modern automated methods for visual object recognition have achieved steadily increasing recognition accuracy, even the most advanced computational vision approaches are unable to obtain performance equal to that of humans. This has led to the creation of many biologically inspired models of visual object recognition, among them the hierarchical model and X (HMAX) model. HMAX is traditionally known to achieve high accuracy in visual object recognition tasks at the expense of significant computational complexity. Increasing complexity, in turn, increases computation time, reducing the number of images that can be processed per unit time. In this paper we describe how the computationally intensive and biologically inspired HMAX model for visual object recognition can be modified for implementation on a commercial field-programmable gate array (FPGA), specifically the Xilinx Virtex 6 ML605 evaluation board with XC6VLX240T FPGA. We show that with minor modifications to the traditional HMAX model we can perform recognition on images of size 128 × 128 pixels at a rate of 190 images per second with a less than 1% loss in recognition accuracy in both binary and multiclass visual object recognition tasks.
4D MR phase and magnitude segmentations with GPU parallel computing.
Bergen, Robert V; Lin, Hung-Yu; Alexander, Murray E; Bidinosti, Christopher P
2015-01-01
The increasing size and number of data sets of large four dimensional (three spatial, one temporal) magnetic resonance (MR) cardiac images necessitates efficient segmentation algorithms. Analysis of phase-contrast MR images yields cardiac flow information which can be manipulated to produce accurate segmentations of the aorta. Phase contrast segmentation algorithms are proposed that use simple mean-based calculations and least mean squared curve fitting techniques. The initial segmentations are generated on a multi-threaded central processing unit (CPU) in 10 seconds or less, though the computational simplicity of the algorithms results in a loss of accuracy. A more complex graphics processing unit (GPU)-based algorithm fits flow data to Gaussian waveforms, and produces an initial segmentation in 0.5 seconds. Level sets are then applied to a magnitude image, where the initial conditions are given by the previous CPU and GPU algorithms. A comparison of results shows that the GPU algorithm appears to produce the most accurate segmentation.
NASA Astrophysics Data System (ADS)
Xu, Zuwei; Zhao, Haibo; Zheng, Chuguang
2015-01-01
This paper proposes a comprehensive framework for accelerating population balance-Monte Carlo (PBMC) simulation of particle coagulation dynamics. By combining a Markov jump model, a weighted majorant kernel and GPU (graphics processing unit) parallel computing, a significant gain in computational efficiency is achieved. The Markov jump model constructs a coagulation-rule matrix of differentially-weighted simulation particles, so as to capture the time evolution of the particle size distribution with low statistical noise over the full size range and, as far as possible, to reduce the number of time loopings. Here three coagulation rules are highlighted, and it is found that constructing an appropriate coagulation rule provides a route to attain the compromise between accuracy and cost of PBMC methods. Further, in order to avoid double looping over all simulation particles when considering two-particle events (typically, particle coagulation), the weighted majorant kernel is introduced to estimate the maximum coagulation rates used for acceptance-rejection processes by single-looping over all particles, and meanwhile the mean time-step of the coagulation event is estimated by summing the coagulation kernels of rejected and accepted particle pairs. The computational load of these fast differentially-weighted PBMC simulations (based on the Markov jump model) is reduced greatly, to be proportional to the number of simulation particles in a zero-dimensional system (single cell). Finally, for a spatially inhomogeneous multi-dimensional (multi-cell) simulation, the proposed fast PBMC is performed in each cell, and multiple cells are processed in parallel by the many cores of a GPU that can execute massively threaded data-parallel tasks to obtain a remarkable speedup ratio (compared with CPU computation, the speedup ratio of GPU parallel computing is as high as 200 in a case of 100 cells with 10,000 simulation particles per cell). These accelerating approaches of PBMC are
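The single-loop majorant idea can be sketched under a simplifying assumption of ours: a coagulation kernel bounded by an additive majorant, K(u,v) <= A(u) + A(v). One pass over the particles then yields the total majorant rate with no double loop over pairs; candidate pairs drawn under the majorant are afterwards accepted with probability K(u,v)/(A(u)+A(v)). The kernel and the particular bound A(v) below are illustrative, not from the paper.

    // Illustrative single-loop pass (not the authors' code): each thread
    // adds its particle's contribution A(v_i) to a per-block partial sum,
    // so the additive majorant needs no pairwise double loop.
    // Launch with blockDim.x a power of two and shared memory of
    // blockDim.x * sizeof(float).
    __device__ float A(float v) { return powf(v, 2.0f / 3.0f); } // assumed bound

    __global__ void majorant_rate(const float *vol, float *partial, int n)
    {
        extern __shared__ float s[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? A(vol[i]) : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) partial[blockIdx.x] = s[0];   // reduced on host
    }

Rejected pairs are not wasted: as described above, their kernel values still enter the estimate of the mean time-step of the coagulation events.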
National Combustion Code: Parallel Implementation and Performance
NASA Technical Reports Server (NTRS)
Quealy, A.; Ryder, R.; Norris, A.; Liu, N.-S.
2000-01-01
The National Combustion Code (NCC) is being developed by an industry-government team for the design and analysis of combustion systems. CORSAIR-CCD is the current baseline reacting flow solver for NCC. This is a parallel, unstructured grid code which uses a distributed memory, message passing model for its parallel implementation. The focus of the present effort has been to improve the performance of the NCC flow solver to meet combustor designer requirements for model accuracy and analysis turnaround time. Improving the performance of this code contributes significantly to the overall reduction in time and cost of the combustor design cycle. This paper describes the parallel implementation of the NCC flow solver and summarizes its current parallel performance on an SGI Origin 2000. Earlier parallel performance results on an IBM SP-2 are also included. The performance improvements which have enabled a turnaround of less than 15 hours for a 1.3 million element fully reacting combustion simulation are described.
Implementation and performance of parallel Prolog interpreter
Wei, S.; Kale, L. V.; Balkrishna, R. (Dept. of Computer Science)
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE-OR process model, which exploits both AND and OR parallelism in logic programs. It is machine independent, as it runs on top of the chare-kernel, a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark programs on parallel machines, including shared memory systems (an Alliant FX/8, a Sequent, and a MultiMax) and a non-shared memory system (an Intel iPSC/32 hypercube), in addition to its performance on a multiprocessor simulation system.
Randomized selection on the GPU
Monroe, Laura Marie; Wendelberger, Joanne R; Michalak, Sarah E
2011-01-13
We implement here a fast and memory-sparing probabilistic top-N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic guess-and-check process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
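A hedged sketch of the guess-and-check primitive: the device counts how many elements exceed a guessed pivot, and the host uses that count to decide whether the guess brackets the Nth largest element. This is our minimal reconstruction of the idea, not the authors' implementation.

    // Count elements strictly greater than the guessed pivot. An atomicAdd
    // on a single counter is adequate for a sketch; a tuned version would
    // use a per-block reduction first.
    __global__ void count_above(const float *data, int n, float pivot,
                                unsigned int *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > pivot)
            atomicAdd(count, 1u);
    }

The host repeatedly guesses a pivot (e.g., a randomly sampled element), launches count_above, and shrinks the candidate interval until exactly N-1 elements lie above the pivot; randomization keeps the expected number of full passes over the data small.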
CULA: hybrid GPU accelerated linear algebra routines
NASA Astrophysics Data System (ADS)
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU, with an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N^3) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition, along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller-order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael
2012-06-01
We present l₁-SPIRiT, a simple algorithm for autocalibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
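The per-coefficient core of the iteration is simple enough to sketch: joint soft-thresholding scales each wavelet coefficient by (1 - λ/‖w‖)₊, where ‖w‖ is the magnitude of that coefficient taken jointly across all receive channels. The kernel below is our illustrative version, with hypothetical names and a channel-major layout assumption.

    #include <cuComplex.h>

    // Illustrative joint soft-thresholding over ncoil channels.
    // w holds ncoil * ncoeff complex wavelet coefficients, channel-major.
    __global__ void joint_soft_threshold(cuFloatComplex *w, int ncoeff,
                                         int ncoil, float lambda)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ncoeff) return;
        float norm2 = 0.0f;
        for (int c = 0; c < ncoil; ++c) {           // joint magnitude
            cuFloatComplex v = w[c * ncoeff + i];
            norm2 += v.x * v.x + v.y * v.y;
        }
        float norm = sqrtf(norm2);
        float shrink = (norm > lambda) ? (norm - lambda) / norm : 0.0f;
        for (int c = 0; c < ncoil; ++c) {           // scale all channels
            w[c * ncoeff + i].x *= shrink;
            w[c * ncoeff + i].y *= shrink;
        }
    }

Thresholding by the joint magnitude, rather than per channel, is what enforces the cross-channel joint sparsity described in the abstract.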
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
NASA Astrophysics Data System (ADS)
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes.
Catalogue identifier: AETA_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Lesser General Public License, version 3
No. of lines in distributed program, including test data, etc.: 225864
No. of bytes in distributed program, including test data, etc.: 46861049
Distribution format: tar.gz
Programming language: Python, CUDA C, OpenCL
Computer: Any with an OpenCL or CUDA-compliant GPU
Operating system: No limits (tested on Linux and Mac OS X)
RAM: Hundreds of megabytes to tens of gigabytes for typical cases
Classification: 12, 6.5
External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy
Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows
Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
3D Alternating Direction TV-Based Cone-Beam CT Reconstruction with Efficient GPU Implementation
Cai, Ailong; Zhang, Hanming; Li, Lei; Xi, Xiaoqi; Guan, Min; Li, Jianxin
2014-01-01
Iterative image reconstruction (IIR) with sparsity-exploiting methods, such as total variation (TV) minimization, claims potentially large reductions in sampling requirements. However, the computational complexity becomes a heavy burden, especially in 3D reconstruction situations. In order to improve the performance of iterative reconstruction, an efficient IIR algorithm for cone-beam computed tomography (CBCT) with a GPU implementation is proposed in this paper. First, an algorithm based on alternating direction total variation using local linearization and a proximity technique is proposed for CBCT reconstruction. The applied proximal technique avoids the expensive pseudoinverse computation of a large matrix, which makes the proposed algorithm applicable and efficient for CBCT imaging. The iteration in this algorithm is simple but convergent. The simulation and real CT data reconstruction results indicate that the proposed algorithm is both fast and accurate. The GPU implementation shows an excellent acceleration ratio of more than 100 compared with CPU computation, without losing numerical accuracy. The runtime for the new 3D algorithm is about 6.8 seconds per loop with an image size of 256 × 256 × 256 and 36 projections of size 512 × 512. PMID:25045400
Chen, Wenan; Ward, Kevin; Li, Qi; Kecman, Vojislav; Najarian, Kayvan; Menke, Nathan
2011-01-01
The coagulation and fibrinolytic systems are complex, inter-connected biological systems with major physiological roles. The complex, nonlinear, multi-point relationships between the molecular and cellular constituents of the two systems render a comprehensive and simultaneous study of the system at the microscopic and macroscopic level a significant challenge. We have created an Agent Based Modeling and Simulation (ABMS) approach for simulating these complex interactions. As the number of agents increases, the time complexity and cost of the resulting simulations present a significant challenge. As such, in this paper, we also present a high-speed framework for the coagulation simulation utilizing the computing power of graphics processing units (GPU). For comparison, we also implemented the simulations in NetLogo, Repast, and a direct C version. As our experiments demonstrate, at the million-agent scale the GPU implementation is over 10 times faster than the C version, over 100 times faster than the Repast version and over 300 times faster than the NetLogo simulation. PMID:22254271
Genetic Parallel Programming: design and implementation.
Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong
2006-01-01
This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.
SOAP3: ultra-fast GPU-based parallel alignment tool for short reads.
Liu, Chi-Man; Wong, Thomas; Wu, Edward; Luo, Ruibang; Yiu, Siu-Ming; Li, Yingrui; Wang, Bingqiang; Yu, Chang; Chu, Xiaowen; Zhao, Kaiyong; Li, Ruiqiang; Lam, Tak-Wah
2012-03-15
SOAP3 is the first short read alignment tool that leverages the multi-processors in a graphic processing unit (GPU) to achieve a drastic improvement in speed. We adapted the compressed full-text index (BWT) used by SOAP2 in view of the advantages and disadvantages of GPU. When tested with millions of Illumina Hiseq 2000 length-100 bp reads, SOAP3 takes < 30 s to align a million read pairs onto the human reference genome and is at least 7.5 and 20 times faster than BWA and Bowtie, respectively. For aligning reads with up to four mismatches, SOAP3 aligns slightly more reads than BWA and Bowtie; this is because SOAP3, unlike BWA and Bowtie, is not heuristic-based and always reports all answers.
GPU-Based Parallelized Solver for Large Scale Vascular Blood Flow Modeling and Simulations.
Santhanam, Anand P; Neylon, John; Eldredge, Jeff; Teran, Joseph; Dutson, Erik; Benharash, Peyman
2016-01-01
Cardiovascular blood flow simulations are essential in understanding blood flow behavior under normal and disease conditions. To date, such simulations have only been done at a macro-scale level due to computational limitations. In this paper, we present a GPU-based large-scale solver that enables modeling the flow even in the smallest arteries. A mechanical equivalent of the circuit-based flow modeling system is first developed to employ the GPU computing framework. Numerical studies were performed using a set of 10 million connected vascular elements. Run-time flow analyses were performed to simulate vascular blockages, as well as arterial cut-off. Our results showed that we can achieve ~100 FPS using a GTX 680m and ~40 FPS using a Tegra K1 computing platform. PMID:27046603
Implementation of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Schultz, Matthew; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of features make Java an attractive but debatable choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would move Java closer to Fortran in the competition for CFD applications.
NASA Technical Reports Server (NTRS)
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breathtaking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading-edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system-resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions.
NASA Astrophysics Data System (ADS)
Yuan, Jie; Xu, Guan; Yu, Yao; Zhou, Yu; Carson, Paul L.; Wang, Xueding; Liu, Xiaojun
2014-03-01
Photoacoustic tomography (PAT) offers structural and functional imaging of living biological tissue with highly sensitive optical absorption contrast and excellent spatial resolution comparable to medical ultrasound (US) imaging. We report the development of a fully integrated PAT and US dual-modality imaging system, which performs signal scanning, image reconstruction and display for both photoacoustic (PA) and US imaging, all in a truly real-time manner. The backprojection (BP) algorithm for PA image reconstruction is optimized to reduce the computational cost and facilitate parallel computation on a state-of-the-art graphics processing unit (GPU) card. For the first time, PAT and US imaging of the same object can be conducted simultaneously and continuously at a real-time frame rate, presently limited by the laser repetition rate of 10 Hz. Noninvasive PAT and US imaging of human peripheral joints in vivo was achieved, demonstrating the satisfactory image quality realized with this system. In another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/pat realtime .
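The parallel structure of the optimized BP algorithm can be conveyed with a simplified kernel in which each GPU thread reconstructs one pixel by summing delay-matched samples over all transducer positions. Geometry, interpolation, and weighting are reduced to their bare minimum here, and the names are ours, not the system's.

    // Simplified 2D photoacoustic backprojection sketch: one thread per pixel.
    // rf:            nchan x nsamp RF data, row-major
    // elem_x/elem_y: transducer element coordinates [m]
    // fs: sampling rate [Hz], c: sound speed [m/s], dx: pixel pitch [m]
    __global__ void pat_backproject(const float *rf, const float *elem_x,
                                    const float *elem_y, float *image,
                                    int nx, int ny, int nchan, int nsamp,
                                    float dx, float fs, float c)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= nx || py >= ny) return;
        float x = px * dx, y = py * dx, acc = 0.0f;
        for (int ch = 0; ch < nchan; ++ch) {
            float r = hypotf(x - elem_x[ch], y - elem_y[ch]);
            int s = (int)(r / c * fs + 0.5f);      // nearest-sample delay
            if (s < nsamp) acc += rf[ch * nsamp + s];
        }
        image[py * nx + px] = acc;
    }

Since every pixel is independent, the kernel saturates the GPU easily; a production version would add interpolated delays, solid-angle weighting, and texture-cached RF reads.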
Non-rigid multi-modal registration on the GPU
NASA Astrophysics Data System (ADS)
Vetter, Christoph; Guetter, Christoph; Xu, Chenyang; Westermann, Rüdiger
2007-03-01
Non-rigid multi-modal registration of images/volumes is becoming increasingly necessary in many medical settings. While efficient registration algorithms have been published, the speed of the solutions is a problem in clinical applications. Harnessing the computational power of the graphics processing unit (GPU) for general purpose computations has become increasingly popular in order to speed up algorithms further, but the algorithms have to be adapted to the data-parallel, streaming model of the GPU. This paper describes the implementation of a non-rigid, multi-modal registration using mutual information and the Kullback-Leibler divergence between observed and learned joint intensity distributions. The entire registration process is implemented on the GPU, including a GPU-friendly computation of two-dimensional histograms using vertex texture fetches, as well as an implementation of recursive Gaussian filtering on the GPU. Since the computation is performed on the GPU, interactive visualization of the registration process can be done without bus transfer between main memory and video memory. This allows the user to observe the registration process and to evaluate the result more easily. Two hybrid approaches distributing the computation between the GPU and CPU are discussed. The first approach uses the CPU for lower resolutions and the GPU for higher resolutions; the second approach uses the GPU to compute a first approximation to the registration that is used as the starting point for registration on the CPU using double precision. The results of the CPU implementation are compared to the different approaches using the GPU regarding speed as well as image quality. The GPU performs up to 5 times faster per iteration than the CPU implementation.
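The paper computes its two-dimensional joint histograms with vertex texture fetches, a technique tied to the graphics pipeline of that era; on a modern compute API the same step is usually written with atomic increments, as in this hypothetical CUDA equivalent of ours.

    // Modern-style joint histogram sketch (stands in for the vertex-texture
    // trick): each thread bins one voxel pair from the fixed/moving images.
    __global__ void joint_histogram(const unsigned char *fixed,
                                    const unsigned char *moving,
                                    unsigned int *hist /* 256*256, zeroed */,
                                    int nvoxels)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nvoxels) return;
        atomicAdd(&hist[fixed[i] * 256 + moving[i]], 1u);
    }

Mutual information then follows from the normalized joint histogram and its marginals, MI = sum over (f,m) of p(f,m) log[p(f,m) / (p(f) p(m))], computed on the host or in a small reduction kernel.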
NASA Astrophysics Data System (ADS)
Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
2016-02-01
In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. With this technology, the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as a case study. We implemented the flow accumulation step of this algorithm on the CPU, using OpenACC and in two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We find that although OpenACC cannot match the performance of a CUDA-optimized implementation (×3.5 slower on average), it provides a significant performance improvement over a CPU implementation (×2-6) with far simpler code and less implementation effort.
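For reference, the flow accumulation step that was ported has this general shape: given a D8 flow-direction grid, each cell's accumulation is one plus the accumulation of every neighbor that drains into it, iterated to a fixed point. The kernel below is a hedged sketch of one such sweep, not the paper's CUDA code.

    // One fixed-point sweep of D8 flow accumulation (illustrative sketch).
    // dir[c] holds the linear index of the cell that c drains to (-1 at
    // outlets). The host iterates sweeps, swapping acc_old/acc_new, until
    // the device-side 'changed' flag stays zero.
    __global__ void d8_accum_sweep(const int *dir, const float *acc_old,
                                   float *acc_new, int nx, int ny,
                                   int *changed)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;
        int idx = y * nx + x;
        float acc = 1.0f;                          // the cell's own unit area
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int xn = x + dx, yn = y + dy;
                if ((dx || dy) && xn >= 0 && xn < nx && yn >= 0 && yn < ny) {
                    int nidx = yn * nx + xn;
                    if (dir[nidx] == idx) acc += acc_old[nidx]; // upstream cell
                }
            }
        acc_new[idx] = acc;
        if (acc != acc_old[idx]) *changed = 1;
    }

An OpenACC port of the same loop nest needs only a couple of directives around the CPU code, which is precisely the trade-off the paper quantifies.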
NASA Astrophysics Data System (ADS)
Bauer, Petr; Klement, Vladimír; Oberhuber, Tomáš; Žabka, Vítězslav
2016-03-01
We present a complete GPU implementation of a geometric multigrid solver for the numerical solution of the Navier-Stokes equations for incompressible flow. The approximate solution is constructed on a two-dimensional unstructured triangular mesh. The problem is discretized by means of the mixed finite element method with semi-implicit timestepping. The linear saddle-point problem arising from the scheme is solved by the geometric multigrid method with a Vanka-type smoother. The parallel solver is based on the red-black coloring of the mesh triangles. We achieved a speed-up of 11 compared to a parallel (4 threads) code based on OpenMP and 19 compared to a sequential code.
Initial Characterization of Parallel NFS Implementations
Yu, Weikuan; Vetter, Jeffrey S
2010-01-01
Parallel NFS (pNFS) is touted as an emergent standard protocol for parallel I/O access in various storage environments. Several pNFS prototypes have been implemented for initial validation and protocol examination. Previous efforts have focused on realizing the pNFS protocol to expose the best bandwidth potential from underlying file and storage systems. In this presentation, we provide an initial characterization of two pNFS prototype implementations, lpNFS (a Lustre-based parallel NFS implementation) and spNFS (a reference implementation from Network Appliance, Inc.). We show that both lpNFS and spNFS can faithfully achieve the primary goal of pNFS, i.e., aggregating I/O bandwidth from many storage servers. However, they both face the challenge of scalable metadata management. In particular, the throughput of spNFS metadata operations degrades significantly with an increasing number of data servers. Even for the better-performing lpNFS, we discuss its architecture and propose a direct I/O request flow protocol to improve its performance.
Computation and parallel implementation for early vision
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
1990-01-01
The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) into image representations built out of such primitive visual features as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher-level vision tasks, including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU
NASA Astrophysics Data System (ADS)
Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.
2016-09-01
The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to a highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in forces computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.
Parallelizing Epistasis Detection in GWAS on FPGA and GPU-Accelerated Computing Systems.
González-Domínguez, Jorge; Wienbrandt, Lars; Kässens, Jan Christian; Ellinghaus, David; Schimmler, Manfred; Schmidt, Bertil
2015-01-01
High-throughput genotyping technologies (such as SNP-arrays) allow the rapid collection of up to a few million genetic markers of an individual. Detecting epistasis (based on 2-SNP interactions) in Genome-Wide Association Studies is an important but time consuming operation since statistical computations have to be performed for each pair of measured markers. Computational methods to detect epistasis therefore suffer from prohibitively long runtimes; e.g., processing a moderately-sized dataset consisting of about 500,000 SNPs and 5,000 samples requires several days using state-of-the-art tools on a standard 3 GHz CPU. In this paper, we demonstrate how this task can be accelerated using a combination of fine-grained and coarse-grained parallelism on two different computing systems. The first architecture is based on reconfigurable hardware (FPGAs) while the second architecture uses multiple GPUs connected to the same host. We show that both systems can achieve speedups of around four orders-of-magnitude compared to the sequential implementation. This significantly reduces the runtimes for detecting epistasis to only a few minutes for moderately-sized datasets and to a few hours for large-scale datasets. PMID:26451813
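The inner computation being parallelized is easy to state: for every pair of SNPs, fill a contingency table of the 3 × 3 joint genotype counts, separately for cases and controls, and score it. The fine-grained sketch below assigns one SNP pair per thread; it is our illustration of the idea, not code from either system described in the paper.

    // Illustrative epistasis kernel: one thread fills the 2 x 9 contingency
    // table for one SNP pair. geno is nsnp x nsample with genotypes coded
    // 0/1/2; pheno[s] is 0 (control) or 1 (case).
    __global__ void pair_tables(const unsigned char *geno,
                                const unsigned char *pheno,
                                int nsnp, int nsample,
                                const int2 *pairs, int npairs,
                                unsigned int *tables /* npairs * 18 */)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= npairs) return;
        unsigned int t[18] = {0};                          // registers/local
        const unsigned char *a = geno + (size_t)pairs[p].x * nsample;
        const unsigned char *b = geno + (size_t)pairs[p].y * nsample;
        for (int s = 0; s < nsample; ++s)
            ++t[pheno[s] * 9 + a[s] * 3 + b[s]];           // joint genotype bin
        for (int k = 0; k < 18; ++k)
            tables[(size_t)p * 18 + k] = t[k];
    }

A statistic (chi-squared, mutual information, etc.) is then computed per table; production tools also pack genotypes into bitvectors so the counting loop becomes population counts rather than byte reads.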
Chen, Guangye; Chacon, Luis; Barnes, Daniel C
2012-01-01
Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle orbit integrations from the field solver, while remaining fully self-consistent. This provides great flexibility, and dramatically improves the solver efficiency by reducing the degrees of freedom of the associated nonlinear system. However, it requires a particle push per nonlinear residual evaluation, which makes the particle push the most time-consuming operation in the algorithm. This paper describes a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm. The JFNK solver is kept on the CPU (in double precision), while the inherent data parallelism of the particle mover is exploited by implementing it in single precision on a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations are employed with the aid of an analytical performance model, the roofline model. Despite being highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations) on an Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm's maximum operational throughput, which neglects all latencies). This is about 200-300 times faster than an equivalent serial CPU implementation. When the single-precision GPU particle mover is combined with a double-precision CPU JFNK field solver, overall performance gains of about 100x vs. the double-precision CPU-only serial version are obtained, with no apparent loss of
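The flavor of the single-precision GPU mover can be given in a few lines: each CUDA thread advances one particle, gathering the field at the particle position and updating velocity and position. The sketch below is a much-simplified 1D electrostatic push with linear (CIC) interpolation and our own names; the actual mover is adaptive, sub-stepped, and charge-conserving.

    // Simplified 1D electrostatic particle push (illustrative), single
    // precision, one thread per particle. E is gathered with CIC weighting
    // on a periodic grid of nx cells with spacing dx.
    __global__ void push_particles(float *x, float *v, const float *E,
                                   int np, int nx, float dx, float qm,
                                   float dt)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= np) return;
        float xi = x[p] / dx;
        int   i  = (int)floorf(xi);
        float w  = xi - i;
        float Ep = (1.0f - w) * E[i % nx] + w * E[(i + 1) % nx]; // CIC gather
        float vn = v[p] + qm * Ep * dt;                          // accelerate
        float xn = x[p] + vn * dt;                               // move
        float L  = nx * dx;                                      // periodic wrap
        xn = xn - floorf(xn / L) * L;
        v[p] = vn; x[p] = xn;
    }

Because one such push runs inside every nonlinear residual evaluation, keeping it in fast single precision on the GPU while the JFNK solve stays in double precision on the CPU is exactly the mixed-precision split the paper advocates.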
On implementation of EM-type algorithms in the stochastic models for a matrix computing on GPU
Gorshenin, Andrey K.
2015-03-10
The paper discusses the main ideas of an implementation of EM-type algorithms for computing on graphics processors and their application to probabilistic models based on the Cox processes. An example of GPU-adapted MATLAB source code for finite normal mixtures with the expectation-maximization matrix formulas is given. The computational efficiency of the GPU vs. the CPU is illustrated for different sample sizes.
Solving global optimization problems on GPU cluster
NASA Astrophysics Data System (ADS)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in this research consists of using several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal GKLS test class on the Lobachevsky supercomputer using tens of thousands of GPU cores.
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data sets, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results show that the proposed method can improve on the performance of single-GPU BLASTP, and also achieve high availability and fault tolerance. PMID:24955362
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processing unit (GPU)-based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of the 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses the Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) a conventional C language program augmented by GPU CUDA and CULA routines (C GPU), and (2) a MATLAB program supported by the MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on the host CPU and the GPU system is presented for both implementations. The forward computation uses the finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time so achieved for one iteration of DOT reconstruction for 14610 elements is 0.52 seconds for the C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848
Scalable simulations for directed self-assembly patterning with the use of GPU parallel computing
NASA Astrophysics Data System (ADS)
Yoshimoto, Kenji; Peters, Brandon L.; Khaira, Gurdaman S.; de Pablo, Juan J.
2012-03-01
Directed self-assembly (DSA) patterning has been increasingly investigated as an alternative lithographic process for future technology nodes. One of the critical specifications for DSA patterning is the defects generated through the annealing process or by roughness of the pre-patterned structure. Due to their high sensitivity to the process and wafer conditions, however, characterization of those defects remains challenging. DSA simulations can be a powerful tool to predict the formation of DSA defects. In this work, we propose a new method to perform parallel computing of DSA Monte Carlo (MC) simulations. A consumer graphics card was used to access its hundreds of processing units for parallel computing. By partitioning the simulation system into non-interacting domains, we were able to run MC trial moves in parallel on multiple graphics processing units (GPUs). Our results show a significant improvement in computational performance.
NASA Astrophysics Data System (ADS)
Topping, T. Russell; French, James; Hancock, Monte F., Jr.
2010-04-01
Working with the Naval Research Laboratory, Celestech has implemented advanced non-linear hyperspectral image (HSI) processing algorithms optimized for Graphics Processing Units (GPU). These algorithms have demonstrated performance improvements of nearly 2 orders of magnitude over optimal CPU-based implementations. The paper briefly covers the architecture of the NVIDIA GPU to provide a basis for discussing GPU optimization challenges and strategies. The paper then covers optimization approaches employed to extract performance from the GPU implementation of Dr. Bachmann's algorithms, including memory utilization and process thread optimization considerations. The paper goes on to discuss strategies for deploying GPU-enabled servers into enterprise service oriented architectures. Also discussed is Celestech's ongoing work in the area of middleware frameworks to provide an optimized multi-GPU utilization and scheduling approach that supports both multiple GPUs in a single computer as well as across multiple computers. This paper is a complementary work to the paper submitted by Dr. Charles Bachmann entitled "A Scalable Approach to Modeling Nonlinear Structure in Hyperspectral Imagery and Other High-Dimensional Data Using Manifold Coordinate Representations". Dr. Bachmann's paper covers the algorithmic and theoretical basis for the HSI processing approach.
NASA Astrophysics Data System (ADS)
Gonzalez, C. M.; Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.
2010-12-01
As computational modeling became prolific throughout the physical sciences community, newer and more efficient ways of processing large amounts of data needed to be devised. One particular method arose in the form of using a graphics processing unit (GPU) for calculations. Computational scientists were attracted to the GPU as a computational tool as the performance, growth, and availability of GPUs increased over the past decade. Scientists began to utilize the GPU as the sole workhorse for their brute-force calculations and modeling. The GPUs, however, were not originally designed for this style of use. As a result, difficulty arose when trying to use the GPU from a scientific standpoint. A lack of parallel programming routines was the main culprit behind the difficulty of programming with a GPU, but with time and a rise in popularity, NVIDIA released a proprietary architecture named Fermi. The Fermi architecture, when used in conjunction with development tools such as CUDA, gave the programmer easier access to routines that made parallel programming with NVIDIA GPUs an ease. This new architecture enabled full access to faster memory, double-precision support, and large amounts of global memory. Our model was based on a second-order, spatially correct finite difference method and a third-order Runge-Kutta time-stepping scheme for studying the 2D Rayleigh-Benard problem. The code extensively used the CUBLAS routines to do the heavy linear algebra calculations. The calculations themselves were completed using a single GPU, the NVIDIA C2070 Fermi, which boasts 6 GB of global memory. The overall scientific goal of our work was to apply the Tesla C2070's computing potential to capture the onset of flow reversals as a function of increasingly large Rayleigh numbers. Previous investigations were successful using a smaller grid size of 1000x1999 and a Rayleigh number of 10^9. The
GPU-centric resolved-particle disperse two-phase flow simulation using the Physalis method
NASA Astrophysics Data System (ADS)
Sierakowski, Adam J.
2016-10-01
We present work on a new implementation of the Physalis method for resolved-particle disperse two-phase flow simulations. We discuss specifically our GPU-centric programming model that avoids all device-host data communication during the simulation. Summarizing the details underlying the implementation of the Physalis method, we illustrate the application of two GPU-centric parallelization paradigms and record insights on how to best leverage the GPU's prioritization of bandwidth over latency. We perform a comparison of the computational efficiency between the current GPU-centric implementation and a legacy serial-CPU-optimized code and conclude that the GPU hardware accounts for run time improvements up to a factor of 60 by carefully normalizing the run times of both codes.
Performance evaluation of image processing algorithms on the GPU.
Castaño-Díez, Daniel; Moser, Dominik; Schoenegger, Andreas; Pruggnaller, Sabine; Frangakis, Achilleas S
2008-10-01
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. In the meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are dependent on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code in the GPU achieves typical acceleration values in the order of 10-20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of the algorithm. The gained speed-up comes with no additional costs, since the software runs on the GPU of the graphics card of common workstations.
Architecting the Finite Element Method Pipeline for the GPU
Fu, Zhisong; Lewis, T. James; Kirby, Robert M.
2014-01-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87× speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51× versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers. PMID:25202164
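A minimal sketch of the on-GPU assembly stage: one thread computes a local element matrix and scatters it into the global system with atomic adds. Everything here (the names, the P1 Laplacian stiffness, and the dense global matrix used for brevity) is our hypothetical illustration of the general pattern, not the paper's data structure, which is sparse.

    // P1 stiffness of the Laplacian on a triangle with vertices
    // (xy[0],xy[1]), (xy[2],xy[3]), (xy[4],xy[5]): K_ab = (b_a b_b + c_a c_b)/(4A).
    __device__ void local_stiffness(const float *xy, float ke[3][3])
    {
        float bx[3] = { xy[3] - xy[5], xy[5] - xy[1], xy[1] - xy[3] };
        float by[3] = { xy[4] - xy[2], xy[0] - xy[4], xy[2] - xy[0] };
        float area2 = xy[0]*bx[0] + xy[2]*bx[1] + xy[4]*bx[2];  // 2*area
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                ke[a][b] = (bx[a]*bx[b] + by[a]*by[b]) / (2.0f * area2);
    }

    // One thread per element scatters its 3x3 local stiffness into a dense
    // global matrix with atomicAdd (a sparse format would be used in practice).
    __global__ void assemble(const int *conn /* nelem*3 */, const float *coords,
                             float *K /* nnode*nnode, zeroed */,
                             int nelem, int nnode)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nelem) return;
        int n[3] = { conn[3*e], conn[3*e+1], conn[3*e+2] };
        float xy[6], ke[3][3];
        for (int a = 0; a < 3; ++a) {
            xy[2*a]   = coords[2*n[a]];
            xy[2*a+1] = coords[2*n[a]+1];
        }
        local_stiffness(xy, ke);
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b)
                atomicAdd(&K[n[a] * nnode + n[b]], ke[a][b]);   // scatter-add
    }

Atomic scatter-add is the simplest way to resolve the write conflicts between elements sharing a node; coloring or per-row gathering are the usual contention-free alternatives.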
NASA Astrophysics Data System (ADS)
Imamura, N.; Schultz, A.
2015-12-01
Recently, a full waveform time domain solution has been developed for the magnetotelluric (MT) and controlled-source electromagnetic (CSEM) methods. The ultimate goal of this approach is to obtain a computationally tractable direct waveform joint inversion for source fields and earth conductivity structure in three and four dimensions. This is desirable on several grounds, including the improved spatial resolving power expected from use of a multitude of source illuminations of non-zero wavenumber, the ability to operate in areas of high levels of source signal spatial complexity and non-stationarity, etc. This goal would not be obtainable if one were to adopt the finite difference time-domain (FDTD) approach for the forward problem. This is particularly true for the case of MT surveys, since an enormous number of degrees of freedom is required to represent the observed MT waveforms across the large frequency bandwidth: for FDTD simulation, the smallest time steps must be fine enough to represent the highest frequency, while the number of time steps must also cover the lowest frequency. This leads to a linear system that is computationally burdensome to solve. We have implemented code that addresses this situation through the use of a fictitious wave domain method and GPUs to speed up the computation time. We also substantially reduce the size of the linear systems by applying concepts from successive cascade decimation, through quasi-equivalent time domain decomposition. By combining these refinements, we have made good progress toward implementing the core of a full waveform joint source field/earth conductivity inverse modeling method. We found that the use of the previous generation of CPU/GPU hardware speeds computations by an order of magnitude over a parallel CPU-only approach. In part, this arises from the use of the quasi-equivalent time domain decomposition, which shrinks the size of the linear system dramatically.
Implementation and performance of parallelized elegant.
Wang, Y.; Borland, M.; Accelerator Systems Division
2008-01-01
The program elegant is widely used for design and modeling of linacs for free-electron lasers and energy recovery linacs, as well as storage rings and other applications. As part of a multi-year effort, we have parallelized many aspects of the code, including single-particle dynamics, wakefields, and coherent synchrotron radiation. We report on the approach used for gradual parallelization, which proved very beneficial in getting parallel features into the hands of users quickly. We also report details of parallelization of collective effects. Finally, we discuss performance of the parallelized code in various applications.
Evaluation of likelihood functions on CPU and GPU devices
NASA Astrophysics Data System (ADS)
Jarp, Sverre; Lazzaro, Alfio; Leduc, Julien; Nowak, Andrzej; Sneen Lindal, Yngve
2012-06-01
We describe parallel implementations of an algorithm used to evaluate the likelihood function in data analysis. The implementations run, respectively, on CPU, on GPU, and on both devices cooperatively (hybrid). The CPU and GPU implementations are based on OpenMP and OpenCL, respectively. The hybrid implementation also allows the application to run on multi-GPU systems (not necessarily of the same type). The hybrid case uses a scheduler so that the workload needed for the evaluation of the function is split and balanced into corresponding sub-workloads to be executed in parallel on each device, i.e., CPU-GPU or multi-CPUs. We present the results of the scalability when running on CPU. Then we show the comparison of the performance of the GPU implementation on different hardware systems from different vendors, and the performance when running in the hybrid case. The tests are based on likelihood functions from real data analyses carried out in the high energy physics community.
GPU COMPUTING FOR PARTICLE TRACKING
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges from introducing the GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model to be able to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time consuming part of accelerator modeling and simulation, can benefit from the GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multi-processors (MPs) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
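The "one thread tracks one particle" idea is compact enough to sketch. Below, each thread iterates a simple one-turn map for its own initial condition and records whether the particle survives; the linear-rotation-plus-sextupole map is a stand-in of ours, not the actual Tracy++ lattice pass.

    // Illustrative dynamic-aperture kernel: thread p tracks particle p
    // through nturns of a toy one-turn map (linear phase advance mu plus
    // a thin sextupole kick of strength k2), flagging aperture losses.
    __global__ void track(const float *x0, const float *px0, int *survived,
                          int np, int nturns, float mu, float k2, float aper)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= np) return;
        float c = cosf(mu), s = sinf(mu);
        float x = x0[p], px = px0[p];
        int alive = 1;
        for (int t = 0; t < nturns && alive; ++t) {
            px += k2 * x * x;                  // thin sextupole kick
            float xn  =  c * x + s * px;       // linear one-turn rotation
            float pxn = -s * x + c * px;
            x = xn; px = pxn;
            if (fabsf(x) > aper) alive = 0;    // lost on the aperture
        }
        survived[p] = alive;
    }

Scanning a grid of initial amplitudes and reading back the survival flags traces out the dynamic aperture; since particles never interact, the problem is embarrassingly parallel, exactly as the abstract notes.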
NASA Astrophysics Data System (ADS)
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of its fragmented data exchange. Our approach makes 3D decomposition practical on distributed multi-GPU systems, and as a result it reduces the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to a common 2D-decomposition MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.
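To give a concrete picture of strategy (2), the sketch below packs one face of a subdomain into a contiguous buffer with a CUDA kernel, so a single MPI message can replace many strided copies; the field layout and names are assumptions for illustration, not the MGPU-MHD code itself.

    // Pack the x = i_face plane of an nx*ny*nz field (linearized as
    // i + nx*(j + ny*k)) into a contiguous halo buffer for one MPI send.
    __global__ void pack_x_face(const double *field, double *buf,
                                int nx, int ny, int nz, int i_face)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < ny && k < nz)
            buf[j + ny * k] = field[i_face + nx * (j + (size_t)ny * k)];
    }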
Parallel optimization algorithms and their implementation in VLSI design
NASA Technical Reports Server (NTRS)
Lee, G.; Feeley, J. J.
1991-01-01
Two new parallel optimization algorithms based on the simplex method are described. They may be executed on a SIMD parallel processor architecture and implemented in VLSI design. Several VLSI design implementations are introduced, and an application example is reported to demonstrate that the algorithms are effective.
GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid
NASA Astrophysics Data System (ADS)
Luo, Xisheng; Wang, Luying; Ran, Wei; Qin, Fenghua
2016-10-01
A GPU-accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, cell-based adaptive mesh refinement (AMR) is fully implemented on the GPU for the unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between GPU and CPU. Specifically, the AMR is processed with atomic operations to parallelize the list operations, and null memory recycling is realized to improve the efficiency of memory utilization. The results obtained on GPUs agree very well with the exact or experimental results in the literature. An acceleration ratio of 4 is obtained between the parallel code running on the older GT9800 GPU and the serial code running on an E3-1230 V2 CPU. With the optimization of configuring a larger L1 cache and adopting shared-memory-based atomic operations on the newer C2050 GPU, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR processes achieve a 2x speedup on the GT9800 and 18x on the Tesla C2050, which demonstrates that running the cell-based AMR method in parallel on the GPU is feasible and efficient. Our results also indicate that new developments in GPU architecture benefit fluid dynamics computing significantly.
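The use of atomic operations to parallelize list operations can be pictured with a small sketch: threads that flag a cell for refinement reserve a slot in a compact work list with atomicAdd. The names are illustrative, not the authors' code.

    // Each thread tests one cell; flagged cells are appended to a compact
    // refinement list through a single atomicAdd on a global counter.
    __global__ void build_refine_list(const float *error, float threshold,
                                      int n_cells, int *refine_list, int *counter)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= n_cells) return;
        if (error[c] > threshold) {
            int slot = atomicAdd(counter, 1);  // reserve a unique slot
            refine_list[slot] = c;
        }
    }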
Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware.
Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad; Shi, Lin; Li, Keqin
2015-01-01
Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how graphics processing units (GPUs), powered by the compute unified device architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU's capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of the limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested on three NVIDIA GPUs achieves a speedup of up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015. PMID:26357090
Plain Polynomial Arithmetic on GPU
NASA Astrophysics Data System (ADS)
Anisul Haque, Sardar; Moreno Maza, Marc
2012-10-01
As for serial code on CPUs, parallel code on GPUs for dense polynomial arithmetic relies on a combination of asymptotically fast and plain algorithms. Those are employed for data of large and small size, respectively. Parallelizing both types of algorithms is required in order to achieve peak performance. In this paper, we show that plain dense polynomial multiplication can be efficiently parallelized on GPUs. Remarkably, it outperforms (highly optimized) FFT-based multiplication up to degree 2¹², while on the CPU the same threshold is usually at 2⁶. We also report on a GPU implementation of the Euclidean algorithm which is both work-efficient and runs in linear time for input polynomials up to degree 2¹⁸, thus showing the performance of the GCD algorithm based on systolic arrays.
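Plain (schoolbook) multiplication parallelizes naturally because each output coefficient is an independent inner product; a minimal sketch of that idea, not the authors' implementation, is:

    // c = a * b with deg(a) = da, deg(b) = db: one thread per output
    // coefficient, c[k] = sum over i of a[i] * b[k - i].
    __global__ void poly_mul_plain(const double *a, const double *b, double *c,
                                   int da, int db)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k > da + db) return;
        double acc = 0.0;
        int lo = max(0, k - db), hi = min(k, da);
        for (int i = lo; i <= hi; ++i)
            acc += a[i] * b[k - i];
        c[k] = acc;
    }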
Design and implementation of a massively parallel version of DIRECT
He, J.; Verstak, A.; Watson, L.; Sosonkina, M.
2007-10-24
This paper describes several massively parallel implementations for a global search algorithm DIRECT. Two parallel schemes take different approaches to address DIRECT's design challenges imposed by memory requirements and data dependency. Three design aspects in topology, data structures, and task allocation are compared in detail. The goal is to analytically investigate the strengths and weaknesses of these parallel schemes, identify several key sources of inefficiency, and experimentally evaluate a number of improvements in the latest parallel DIRECT implementation. The performance studies demonstrate improved data structure efficiency and load balancing on a 2200 processor cluster.
Real-time GPU implementation of transverse oscillation vector velocity flow imaging
NASA Astrophysics Data System (ADS)
Bradway, David Pierson; Pihl, Michael Johannes; Krebs, Andreas; Tomov, Borislav Gueorguiev; Kjær, Carsten Straso; Nikolov, Svetoslav Ivanov; Jensen, Jørgen Arendt
2014-03-01
Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work, Open Computing Language (OpenCL) is used to estimate 2-D vector velocity flow in vivo in the carotid artery. Data are streamed live from a BK Medical 2202 Pro Focus UltraView scanner to a workstation running a research interface software platform. Processing data from a 50 millisecond frame of a duplex vector flow acquisition takes 2.3 milliseconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner's built-in processing, it enables new opportunities for prototyping novel algorithms, optimizing processing parameters, and accelerating the path from development lab to clinic.
Design and implementation of parallel multigrid algorithms
NASA Technical Reports Server (NTRS)
Chan, Tony F.; Tuminaro, Ray S.
1988-01-01
Techniques for mapping multigrid algorithms to solve elliptic PDEs on hypercube parallel computers are described and demonstrated. The need for proper data mapping to minimize communication distances is stressed, and an execution-time model is developed to show how algorithm efficiency is affected by changes in the machine and algorithm parameters. Particular attention is then given to the case of coarse computational grids, which can lead to idle processors, load imbalances, and inefficient performance. It is shown that convergence can be improved by using idle processors to solve a new problem concurrently on the fine grid defined by a splitting.
Advantages of GPU technology in DFT calculations of intercalated graphene
NASA Astrophysics Data System (ADS)
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphics-processing-unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). The use of GPGPUs as a way to increase computational power in the materials sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems, such as the finite-difference time-domain (FDTD) algorithm and density-functional theory (DFT), can benefit from GPU acceleration and parallelization, while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum ESPRESSO (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nano-scale, based on DFT, a plane-wave basis and a pseudopotential approach. Since version 5.0, a plug-in component for the standard QE packages has been available that allows the capabilities of Nvidia GPU graphics cards to be exploited (www.qe-forge.org/gf/proj). In this study, we have examined the impact of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene using the QE package PHonon, which employs the GPU. The term 'intercalation' refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
2012-01-01
Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate Arrays (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPUs and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for the CPU and GPU architectures. Conclusions: A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
Bethel, E. Wes
2012-01-06
This report explores using GPUs as a platform for performing high-performance medical image data processing, specifically smoothing using a 3D bilateral filter, which performs anisotropic, edge-preserving smoothing. The algorithm consists of running a specialized 3D convolution kernel over a source volume to produce an output volume. Overall, our objective is to understand which algorithmic design choices and configuration options lead to optimal performance of this algorithm on the GPU. We explore the performance impact of using different memory access patterns, different types of device/on-chip memories, strictly aligned and unaligned memory, and varying sizes/shapes of thread blocks. Our results reveal optimal configuration parameters for our algorithm when executed on a sample 3D medical data set, and show performance gains ranging from 30x to over 200x compared to a single-threaded CPU implementation.
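The kernel at the heart of this report can be sketched as one thread per output voxel, combining a spatial Gaussian with a range (intensity) Gaussian; this is a straightforward baseline under assumed parameter names, not the report's tuned implementation.

    // Edge-preserving 3D bilateral filter, one thread per voxel.
    // inv2_sigma_s2 = 1/(2*sigma_spatial^2), inv2_sigma_r2 = 1/(2*sigma_range^2).
    __global__ void bilateral3d(const float *src, float *dst,
                                int nx, int ny, int nz, int r,
                                float inv2_sigma_s2, float inv2_sigma_r2)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;
        size_t idx = x + nx * (y + (size_t)ny * z);
        float center = src[idx], num = 0.f, den = 0.f;
        for (int dz = -r; dz <= r; ++dz)
          for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                int xx = x + dx, yy = y + dy, zz = z + dz;
                if (xx < 0 || yy < 0 || zz < 0 || xx >= nx || yy >= ny || zz >= nz)
                    continue;
                float v  = src[xx + nx * (yy + (size_t)ny * zz)];
                float d2 = (float)(dx * dx + dy * dy + dz * dz);
                float dr = v - center;
                float w  = __expf(-d2 * inv2_sigma_s2 - dr * dr * inv2_sigma_r2);
                num += w * v; den += w;
            }
        dst[idx] = num / den;
    }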
Method for implementation of recursive hierarchical segmentation on parallel computers
NASA Technical Reports Server (NTRS)
Tilton, James C. (Inventor)
2005-01-01
A method, computer readable storage, and apparatus for implementing a recursive hierarchical segmentation algorithm on a parallel computing platform. The method includes setting a bottom level of recursion that defines where a recursive division of an image into sections stops dividing, and setting an intermediate level of recursion where the recursive division changes from a parallel implementation into a serial implementation. The segmentation algorithm is implemented according to the set levels. The method can also include setting a convergence check level of recursion with which the first level of recursion communicates with when performing a convergence check.
Yan, Hao; Shi, Feng; Jiang, Steve B.; Jia, Xun; Wang, Xiaoyu; Cervino, Laura; Bai, Ti; Folkerts, Michael
2014-11-01
Purpose: Compressed sensing (CS)-based iterative reconstruction (IR) techniques are able to reconstruct cone-beam CT (CBCT) images from undersampled noisy data, allowing for imaging dose reduction. However, there are a few practical concerns preventing the clinical implementation of these techniques. On the image quality side, data truncation along the superior–inferior direction under the cone-beam geometry produces severe cone artifacts in the reconstructed images. Ring artifacts are also seen in the half-fan scan mode. On the reconstruction efficiency side, the long computation time hinders clinical use in image-guided radiation therapy (IGRT). Methods: Image quality improvement methods are proposed to mitigate the cone and ring image artifacts in IR. The basic idea is to use weighting factors in the IR data fidelity term to improve projection data consistency with the reconstructed volume. In order to improve the computational efficiency, a multiple graphics processing units (GPUs)-based CS-IR system was developed. The parallelization scheme, detailed analyses of computation time at each step, their relationship with image resolution, and the acceleration factors were studied. The whole system was evaluated in various phantom and patient cases. Results: Ring artifacts can be mitigated by properly designing a weighting factor as a function of the spatial location on the detector. As for the cone artifact, without applying a correction method, it contaminated 13 out of 80 slices in a head-neck case (full-fan). Contamination was even more severe in a pelvis case under half-fan mode, where 36 out of 80 slices were affected, leading to poorer soft tissue delineation and reduced superior–inferior coverage. The proposed method effectively corrects those contaminated slices with mean intensity differences compared to FDK results decreasing from ∼497 and ∼293 HU to ∼39 and ∼27 HU for the full-fan and half-fan cases, respectively. In terms of efficiency boost
CFD Computations on Multi-GPU Configurations.
NASA Astrophysics Data System (ADS)
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations, often delivering a speedup factor of 3 to 5 times over conventional CPUs. Most PCs can now be supplied with multiple GPUs on a single motherboard, providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a setup configured to run two GPUs in parallel, and discuss its performance details.
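The shared-memory multi-GPU pattern described here can be driven from a single host thread by cycling cudaSetDevice over the cards; a minimal sketch under assumed names (solver_step, d_state) follows.

    // Launch one partition of the domain on each GPU, then synchronize.
    for (int dev = 0; dev < n_devices; ++dev) {
        cudaSetDevice(dev);
        solver_step<<<grid[dev], block, 0, stream[dev]>>>(d_state[dev], n_cells[dev]);
    }
    for (int dev = 0; dev < n_devices; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);  // all partitions complete here
    }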
Parallel implementation, validation, and performance of MM5
Michalakes, J.; Canfield, T.; Nanjundiah, R.; Hammond, S.; Grell, G.
1994-12-31
We describe a parallel implementation of the nonhydrostatic version of the Penn State/NCAR Mesoscale Model, MM5, that includes nesting capabilities. This version of the model can run on many different massively parallel computers (including a cluster of workstations). The model has been implemented and run on the IBM SP and Intel multiprocessors using a columnwise decomposition that supports irregularly shaped allocations of the problem to processors. This strategy will facilitate dynamic load balancing for improved parallel efficiency and promotes a modular design that simplifies the nesting problem. All data communication for finite differencing, inter-domain exchange of data, and I/O is encapsulated within a parallel library, RSL. Hence, there are no sends or receives in the parallel model itself. The library is generalizable to other, similar finite-difference approximation codes. The code is validated by comparing the rate of growth in error between the sequential and parallel models with the error growth rate when the sequential model input is perturbed to simulate floating-point rounding error. A series of runs on increasing numbers of parallel processors demonstrates that the parallel implementation is efficient and scalable to large numbers of processors.
ManyClaw: Implementation and Comparison of Intra-Node Parallelism of Clawpack
NASA Astrophysics Data System (ADS)
Terrel, A. R.; Mandli, K. T.
2012-12-01
Computational methods for geophysical phenomena have in the past seen tremendous increases in capability due to increased hardware performance and advances in the computational methods being used. In the past two decades these avenues for increasing computational power have become intertwined, as the most advanced computational methods must be written specifically for the hardware they will run on. This has become a growing concern for many scientists, as the effort needed to produce efficient codes on state-of-the-art machines has become increasingly large. Next-generation computer architectures will include an order of magnitude more intra-node parallelism. In this context, we have created ManyClaw, a project intended to explore the exploitation of intra-node parallelism in hyperbolic PDE solvers from the Clawpack software package. Clawpack uses a finite-volume wave-propagation approach to solving linear and non-linear hyperbolic conservation and balance laws. The basic computational units of Clawpack include the Riemann solver, limiters, and the cell update. The goal of ManyClaw is to implement each of these components in a number of different ways, exploring how best to exploit as much intra-node parallelism as possible. One result of this effort is that a number of design decisions in ManyClaw differ significantly from Clawpack. Although this was not unexpected, ensuring compatibility with the original code through PyClaw, a Python version of Clawpack, has also been undertaken. In this presentation we focus on a discussion of the scalability of various threading approaches in each of the basic computational units described above. We implemented a number of Riemann problems with different levels of arithmetic intensity via vectorized Fortran, OpenMP, TBB, and ISPC. The code was factored to allow any threading model to call any of the Riemann problems, limiters, and update routines. Results from Intel MIC and NVidia GPU implementations
Multi-GPU kinetic solvers using MPI and CUDA
NASA Astrophysics Data System (ADS)
Zabelok, Sergey; Arslanbekov, Robert; Kolobov, Vladimir
2014-12-01
This paper describes recent progress towards porting a Unified Flow Solver (UFS) to heterogeneous parallel computing. The main challenge of porting UFS to graphics processing units (GPUs) comes from the dynamically adapted mesh, which causes irregular data access. We describe the implementation of CUDA kernels for three modules in UFS: the direct Boltzmann solver using the discrete velocity method (DVM), the DSMC module, and the Lattice Boltzmann Method (LBM) solver, all using an octree Cartesian mesh with adaptive mesh refinement (AMR). Double-digit speedups on a single GPU and good scaling on multiple GPUs have been demonstrated.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU-enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on the computational task. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU-enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
High Performance GPU-Based Fourier Volume Rendering
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N² log N) time complexity, it provides a faster alternative to spatial-domain volume rendering algorithms, which are O(N³) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) has become an attractive, competent platform that can deliver enormous raw computational power compared to the central processing unit (CPU) on a per-dollar basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high-performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. By executing the rendering pipeline entirely on recent GPU architectures, this implementation can achieve a speedup of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together. PMID:25866499
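For the axis-aligned case, the projection-slice theorem reduces rendering to extracting the k_z = 0 plane of the precomputed 3D spectrum and applying a 2D inverse FFT; a minimal cuFFT-based sketch (assuming the spectrum is already on the device; all names are illustrative) is:

    #include <cufft.h>

    // Copy the k_z = 0 plane of the 3D spectrum (i + nx*(j + ny*k)) out as a
    // 2D slice; its 2D inverse FFT is the projection along z.
    __global__ void extract_kz0(const cufftComplex *spec3d, cufftComplex *slice,
                                int nx, int ny)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < nx && j < ny)
            slice[i + nx * j] = spec3d[i + nx * (size_t)j];  // k = 0 plane
    }

    void render_projection(const cufftComplex *d_spec3d, cufftComplex *d_slice,
                           int nx, int ny)
    {
        dim3 b(16, 16), g((nx + 15) / 16, (ny + 15) / 16);
        extract_kz0<<<g, b>>>(d_spec3d, d_slice, nx, ny);
        cufftHandle plan;
        cufftPlan2d(&plan, ny, nx, CUFFT_C2C);  // slowest-varying dim first
        cufftExecC2C(plan, d_slice, d_slice, CUFFT_INVERSE);
        cufftDestroy(plan);
    }

Oblique viewing directions additionally require interpolation in the frequency domain, which is where the real implementation effort lies.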
Priimak, Dmitri
2014-12-01
We present a finite difference numerical algorithm for solving the two-dimensional, spatially homogeneous Boltzmann transport equation that describes electron transport in a semiconductor superlattice subject to crossed time-dependent electric and constant magnetic fields. The algorithm is implemented both in C, targeted to the CPU, and in CUDA C, targeted to commodity NVidia GPUs. We compare the performance and merits of one implementation versus the other and discuss various software optimisation techniques.
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martin, Gabriel; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2016-04-01
In recent years, hyperspectral analysis has been applied in many remote sensing applications. In fact, hyperspectral unmixing has been a challenging task in hyperspectral data exploitation. This process consists of three stages: (i) estimation of the number of pure spectral signatures or endmembers, (ii) automatic identification of the estimated endmembers, and (iii) estimation of the fractional abundance of each endmember in each pixel of the scene. However, unmixing algorithms can be computationally very expensive, a fact that compromises their use in applications under real-time constraints. In recent years, several techniques have been proposed to solve this problem, but until now most works have focused on the second and third stages. The execution cost of the first stage is usually lower than that of the other stages, and it can even be optional if the estimate is known a priori; however, its acceleration on parallel architectures is still an interesting and open problem. In this paper we address this issue, focusing on the GENE algorithm, a promising geometry-based proposal introduced in [1]. We have evaluated our parallel implementation in terms of both accuracy and computational performance through Monte Carlo simulations for real and synthetic data experiments. Performance results on a modern GPU show satisfactory 16x speedup factors, which allow us to expect that this method could meet real-time requirements on a fully operational unmixing chain.
GPU-accelerated Monte Carlo simulation of particle coagulation based on the inverse method
NASA Astrophysics Data System (ADS)
Wei, J.; Kruis, F. E.
2013-09-01
Simulating particle coagulation using Monte Carlo methods is in general a challenging computational task due to its numerical complexity and computing cost. Currently, the lowest computing costs are obtained by applying a graphics processing unit (GPU), originally developed for speeding up graphics processing in the consumer market. In this article we present a GPU implementation of a Monte Carlo method based on the inverse scheme for simulating particle coagulation. The abundant data parallelism embedded within the Monte Carlo method is explained, as it allows an efficient parallelization of the MC code on the GPU. Furthermore, the computational accuracy of the MC on the GPU was validated against a benchmark, a CPU-based discrete-sectional method. To evaluate the performance gains of using the GPU, the computing time on the GPU was compared against its sequential counterpart on the CPU. The measured speedups show that the GPU can accelerate the execution of the MC code by a factor of 10-100, depending on the chosen number of simulation particles. The algorithm shows a linear dependence of computing time on the number of simulation particles, which is a remarkable result in view of the n² dependence of the coagulation.
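The inverse scheme picks the next coagulation event by inverting the cumulative distribution of event rates with a uniform random number; a minimal device-side sketch of that selection step (table layout and names are illustrative, not the paper's code) is:

    // Binary-search the cumulative rate table for the first entry >= u * total,
    // i.e. draw an event index from the discrete rate distribution.
    __device__ int sample_event(const float *cum_rates, int n_events, float u)
    {
        float target = u * cum_rates[n_events - 1];
        int lo = 0, hi = n_events - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cum_rates[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }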
Parallel implementation of an algorithm for Delaunay triangulation
NASA Technical Reports Server (NTRS)
Merriam, Marshal L.
1992-01-01
The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.
Parallel implementation of an algorithm for Delaunay triangulation
NASA Technical Reports Server (NTRS)
Merriam, Marshall L.
1992-01-01
This work concerns the theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128-processor MIMD computer. Tanemura's algorithm does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a conventional vector-processing supercomputer is therefore problematic. Efficient implementation on a parallel architecture is possible, however. In this work, speeds in excess of 8 times a single-processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.
Papadopoulos, Agathoklis; Kostoglou, Kyriaki; Mitsis, Georgios D; Theocharides, Theocharis
2015-01-01
The use of the GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in biomedical engineering and biology-related applications has shown promising results. GPU acceleration can be used to speed up computation-intensive models, such as the mathematical modeling of biological systems, which often requires nonlinear modeling approaches with a large number of free parameters. In this context, we developed a CUDA-enabled version of a model which implements a nonlinear identification approach that combines basis expansions and polynomial-type networks, termed Laguerre-Volterra networks, and can be used in diverse biological applications. The proposed software implementation uses the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of this modeling approach and execute the calculations on the GPU card of the host computer system. The initial results of the GPU-based model presented in this work show performance improvements over the original MATLAB model. PMID:26736993
Implementation of a portable and reproducible parallel pseudorandom number generator
Pryor, D.V.; Cuccaro, S.A.; Mascagni, M.; Robinson, M.L.
1994-12-31
The authors describe in detail the parallel implementation of a family of additive lagged-Fibonacci pseudorandom number generators. The theoretical structure of these generators is exploited to preserve their well-known randomness properties and to provide a parallel system of distinct cycles. The algorithm presented here solves the reproducibility problem for a far larger class of parallel Monte Carlo applications than has previously been possible. In particular, Monte Carlo applications that undergo "splitting" can be coded to be reproducible, independent both of the number of processors and of the execution order of the parallel processes. A library of portable C routines (available from the authors) that implements these ideas is also described.
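The additive lagged-Fibonacci recurrence itself is tiny; a serial C sketch with the common lags (17, 5) — illustrative parameters, not necessarily the library's — shows the state update that the parallel scheme replicates with a distinct cycle per process.

    #include <stdint.h>

    /* x[n] = (x[n-17] + x[n-5]) mod 2^32, kept in a circular buffer. */
    typedef struct { uint32_t state[17]; int pos; } alfg_t;

    uint32_t alfg_next(alfg_t *g)
    {
        const int r = 17, s = 5;
        int i = g->pos;                 /* slot holding x[n-r] */
        int j = (i + r - s) % r;        /* slot holding x[n-s] */
        g->state[i] += g->state[j];     /* unsigned wraparound gives mod 2^32 */
        g->pos = (i + 1) % r;
        return g->state[i];
    }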
NASA Astrophysics Data System (ADS)
Ouerhani, Y.; Jridi, M.; Alfalou, A.; Brosseau, C.
2013-02-01
The key outcome of this work is to propose and validate a fast and robust correlation scheme for face recognition applications. The robustness of this fast correlator is ensured by an adapted pre-processing step for the target image, allowing us to minimize the impact of its (possibly noisy and varying) amplitude spectrum information. A segmented composite filter is optimized, at the very outset of its fabrication, by weighting each reference with a specific coefficient proportional to its occurrence probability. A hierarchical classification procedure (called a two-level decision tree learning approach) is also used in order to speed up the recognition procedure. Experimental results validating our approach are obtained with a prototype based on a GPU implementation of the all-numerical correlator using the NVIDIA GeForce 8400GS processor and test samples from the Pointing Head Pose Image Database (PHPID); e.g., true recognition rates larger than 85% with a run time lower than 120 ms have been obtained using fixed images from the PHPID, and true recognition rates larger than 77% using a real video sequence at 2 frames per second when the database contains 100 persons. Moreover, it has been shown experimentally that a more recent GPU such as the NVIDIA Quadro FX 770M can perform recognition at 4 frames per second with a database of the same size.
On the design, analysis, and implementation of efficient parallel algorithms
Sohn, S.M.
1989-01-01
There is considerable interest in developing algorithms for a variety of parallel computer architectures. This is not a trivial problem, although for certain models great progress has been made. Recently, general-purpose parallel machines have become available commercially. These machines possess widely varying interconnection topologies and data/instruction access schemes. It is therefore important to develop methodologies and design paradigms not only for synthesizing parallel algorithms from initial problem specifications, but also for mapping algorithms between different architectures. This work has considered both of these problems. A systolic array consists of a large collection of simple processors that are interconnected in a uniform pattern. The author has studied in detail the problem of mapping systolic algorithms onto more general-purpose parallel architectures such as the hypercube. The hypercube architecture is notable for its symmetry and high connectivity, characteristics which are conducive to the efficient embedding of parallel algorithms. Although the parallel-to-parallel mapping techniques have yielded efficient target algorithms, it is not surprising that an algorithm designed directly for a particular parallel model would achieve superior performance. In this context, the author has developed hypercube algorithms for some important problems in speech and signal processing, text processing, language processing and artificial intelligence. These algorithms were implemented on a 64-node NCUBE/7 hypercube machine in order to evaluate their performance.
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
Geological Visualization System with GPU-Based Interpolation
NASA Astrophysics Data System (ADS)
Huang, L.; Chen, K.; Lai, Y.; Chang, P.; Song, S.
2011-12-01
There has been a large amount of research on using parallel GPUs to accelerate computation. In near-surface geology, efficient interpolation is critical for proper interpretation of measured data. Moreover, which interpolation method generates proper results depends on factors such as the density of the measured locations and the estimation model, so a fast interpolation process is needed to efficiently find a proper interpolation algorithm for a set of collected data. However, a general CPU framework has to process each computation sequentially and is not efficient enough to handle the large number of interpolations generally needed in near-surface geology. Careful observation of the interpolation process shows that the computation for each grid point is independent of all other computations. Therefore, the GPU parallel framework is an efficient technology to accelerate this interpolation process. Thus, in this paper we design a geological visualization system whose core includes a set of interpolation algorithms: nearest neighbor, inverse distance and Kriging. All these interpolation algorithms are implemented using both a CPU framework and a GPU framework. The comparison between the CPU and GPU implementations in terms of precision and processing speed shows that parallel computation can accelerate the interpolation process, and also demonstrates the possibility of using a GPU-equipped personal computer to replace an expensive workstation. Immediate updates at the measurement site are the dream of geologists. In the future, the parallel and remote computation ability of the cloud will be explored to make mobile computation at the measurement site possible.
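Of the three methods, inverse distance weighting shows the one-thread-per-grid-node pattern most plainly; a minimal CUDA sketch with power p = 2 (all names are illustrative, not the system's code) follows.

    // Each thread interpolates one grid node from the scattered measurements
    // (mx, my, mv) using inverse-distance-squared weights.
    __global__ void idw_grid(const float *mx, const float *my, const float *mv,
                             int n_meas, float *grid, int gx, int gy,
                             float x0, float y0, float dx, float dy)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= gx || j >= gy) return;
        float px = x0 + i * dx, py = y0 + j * dy;
        float num = 0.f, den = 0.f;
        for (int m = 0; m < n_meas; ++m) {
            float ex = px - mx[m], ey = py - my[m];
            float d2 = ex * ex + ey * ey;
            if (d2 < 1e-12f) { num = mv[m]; den = 1.f; break; }  // at a sample
            float w = 1.f / d2;
            num += w * mv[m]; den += w;
        }
        grid[i + gx * j] = num / den;
    }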
A portable implementation of ARPACK for distributed memory parallel architectures
Maschhoff, K.J.; Sorensen, D.C.
1996-12-31
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and the Message Passing Interface (MPI).
Implementation of the NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Frumkin, Michael A.; Schultz, Matthew; Jin, Haoqiang; Yan, Jerry; Biegel, Bryan (Technical Monitor)
2002-01-01
Several features make Java an attractive choice for High Performance Computing (HPC). In order to gauge the applicability of Java to Computational Fluid Dynamics (CFD), we have implemented the NAS (NASA Advanced Supercomputing) Parallel Benchmarks in Java. The performance and scalability of the benchmarks point out the areas where improvement in Java compiler technology and in Java thread implementation would position Java closer to Fortran in the competition for CFD applications.
FPGA-Based Filterbank Implementation for Parallel Digital Signal Processing
NASA Technical Reports Server (NTRS)
Berner, Stephan; DeLeon, Phillip
1999-01-01
One approach to parallel digital signal processing decomposes a high-bandwidth signal into multiple lower-bandwidth (rate) signals with an analysis bank. After processing, the subband signals are recombined into a fullband output signal by a synthesis bank. This paper describes an implementation of the analysis and synthesis banks using Field Programmable Gate Arrays (FPGAs).
Yi, Xi; Wang, Xin; Chen, Weiting; Wan, Wenbo; Zhao, Huijuan; Gao, Feng
2014-05-01
The common approach to diffuse optical tomography is to solve a nonlinear and ill-posed inverse problem using a linearized iteration process that involves repeated use of the forward and inverse solvers on an appropriately discretized domain of interest. This scheme normally brings severe computation and storage burdens to its application on large-sized tissues, such as breast tumor diagnosis and brain functional imaging, and prevents the use of matrix-fashioned linear inversions for improved image quality. To cope with these difficulties, we propose in this paper a parallelized full domain-decomposition scheme, which divides the whole domain into several overlapped subdomains and solves the corresponding subinversions independently within the framework of the Schwarz-type iterations, with the support of a combined multicore CPU and multithread graphics processing unit (GPU) parallelization strategy. The numerical and phantom experiments both demonstrate that the proposed method can effectively reduce the computation time and memory occupation for the large-sized problem and improve the quantitative performance of the reconstruction.
GPU-based video motion magnification
NASA Astrophysics Data System (ADS)
DomŻał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people to see otherwise invisible subtle changes in the surrounding world. VMM is also capable of hiding them with a modified version of the algorithm. For example, it is possible to magnify the motion related to the breathing of patients in a hospital in order to observe it, or to extinguish it and extract other information, such as blood flow, from the stabilized image sequence. In both cases we would like to perform the calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In this article we argue that the VMM algorithm can be parallelized (each thread processes one pixel), and to prove it we implemented the algorithm on the GPU using CUDA technology. The CPU is used only to grab, write, and display frames and to schedule work for the GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. The presented approach achieves a significant speedup over existing methods and allows VMM to process video in real time. This solution can be used as preprocessing for other algorithms in more complex systems, or can find application wherever real-time motion magnification is useful. It is worth mentioning that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
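The per-pixel independence that makes VMM parallelizable is visible in the amplification step: each output pixel is the input pixel plus a scaled band-passed temporal signal. A minimal sketch (the spatial decomposition and temporal filtering are assumed to have already produced `bandpassed`; names are illustrative) is:

    // One thread per pixel: out = frame + alpha * bandpassed. A negative
    // alpha attenuates, i.e. hides, the motion instead of magnifying it.
    __global__ void magnify(const float *frame, const float *bandpassed,
                            float *out, int n_pixels, float alpha)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < n_pixels)
            out[p] = frame[p] + alpha * bandpassed[p];
    }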
GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium
NASA Astrophysics Data System (ADS)
Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey
2013-12-01
This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes are assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing less than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].
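Energetic decomposition maps naturally onto the GPU: for a trial move of particle i, each thread evaluates one pair term and the block reduces the partial sums. A minimal sketch in reduced Lennard-Jones units follows (no periodic wrap, for brevity; launch with blockDim.x * sizeof(double) bytes of shared memory); it is an illustration, not the authors' code.

    __global__ void lj_energy_of_particle(const float4 *pos, int n, int i,
                                          float rcut2, double *block_sums)
    {
        extern __shared__ double s[];
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        double e = 0.0;
        if (j < n && j != i) {
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < rcut2) {
                double inv6 = 1.0 / ((double)r2 * r2 * r2);  // r^-6
                e = 4.0 * (inv6 * inv6 - inv6);              // 4(r^-12 - r^-6)
            }
        }
        s[threadIdx.x] = e;
        __syncthreads();
        for (int k = blockDim.x / 2; k > 0; k >>= 1) {       // tree reduction
            if (threadIdx.x < k) s[threadIdx.x] += s[threadIdx.x + k];
            __syncthreads();
        }
        if (threadIdx.x == 0) block_sums[blockIdx.x] = s[0];
    }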
GPU acceleration of time-domain fluorescence lifetime imaging
NASA Astrophysics Data System (ADS)
Wu, Gang; Nowotny, Thomas; Chen, Yu; Li, David Day-Uei
2016-01-01
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a graphic processing unit (GPU) based FLIM analysis tool suitable for high-speed, flexible time-domain FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. We demonstrate how to implement and optimize FLIM algorithms on GPUs for both iterative and noniterative FLIM analysis algorithms. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision, the GPU analysis can be up to 24-fold faster than its CPU-OpenMP counterpart. This means that even for high-precision but time-consuming iterative FLIM algorithms, GPUs enable fast or even real-time analysis.
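As an example of how a noniterative analysis maps to one thread per pixel, the classic rapid lifetime determination (RLD) estimates the lifetime from two equal-width time gates as tau = dt / ln(I1/I2); this sketch names RLD explicitly and is not necessarily one of the paper's algorithms.

    // Two-gate RLD: one thread per pixel, guarded against empty gates.
    __global__ void rld_lifetime(const float *gate1, const float *gate2,
                                 float *tau, int n_pixels, float dt)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n_pixels) return;
        float i1 = gate1[p], i2 = gate2[p];
        tau[p] = (i1 > 0.f && i2 > 0.f && i1 != i2) ? dt / __logf(i1 / i2) : 0.f;
    }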
Scalability and Portability of Two Parallel Implementations of ADI
NASA Technical Reports Server (NTRS)
Phung, Thanh; VanderWijngaart, Rob F.
1994-01-01
Two domain decompositions for the implementation of the NAS Scalar Penta-diagonal Parallel Benchmark on MIMD systems are investigated, namely transposition and multi-partitioning. Hardware platforms considered are the Intel iPSC/860 and Paragon XP/S-15, and clusters of SGI workstations on ethernet, communicating through PVM. It is found that the multi-partitioning strategy offers the kind of coarse granularity that allows scaling up to hundreds of processors on a massively parallel machine. Moreover, efficiency is retained when the code is ported verbatim (save message passing syntax) to a PVM environment on a modest size cluster of workstations.
NASA Astrophysics Data System (ADS)
Walsh, S. D.; Saar, M. O.; Bailey, P.; Lilja, D. J.
2008-12-01
Many complex natural systems studied in the geosciences are characterized by simple local-scale interactions that result in complex emergent behavior. Simulations of these systems, often implemented in parallel using standard CPU clusters, may be better suited to parallel processing environments with large numbers of simple processors. Such an environment is found in Graphics Processing Units (GPUs) on graphics cards. This presentation discusses graphics card implementations of three example applications from volcanology, seismology, and rock magnetics. These candidate applications involve important modeling techniques, widely employed in physical system simulation: 1) a multiphase lattice-Boltzmann code for geofluidic flows; 2) a spectral-finite-element code for seismic wave propagation simulations; and 3) a least-squares minimization code for interpreting magnetic force microscopy data. Significant performance increases, between one and two orders of magnitude, are seen in all three cases, demonstrating the power of graphics card implementations for these types of simulations.
Implementation of ADI schemes on MIMD parallel computers
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1993-01-01
In order to simulate the effects of the impingement of hot exhaust jets of high-performance aircraft on landing surfaces, a multi-disciplinary computation coupling flow dynamics to heat conduction in the runway needs to be carried out. Such simulations, which are essentially unsteady, require very large computational power in order to be completed within a reasonable time frame of the order of an hour. Such power can be furnished by the latest generation of massively parallel computers. These remove the bottleneck of ever more congested data paths to one or a few highly specialized central processing units (CPUs) by having many off-the-shelf CPUs work independently on their own data, exchanging information only when needed. During the past year the first phase of this project was completed, in which the optimal strategy for mapping an ADI algorithm for the three-dimensional unsteady heat equation to a MIMD parallel computer was identified. This was done by implementing and comparing three different domain decomposition techniques that define the tasks for the CPUs in the parallel machine. These implementations were done for a Cartesian grid and Dirichlet boundary conditions. The most promising technique was then used to implement the heat equation solver on a general curvilinear grid with a suite of nontrivial boundary conditions. Finally, this technique was also used to implement the Scalar Penta-diagonal (SP) benchmark, which was taken from the NAS Parallel Benchmarks report. All implementations were done in the programming language C on the Intel iPSC/860 computer.
Experience of implementing applicative parallelism on Cray X-MP
Lee, Ching-Cheng
1988-05-01
The traditional approach to Cray multiprocessing has been multitasking in FORTRAN environments with operating system support. This kind of approach usually requires explicit user intervention to restructure or reformulate the algorithms to control parallel execution. However, it is error-prone, and in most cases only coarse-grain parallelism, such as subroutines or functions, was explored. In this paper, an implementation of micro-tasking for an applicative language called SISAL on the Cray X-MP is reported. The experience indicates where the automatic approach to parallelization works well, as well as sources of inefficient behavior on the Cray X-MP. The experience also suggests some future improvements of SISAL multiprocessing on the Cray X-MP. 9 refs., 1 fig.
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
NASA Astrophysics Data System (ADS)
Lu, Liuyan; Lantz, Steven R.; Ren, Zhuyin; Pope, Stephen B.
2009-08-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f_mpi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
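The three fixed strategies differ only in how a chemistry task's destination rank is chosen; a schematic C sketch (the enum and signature are illustrative, not the x2f_mpi API) makes the contrast concrete.

    typedef enum { PLP, URAN, PREF } strategy_t;

    /* Where should the chemistry task for this particle be evaluated? */
    int destination(strategy_t s, long particle_id, int my_rank, int n_ranks)
    {
        switch (s) {
        case PLP:   /* purely local processing: no redistribution at all */
            return my_rank;
        case URAN:  /* uniform spreading (round-robin here) for load balance */
            return (int)(particle_id % n_ranks);
        case PREF:  /* preferential: stay local unless a heuristic predicts a
                       retrieve is more likely from another rank's ISAT table
                       (heuristic omitted in this sketch) */
        default:
            return my_rank;
        }
    }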
Comparison of CPU and GPU based coding on low-complexity algorithms for display signals
NASA Astrophysics Data System (ADS)
Richter, Thomas; Simon, Sven
2013-09-01
Graphics Processing Units (GPUs) are freely programmable, massively parallel, general-purpose processing units and thus offer the opportunity to off-load heavy computations from the CPU to the GPU. One application for GPU programming is image compression, where the massively parallel nature of GPUs promises high speed benefits. This article analyzes the predicaments of data-parallel image coding using the example of two high-throughput coding algorithms. The codecs discussed here were designed to answer a call from the Video Electronics Standards Association (VESA) and require only minimal buffering at the encoder and decoder side while avoiding any pixel-based feedback loops that would limit the operating frequency of hardware implementations. A comparison of CPU and GPU implementations of the codecs shows that GPU-based codecs are usually not considerably faster, or perform only with less-than-ideal rate-distortion performance. Analyzing the details of this result provides theoretical evidence that, for any coding engine, either parts of the entropy coding and bit-stream build-up must remain serial, or rate-distortion penalties must be paid when offloading all computations onto the GPU.
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
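To make the intrinsic-function idea concrete, here is a minimal illustrative sketch (ours, not the authors' actual API) of a dense matrix-vector product treated as a single recognized operation: the reverse pass only needs the closed-form adjoint rules xbar += A^T * ybar and Abar += ybar * x^T, so no per-element partial derivatives have to be taped.

#include <vector>
#include <cstddef>

// Forward intrinsic: y = A*x for a dense m-by-n matrix stored row-major.
// Because the whole product is recognized as one operation, the tape only
// records references to A and x, not m*n elementary partials.
void matvec(const std::vector<double>& A, const std::vector<double>& x,
            std::vector<double>& y, std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j) s += A[i*n + j] * x[j];
        y[i] = s;
    }
}

// Reverse pass for the same intrinsic: xbar += A^T * ybar, Abar += ybar * x^T.
// On a GPU both updates map onto the same optimized GEMV/GER-style kernels
// as the forward pass, which is the source of the reported memory savings.
void matvec_adjoint(const std::vector<double>& A, const std::vector<double>& x,
                    const std::vector<double>& ybar,
                    std::vector<double>& xbar, std::vector<double>& Abar,
                    std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            xbar[j]       += A[i*n + j] * ybar[i];
            Abar[i*n + j] += ybar[i] * x[j];
        }
}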
Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-01-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911
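The peer-to-peer transfers that make such a spatial decomposition efficient can be sketched with the CUDA runtime API as follows (a minimal two-GPU halo exchange under assumed buffer names; error handling omitted, and not the authors' actual code):

#include <cuda_runtime.h>

// Minimal two-GPU halo exchange. Each GPU owns a subvolume plus a ghost
// layer; after each timestep the boundary layers are swapped directly
// between device memories, bypassing the host.
void exchange_halos(const float* d_send0, float* d_recv0,   // buffers on GPU 0
                    const float* d_send1, float* d_recv1,   // buffers on GPU 1
                    size_t halo_bytes, cudaStream_t s0, cudaStream_t s1) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // done once per device pair in real code
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // GPU 0's boundary layer -> GPU 1's ghost region, and vice versa.
    cudaMemcpyPeerAsync(d_recv1, 1, d_send0, 0, halo_bytes, s0);
    cudaMemcpyPeerAsync(d_recv0, 0, d_send1, 1, halo_bytes, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
}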
Howison, Mark
2010-05-06
We compare the performance of hand-tuned CUDA implementations of bilateral and anisotropic diffusion filters for denoising 3D MRI datasets. Our tests sweep comparable parameters for the two filters and measure total runtime, memory bandwidth, computational throughput, and mean squared errors relative to a noiseless reference dataset.
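For reference, the core of a hand-tuned bilateral filter kernel looks roughly like this (a 2D sketch of our own for brevity; the benchmarked filters operate on 3D MRI volumes):

// One thread per pixel: weight neighbors by spatial distance and by
// intensity difference, so edges are preserved while noise is smoothed.
__global__ void bilateral2d(const float* in, float* out, int w, int h,
                            int r, float sigma_s, float sigma_r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float center = in[y * w + x];
    float sum = 0.0f, wsum = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at borders
            int yy = min(max(y + dy, 0), h - 1);
            float v = in[yy * w + xx];
            // spatial weight times range (intensity) weight
            float ws = expf(-(dx * dx + dy * dy) / (2.0f * sigma_s * sigma_s));
            float wr = expf(-(v - center) * (v - center) / (2.0f * sigma_r * sigma_r));
            sum  += ws * wr * v;
            wsum += ws * wr;
        }
    out[y * w + x] = sum / wsum;
}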
GPU-Powered Coherent Beamforming
NASA Astrophysics Data System (ADS)
Magro, A.; Adami, K. Zarb; Hickish, J.
2015-03-01
Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ˜1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
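The heart of a coherent beamformer is a phase-weighted sum over antennas per beam and time sample; a schematic CUDA sketch (our own simplification, not the deployed BEST-2 kernel):

#include <cuComplex.h>

// One thread per (beam, time sample): multiply each antenna voltage by a
// per-beam steering weight and accumulate coherently.
__global__ void beamform(const cuFloatComplex* volt,    // [nant][nsamp]
                         const cuFloatComplex* weight,  // [nbeam][nant]
                         cuFloatComplex* beam,          // [nbeam][nsamp]
                         int nant, int nsamp, int nbeam) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // time sample
    int b = blockIdx.y;                             // beam index
    if (t >= nsamp || b >= nbeam) return;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int a = 0; a < nant; ++a)
        acc = cuCaddf(acc, cuCmulf(weight[b * nant + a], volt[a * nsamp + t]));
    beam[b * nsamp + t] = acc;
}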
A GPU implementation of adaptive mesh refinement to simulate tsunamis generated by landslides
NASA Astrophysics Data System (ADS)
de la Asunción, Marc; Castro, Manuel J.
2016-04-01
In this work we propose a CUDA implementation for the simulation of landslide-generated tsunamis using a two-layer Savage-Hutter type model and adaptive mesh refinement (AMR). The AMR method consists of dynamically increasing the spatial resolution of the regions of interest of the domain while keeping the rest of the domain at low resolution, thus obtaining better runtimes and similar results compared to increasing the spatial resolution of the entire domain. Our AMR implementation uses a patch-based approach, it supports up to three levels, power-of-two ratios of refinement, different refinement criteria and also several user parameters to control the refinement and clustering behaviour. A strategy based on the variation of the cell values during the simulation is used to interpolate and propagate the values of the fine cells. Several numerical experiments using artificial and realistic scenarios are presented.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. Data compression methods (e.g., Joint Photographic Experts Group (JPEG)) are available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables parallel compression and decompression of data. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU-to-GPU data transfers are reduced without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved threefold, as compared with transferring the original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement for transferring data from the hardware front end to the software system but also reduces the transmission time for CPU-to-GPU data transfer.
Cell-based Adaptive Mesh Refinement on the GPU with Applications to Exascale Supercomputing
NASA Astrophysics Data System (ADS)
Trujillo, Dennis; Robey, Robert; Davis, Neal; Nicholaeff, David
2011-10-01
We present an OpenCL implementation of a cell-based adaptive mesh refinement (AMR) scheme for the shallow water equations. The challenges associated with ensuring the locality of algorithm architecture to fully exploit the massive number of parallel threads on the GPU are discussed. This includes a proof of concept that a cell-based AMR code can be effectively implemented, even on a small scale, in the memory and threading model provided by OpenCL. Additionally, the program requires dynamic memory in order to properly implement the mesh; as this is not supported in the OpenCL 1.1 standard, a combination of CPU memory management and GPU computation effectively implements a dynamic memory allocation scheme. Load balancing is achieved through a new stencil-based implementation of a space-filling curve, eliminating the need for a complete recalculation of the indexing on the mesh. A Cartesian grid hash table scheme to allow fast parallel neighbor accesses is also discussed. Finally, the relative speedup of the GPU-enabled AMR code is compared to the original serial version. We conclude that parallelization using the GPU provides significant speedup for typical numerical applications and is feasible for scientific applications in the next generation of supercomputing.
NASA Astrophysics Data System (ADS)
Chase, Patrick; Vondran, Gary
2011-01-01
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
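Tetrahedral interpolation selects one of six tetrahedra per lattice cell by ordering the three fractional coordinates, then accumulates three weighted corner differences; a minimal device-function sketch (illustrative only, without the paper's memory-layout tuning):

// Interpolate inside one lattice cell. c[8] holds the corner values with
// bit 0 = x, bit 1 = y, bit 2 = z; fx, fy, fz are the fractional coordinates.
__device__ float tetra3d(const float c[8], float fx, float fy, float fz) {
    float v[3] = {fx, fy, fz};
    int  id[3] = {1, 2, 4};                  // corner-index bit per axis
    // Sort fractions in decreasing order (picks one of six tetrahedra).
    for (int i = 0; i < 2; ++i)
        for (int j = i + 1; j < 3; ++j)
            if (v[j] > v[i]) {
                float tv = v[i]; v[i] = v[j]; v[j] = tv;
                int   ti = id[i]; id[i] = id[j]; id[j] = ti;
            }
    // Walk from corner 000 toward 111, adding axes by decreasing fraction.
    int idx = 0;
    float result = c[0];
    for (int k = 0; k < 3; ++k) {
        int next = idx | id[k];
        result += v[k] * (c[next] - c[idx]);
        idx = next;
    }
    return result;
}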
Massively Parallel Latent Semantic Analyses using a Graphics Processing Unit
Cavanagh, Joseph M; Cui, Xiaohui
2009-01-01
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large Term-Document datasets using Singular Value Decomposition. However, with the ever expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We present a parallel LSA implementation on the GPU, using NVIDIA Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms. The performance of this implementation is compared to traditional LSA implementation on CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000x1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits the GPU has for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary appreciably when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
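The LSA workload reduces largely to dense matrix products on the GPU. A minimal cuBLAS call for C = A·B (column-major, single precision; shown with the modern cublas_v2 interface rather than the original CUBLAS API used in the paper) illustrates why padding dimensions to multiples of 16 matters -- it feeds directly into the leading-dimension arguments:

#include <cublas_v2.h>

// Multiply an m-by-k matrix A by a k-by-n matrix B into C (all column-major,
// already resident on the GPU). Padding m, n, k up to multiples of 16 aligns
// accesses for the GPU memory subsystem, the effect measured above.
void gemm(cublasHandle_t h, const float* dA, const float* dB, float* dC,
          int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
}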
Parallel implementation of electronic structure energy, gradient, and Hessian calculations.
Lotrich, V; Flocke, N; Ponton, M; Yau, A D; Perera, A; Deumens, E; Bartlett, R J
2008-05-21
ACES III is a newly written program in which the computationally demanding components of the computational chemistry code ACES II [J. F. Stanton et al., Int. J. Quantum Chem. S26, 879 (1992); ACES II program system, University of Florida, 1994] have been redesigned and implemented in parallel. The high-level algorithms include Hartree-Fock (HF) self-consistent field (SCF), second-order many-body perturbation theory [MBPT(2)] energy, gradient, and Hessian, and coupled cluster singles, doubles, and perturbative triples [CCSD(T)] energy and gradient. For SCF, MBPT(2), and CCSD(T), both restricted HF and unrestricted HF reference wave functions are available. For MBPT(2) gradients and Hessians, a restricted open-shell HF reference is also supported. The methods are programmed in a special language designed for the parallelization project. The language is called super instruction assembly language (SIAL). The design uses an extreme form of object-oriented programming. All compute intensive operations, such as tensor contractions and diagonalizations, all communication operations, and all input-output operations are handled by a parallel program written in C and FORTRAN 77. This parallel program, called the super instruction processor (SIP), interprets and executes the SIAL program. By separating the algorithmic complexity (in SIAL) from the complexities of execution on computer hardware (in SIP), a software system is created that allows for very effective optimization and tuning on different hardware architectures with quite manageable effort. PMID:18500853
GPU-based cone-beam reconstruction using wavelet denoising
NASA Astrophysics Data System (ADS)
Jin, Kyungchan; Park, Jungbyung; Park, Jongchul
2012-03-01
The scattering noise artifact resulting from low-dose projections in repetitive cone-beam CT (CBCT) scans decreases image quality and lessens the accuracy of diagnosis. To improve the image quality of low-dose CT imaging, statistical filtering is effective for noise reduction. However, applying image filtering and enhancement throughout the entire reconstruction process is computationally demanding. The general reconstruction algorithm for CBCT data is filtered back-projection, which for a volume of 512×512×512 takes up to a few minutes on a standard system. To speed up reconstruction, the massively parallel architecture of the current graphics processing unit (GPU) is a platform suitable for acceleration of the mathematical calculation. In this paper, we focus on accelerating wavelet denoising and Feldkamp-Davis-Kress (FDK) back-projection using parallel processing on the GPU; we utilize the compute unified device architecture (CUDA) platform and implement CBCT reconstruction based on the CUDA technique. Finally, we evaluate our implementation on clinical tooth data sets. The resulting implementation of wavelet denoising is able to process a 1024×1024 image within 2 ms, excluding the data loading process, and our GPU-based CBCT implementation reconstructs a 512×512×512 volume from 400 projection data in less than 1 minute.
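The back-projection step typically assigns one thread per voxel, which accumulates weighted detector samples over all projections; a simplified circular-orbit sketch (nearest-neighbor sampling and our own geometry conventions, not the authors' exact kernel):

// One thread per voxel: rotate into each projection's source frame,
// find the detector pixel, and accumulate with the FDK distance weight.
__global__ void backproject(const float* proj,   // [nproj][nv][nu] filtered data
                            float* vol, int nx, int ny, int nz,
                            int nu, int nv, int nproj,
                            const float* cosa, const float* sina,
                            float sod, float sdd) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z;
    if (ix >= nx || iy >= ny || iz >= nz) return;
    float x = ix - 0.5f * nx, y = iy - 0.5f * ny, z = iz - 0.5f * nz;
    float acc = 0.0f;
    for (int p = 0; p < nproj; ++p) {
        float s =  x * cosa[p] + y * sina[p];   // toward the source
        float t = -x * sina[p] + y * cosa[p];   // lateral coordinate
        float w = sdd / (sod - s);              // magnification factor
        int u = (int)(t * w + 0.5f * nu);       // detector column
        int v = (int)(z * w + 0.5f * nv);       // detector row
        if (u >= 0 && u < nu && v >= 0 && v < nv)
            acc += w * w * proj[(p * nv + v) * nu + u];  // FDK weight
    }
    vol[(iz * ny + iy) * nx + ix] = acc;
}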
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
Computing 2D constrained delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges.
Renaud, M; Seuntjens, J; Roberge, D
2014-06-15
Purpose: Assessing the performance and uncertainty of a pre-calculated Monte Carlo (PMC) algorithm for proton and electron transport running on graphics processing units (GPU). While PMC methods have been described in the past, an explicit quantification of the latent uncertainty arising from recycling a limited number of tracks in the pre-generated track bank is missing from the literature. With a proper uncertainty analysis, an optimal pre-generated track bank size can be selected for a desired dose calculation uncertainty. Methods: Particle tracks were pre-generated for electrons and protons using EGSnrc and GEANT4, respectively. The PMC algorithm for track transport was implemented on the CUDA programming framework. GPU-PMC dose distributions were compared to benchmark dose distributions simulated using general-purpose MC codes in the same conditions. A latent uncertainty analysis was performed by comparing GPU-PMC dose values to a “ground truth” benchmark while varying the track bank size and primary particle histories. Results: GPU-PMC dose distributions and benchmark doses were within 1% of each other in voxels with dose greater than 50% of Dmax. In proton calculations, a submillimeter distance-to-agreement error was observed at the Bragg Peak. Latent uncertainty followed a Poisson distribution with the number of tracks per energy (TPE) and a track bank of 20,000 TPE produced a latent uncertainty of approximately 1%. Efficiency analysis showed a 937× and 508× gain over a single processor core running DOSXYZnrc for 16 MeV electrons in water and bone, respectively. Conclusion: The GPU-PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty below 1%. The track bank size necessary to achieve an optimal efficiency can be tuned based on the desired uncertainty. Coupled with a model to calculate dose contributions from uncharged particles, GPU-PMC is a candidate for inverse planning of modulated electron radiotherapy
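The core of a pre-calculated Monte Carlo transport kernel is to draw a random track from the pre-generated bank and replay it, depositing energy voxel by voxel; a schematic sketch under our own simplified data layout (the latent uncertainty discussed above stems from the finite, recycled track bank):

#include <curand_kernel.h>

struct TrackStep { float dx, dy, dz, edep; };   // one pre-computed step

// Each thread simulates one history by replaying a randomly chosen
// pre-generated track from the bank (the essence of track recycling).
__global__ void pmc_transport(const TrackStep* bank, const int* track_start,
                              const int* track_len, int ntracks,
                              float* dose, int nx, int ny, int nz, float vox,
                              unsigned long long seed, int nhist) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= nhist) return;
    curandState st;
    curand_init(seed, id, 0, &st);
    int tr = (int)(curand(&st) % (unsigned int)ntracks);   // recycle a track
    float x = 0.5f * nx * vox, y = 0.5f * ny * vox, z = 0.0f;  // source position
    for (int s = 0; s < track_len[tr]; ++s) {
        TrackStep ts = bank[track_start[tr] + s];
        x += ts.dx; y += ts.dy; z += ts.dz;
        int ix = (int)(x / vox), iy = (int)(y / vox), iz = (int)(z / vox);
        if (ix < 0 || ix >= nx || iy < 0 || iy >= ny || iz < 0 || iz >= nz) return;
        atomicAdd(&dose[(iz * ny + iy) * nx + ix], ts.edep);   // deposit energy
    }
}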
Implementation of an Eta Belt Domain on Parallel Systems
NASA Technical Reports Server (NTRS)
Kouatchou, Jules; Rancic, Miodrag; Norris, Peter; Geiger, Jim
2001-01-01
We extend the Eta weather model from a regional domain into a belt domain that does not require meridional boundary conditions. We describe how the extension is achieved and the parallel implementation of the code on the Cray T3E and the SGI Origin 2000. We validate the forecast results on the two platforms and examine how the removal of the meridional boundary conditions affects these forecasts. In addition, using several domains of different sizes and resolutions, we present the scaling performance of the code on both systems.
Implementation of SAR interferometric map generation using parallel processors
Doren, N.; Wahl, D.E.
1998-07-01
Interferometric fringe maps are generated by accurately registering a pair of complex SAR images of the same scene imaged from two very similar geometries, and calculating the phase difference between the two images by averaging over a neighborhood of pixels at each spatial location. The phase difference (fringe) map resulting from this IFSAR operation is then unwrapped and used to calculate the height estimate of the imaged terrain. Although the method used to calculate interferometric fringe maps is well known, it is generally executed in a post-processing mode well after the image pairs have been collected. In that mode of operation, there is little concern about algorithm speed and the method is normally implemented on a single processor machine. This paper describes how the interferometric map generation is implemented on a distributed-memory parallel processing machine. This particular implementation is designed to operate on a 16 node Power-PC platform and to generate interferometric maps in near real-time. The implementation is able to accommodate large translational offsets, along with a slight amount of rotation which may exist between the interferometric pair of images. If the number of pixels in the IFSAR image is large enough, the implementation achieves nearly linear speed-up with the addition of processors.
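The per-pixel IFSAR operation is an averaged conjugate product followed by a phase extraction; a minimal CUDA sketch of the fringe computation (window handling simplified, with registration assumed already done -- the original runs on a Power-PC cluster, so this is a translation of the idea, not the reported implementation):

#include <cuComplex.h>

// One thread per output pixel: average z1 * conj(z2) over a (2r+1)^2 window
// and take the argument, giving the interferometric phase difference.
__global__ void fringe(const cuFloatComplex* img1, const cuFloatComplex* img2,
                       float* phase, int w, int h, int r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);
            int yy = min(max(y + dy, 0), h - 1);
            acc = cuCaddf(acc, cuCmulf(img1[yy * w + xx],
                                       cuConjf(img2[yy * w + xx])));
        }
    phase[y * w + x] = atan2f(cuCimagf(acc), cuCrealf(acc));
}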
Massively parallel Wang-Landau sampling on multiple GPUs
Yin, Junqi; Landau, D. P.
2012-01-01
Wang-Landau sampling is implemented on the Graphics Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Performance on three different GPU cards, including the new-generation Fermi architecture card, is compared with that on a Central Processing Unit (CPU). The parameters for massively parallel Wang-Landau sampling are tuned in order to achieve fast convergence. For simulations of the water cluster systems, we obtain an average of over 50 times speedup for a given workload.
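The central Wang-Landau update is compact enough to sketch; in a massively parallel setting each walker thread applies it against a shared log-density-of-states table using atomics (a schematic device function of our own, with energies already mapped to bin indices):

#include <curand_kernel.h>

// Core Wang-Landau step for one proposed move, executed per walker thread.
// lnG is the running estimate of the log density of states, hist the visit
// histogram; both are shared by all walkers and updated atomically.
__device__ bool wl_accept(float* lnG, unsigned int* hist,
                          int e_old, int e_new, float lnf, curandState* st) {
    // Accept with probability min(1, g(E_old)/g(E_new)), done in log space.
    bool accept = logf(curand_uniform(st)) < (lnG[e_old] - lnG[e_new]);
    int e = accept ? e_new : e_old;
    atomicAdd(&lnG[e], lnf);    // g(E) <- g(E) * f
    atomicAdd(&hist[e], 1u);    // the flatness check reads this histogram
    return accept;
}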
GPU-based ultrafast IMRT plan optimization
NASA Astrophysics Data System (ADS)
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B.
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purposes. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm2 beamlet size and 2.5 × 2.5 × 2.5 mm3 voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy.
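The outer optimization loop is simple enough to sketch on the host, with the objective and gradient evaluations being the GPU-parallelized parts; a schematic of projected gradient descent with Armijo backtracking (function pointers stand in for the GPU-evaluated penalty objective; parameter names are ours):

#include <vector>
#include <algorithm>

// One outer iteration of projected gradient descent with Armijo's rule.
// F and gradF stand for the (GPU-evaluated) quadratic penalty objective and
// its gradient; the projection clamps beamlet intensities at zero.
double armijo_step(std::vector<double>& x,
                   double (*F)(const std::vector<double>&),
                   std::vector<double> (*gradF)(const std::vector<double>&),
                   double alpha0, double beta, double sigma) {
    std::vector<double> g = gradF(x);
    double f0 = F(x), alpha = alpha0;
    while (true) {
        std::vector<double> xt(x.size());
        double descent = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            xt[i] = std::max(0.0, x[i] - alpha * g[i]);  // project onto x >= 0
            descent += g[i] * (x[i] - xt[i]);            // g . (x - x_trial)
        }
        if (F(xt) <= f0 - sigma * descent) { x = xt; return alpha; }
        alpha *= beta;  // backtrack until sufficient decrease holds
    }
}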
A parallel implementation of the Wang Landau algorithm
NASA Astrophysics Data System (ADS)
Zhan, Lixin
2008-09-01
The Wang-Landau algorithm is a flat-histogram Monte Carlo method that performs random walks in the configuration space of a system to obtain a close estimation of the density of states iteratively. It has been applied successfully to many research fields. In this paper, we propose a parallel implementation of the Wang-Landau algorithm on computers with shared memory architectures by utilizing the OpenMP API for shared-memory multiprocessing. This implementation is applied to Ising model systems with promising speedups. We also examine the effects on the running speed when using different strategies in accessing the shared memory space during the updating procedure. The allowance of data races is recommended in consideration of the simulation efficiency. Such treatment does not affect the accuracy of the final density of states obtained.
NASA Astrophysics Data System (ADS)
Aalto, R. E.; Lauer, J. W.; Darby, S. E.; Best, J.; Dietrich, W. E.
2015-12-01
During glacial-marine transgressions vast volumes of sediment are deposited due to the infilling of lowland fluvial systems and shallow shelves, material that is removed during ensuing regressions. Modelling these processes would illuminate system morphodynamics, fluxes, and 'complexity' in response to base level change, yet such problems are computationally formidable. Environmental systems are characterized by strong interconnectivity, yet traditional supercomputers have slow inter-node communication -- whereas rapidly advancing Graphics Processing Unit (GPU) technology offers vastly higher (>100x) bandwidths. GULLEM (GpU-accelerated Lowland Landscape Evolution Model) employs massively parallel code to simulate coupled fluvial-landscape evolution for complex lowland river systems over large temporal and spatial scales. GULLEM models the accommodation space carved/infilled by representing a range of geomorphic processes, including: river & tributary incision within a multi-directional flow regime, non-linear diffusion, glacial-isostatic flexure, hydraulic geometry, tectonic deformation, sediment production, transport & deposition, and full 3D tracking of all resulting stratigraphy. Model results concur with the Holocene dynamics of the Fly River, PNG -- as documented with dated cores, sonar imaging of floodbasin stratigraphy, and the observations of topographic remnants from LGM conditions. Other supporting research was conducted along the Mekong River, the largest fluvial system of the Sunda Shelf. These and other field data provide tantalizing empirical glimpses into the lowland landscapes of large rivers during glacial-interglacial transitions, observations that can be explored with this powerful numerical model. GULLEM affords estimates for the timing and flux budgets within the Fly and Sunda Systems, illustrating complex internal system responses to the external forcing of sea level and climate. Furthermore, GULLEM can be applied to most ANY fluvial system to
SAR wind retrieval: from Singlecore to Multicore and GPU computing
NASA Astrophysics Data System (ADS)
Myasoedov, Alexander; Monzikova, Anna
The large spatial coverage and high resolution of spaceborne synthetic aperture radars (SAR) offer a unique opportunity to derive mesoscale wind fields over the ocean surface, providing high-resolution wind fields near the shore. On the other hand, due to the large size of SAR images, their processing can become a bottleneck when dealing with operational tasks or long-period statistical analyses. Algorithms for satellite image processing often offer many possibilities for parallelism (e.g., pixel-by-pixel processing), which makes them good candidates for execution on high-performance parallel computing hardware such as multicore CPUs and modern graphics processing units (GPUs). In this study we implement different SAR wind speed retrieval algorithms (e.g., CMOD4, CMOD5) for single-core and multicore systems, including GPUs. For this purpose both serial and parallelized versions of the CMOD algorithms were written in Matlab, Python, Cython and PyOpenCL. We apply these algorithms to an Envisat ASAR image and compare the results obtained with different versions of the algorithms executed on both an Intel CPU and a Tesla GPU. As a result of our experiments we not only show a speedup of up to 400 times for the GPU compared to the CPU but also report the time and effort required to write the same algorithm in the different programming languages. We hope that our experience will help other scientists get the most out of GPU/multicore computing.
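Wind retrieval parallelizes naturally across pixels: each thread inverts the geophysical model function (GMF) for its own normalized radar cross section. A schematic kernel follows; the real CMOD4/CMOD5 coefficient series are lengthy, so a monotone placeholder GMF stands in here, and all names are ours:

// Placeholder for a CMOD-style GMF: sigma0 as a function of wind speed,
// incidence angle, and wind direction relative to the radar look angle.
// The real CMOD4/CMOD5 functions use long tabulated coefficient series.
__device__ float gmf(float wind, float incidence, float phi) {
    return 0.001f * powf(wind, 1.5f) * (1.0f + 0.4f * cosf(phi)) / incidence;
}

// One thread per pixel: invert the GMF for the observed sigma0 by bisection,
// exploiting that the placeholder GMF is monotone in wind speed.
__global__ void retrieve_wind(const float* sigma0, const float* incidence,
                              const float* phi, float* wind, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float lo = 0.1f, hi = 50.0f;            // m/s search bracket
    for (int it = 0; it < 40; ++it) {
        float mid = 0.5f * (lo + hi);
        if (gmf(mid, incidence[i], phi[i]) < sigma0[i]) lo = mid; else hi = mid;
    }
    wind[i] = 0.5f * (lo + hi);
}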
NASA Astrophysics Data System (ADS)
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software to a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern, MPI-OpenMP-CUDA, that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain a speedup of 6.0 for the ADI solver on one Tesla M2050 GPU compared with two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.
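In the "one-thread-one-line" scheme, each thread runs a serial Thomas algorithm along its own grid line, so the many independent lines are solved in parallel; a minimal sketch (line-contiguous layout for clarity -- in practice the arrays would be transposed so consecutive threads make coalesced accesses):

// Solve nlines independent tridiagonal systems of size n, one per thread.
// a, b, c are the sub-, main- and super-diagonals; d holds the right-hand
// side and is overwritten with the solution (Thomas algorithm).
__global__ void thomas_lines(float* a, float* b, float* c, float* d,
                             int n, int nlines) {
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line >= nlines) return;
    size_t o = (size_t)line * n;            // offset of this thread's line
    for (int i = 1; i < n; ++i) {           // forward elimination
        float m = a[o + i] / b[o + i - 1];
        b[o + i] -= m * c[o + i - 1];
        d[o + i] -= m * d[o + i - 1];
    }
    d[o + n - 1] /= b[o + n - 1];
    for (int i = n - 2; i >= 0; --i)        // back substitution
        d[o + i] = (d[o + i] - c[o + i] * d[o + i + 1]) / b[o + i];
}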
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
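DRR generation amounts to casting one ray per detector pixel through the CT volume and accumulating attenuation along it; a schematic kernel with uniform stepping and nearest-neighbor sampling (the kind of approximation the paper trades against accuracy; positions are in voxel units and all names are ours):

// One thread per detector pixel: march from the X-ray source toward the
// pixel, summing attenuation, then apply Beer-Lambert to get intensity.
__global__ void drr(const float* ct, int nx, int ny, int nz,
                    float* image, int iw, int ih,
                    float3 src, float3 det_origin, float3 du, float3 dv,
                    float step, int nsteps) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u >= iw || v >= ih) return;
    float3 pix = make_float3(det_origin.x + u * du.x + v * dv.x,
                             det_origin.y + u * du.y + v * dv.y,
                             det_origin.z + u * du.z + v * dv.z);
    float3 dir = make_float3(pix.x - src.x, pix.y - src.y, pix.z - src.z);
    float len = sqrtf(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    dir.x /= len; dir.y /= len; dir.z /= len;
    float acc = 0.0f;
    for (int s = 0; s < nsteps; ++s) {      // uniform ray marching
        float t = s * step;
        int ix = (int)(src.x + t * dir.x);
        int iy = (int)(src.y + t * dir.y);
        int iz = (int)(src.z + t * dir.z);
        if (ix >= 0 && ix < nx && iy >= 0 && iy < ny && iz >= 0 && iz < nz)
            acc += ct[(iz * ny + iy) * nx + ix] * step;
    }
    image[v * iw + u] = expf(-acc);         // Beer-Lambert attenuation
}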
GPU-Based Visualization of 3D Fluid Interfaces using Level Set Methods
NASA Astrophysics Data System (ADS)
Kadlec, B. J.
2009-12-01
We model a simple 3D fluid-interface problem using the level set method and visualize the interface as a dynamic surface. Level set methods allow implicit handling of complex topologies deformed by evolutions where sharp changes and cusps are present without destroying the representation. We present a highly optimized visualization and computation algorithm that is implemented in CUDA to run on the NVIDIA GeForce GTX 295. CUDA is a general purpose parallel computing architecture that allows the NVIDIA GPU to be treated like a data parallel supercomputer in order to solve many computational problems in a fraction of the time required on a CPU. CUDA is compared to the new OpenCL™ (Open Computing Language), which is designed to run on heterogeneous computing environments but does not take advantage of low-level features in NVIDIA hardware that provide significant speedups. Therefore, our technique is implemented using CUDA and results are compared to a single CPU implementation to show the benefits of using the GPU and CUDA for visualizing fluid-interface problems. We solve a 1024^3 problem and experience significant speedup using the NVIDIA GeForce GTX 295. Implementation details for mapping the problem to the GPU architecture are described as well as discussion on porting the technique to heterogeneous devices (AMD, Intel, IBM) using OpenCL. The results present a new interactive system for computing and visualizing the evolution of fluid interface problems on the GPU.
Bin recycling strategy for improving the histogram precision on GPU
NASA Astrophysics Data System (ADS)
Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.
2016-07-01
A histogram is an easily comprehensible way to present data and analyses. In the current scientific context, with access to large volumes of data, the processing time for building histograms has dramatically increased. For this reason, parallel construction is necessary to alleviate the impact of the processing time on analysis activities. In this scenario, GPU computing is becoming widely used for reducing the processing time of histogram construction to affordable levels. Alongside the reduction of processing time, implementations are stressed on bin-count accuracy. Accuracy issues arising from the particularities of the implementations are not usually taken into consideration when building histograms with very large data sets. In this work, a bin recycling strategy to create an accuracy-aware implementation for building histograms on GPU is presented. In order to evaluate the approach, this strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in Cosmology for the study of the Large Scale Structure of the Universe. As a consequence of the study, a high-accuracy implementation for histogram construction on GPU is proposed.
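A common accuracy-conscious GPU histogram pattern keeps per-block bins in shared memory with exact integer atomics and merges them into the global bins once per block; a minimal sketch in the spirit of, though not identical to, the bin recycling strategy (launch with nbins*sizeof(unsigned int) bytes of dynamic shared memory):

// Grid-stride histogram: per-block shared-memory bins keep counts exact
// (integer atomics) and cut contention on the global bins.
__global__ void histogram(const float* data, int n,
                          unsigned int* bins, int nbins,
                          float lo, float hi) {
    extern __shared__ unsigned int sbin[];          // one copy per block
    for (int i = threadIdx.x; i < nbins; i += blockDim.x) sbin[i] = 0u;
    __syncthreads();
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        int b = (int)((data[i] - lo) / (hi - lo) * nbins);
        if (b >= 0 && b < nbins)
            atomicAdd(&sbin[b], 1u);                // exact integer count
    }
    __syncthreads();
    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
        atomicAdd(&bins[i], sbin[i]);               // merge into global bins
}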
Comparison of GPU and FPGA hardware for HWIL scene generation and image processing
NASA Astrophysics Data System (ADS)
Eales, Craig R.; Swierkowski, Leszek
2009-05-01
Hardware-in-the-Loop (HWIL) simulation is becoming increasingly important for cost-effective testing of imaging infrared systems. DSTO is developing real-time scene generation and image processing capabilities within its HWIL simulation programs, based on the application of COTS desktop PCs equipped with Graphics Processing Unit (GPU) cards, and including limited use of Field Programmable Gate Arrays (FPGAs). GPUs and FPGAs are high-performance parallel computing machines but are fundamentally different types of hardware. To determine which hardware type should be used to implement a real-time solution of a given application, a methodology is required to expose the concurrency within the problem and to structure the problem in a way that can be mapped to the hardware types. In this paper we use parallel programming patterns to compare the architectures of recent generation GPUs and FPGAs. We demonstrate the decomposition of a parallel application and its implementation on GPU and FPGA hardware and present preliminary results.
Ensemble Smoother implemented in parallel for groundwater problems applications
NASA Astrophysics Data System (ADS)
Leyva, E.; Herrera, G. S.; de la Cruz, L. M.
2013-05-01
Data assimilation is a process that links forecasting models and measurements, drawing on the benefits of both sources. The Ensemble Kalman Filter (EnKF) is a sequential data-assimilation method that was designed to address two of the main problems related to the use of the Extended Kalman Filter (EKF) with nonlinear models in large state spaces, i.e., the need for a closure scheme and the massive computational requirements associated with the storage and subsequent integration of the error covariance matrix. The EnKF has gained popularity because of its simple conceptual formulation and relative ease of implementation. It has been used successfully in various applications in meteorology and oceanography, and more recently in petroleum engineering and hydrogeology. The Ensemble Smoother (ES) is a method similar to the EnKF; it was proposed by Van Leeuwen and Evensen (1996). Herrera (1998) proposed a version of the ES which we call the Ensemble Smoother of Herrera (ESH) to distinguish it from the former. It was introduced for space-time optimization of groundwater monitoring networks. In recent years, this method has been used for data assimilation and parameter estimation in groundwater flow and transport models. The ES method uses Monte Carlo simulation, which consists of generating repeated realizations of the random variable considered, using a flow and transport model. However, often a large number of model runs is required for the moments of the variable to converge. Therefore, depending on the complexity of the problem, a serial computer may require many hours of continuous use to apply the ES. For this reason, the process must be parallelized in order to complete in a reasonable time. In this work we present the results of a parallelization strategy to reduce the execution time when a large number of realizations is required. The software GWQMonitor by Herrera (1998) implements all the algorithms required for the ESH in Fortran 90. We develop a script in Python using mpi4py, in
Cickovski, Trevor; Flor, Tiffany; Irving-Sachs, Galen; Novikov, Philip; Parda, James; Narasimhan, Giri
2015-01-01
In order to make multiple copies of a target sequence in the laboratory, the technique of Polymerase Chain Reaction (PCR) requires the design of "primers", which are short fragments of nucleotides complementary to the flanking regions of the target sequence. If the same primer is to amplify multiple closely related target sequences, then it is necessary to make the primers "degenerate", which would allow them to hybridize to target sequences with a limited amount of variability that may have been caused by mutations. However, the PCR technique can only allow a limited amount of degeneracy, and therefore the design of degenerate primers requires the identification of reasonably well-conserved regions in the input sequences. We take an existing algorithm for designing degenerate primers that is based on clustering and parallelize it in a web-accessible software package GPUDePiCt, using a shared memory model and the computing power of Graphics Processing Units (GPUs). We test our implementation on large sets of aligned sequences from the human genome and show a multi-fold speedup for clustering using our hybrid GPU/CPU implementation over a pure CPU approach for these sequences, which consist of more than 7,500 nucleotides. We also demonstrate that this speedup is consistent over larger numbers and longer lengths of aligned sequences. PMID:26357230
GPU computing with Kaczmarz’s and other iterative algorithms for linear systems
Elble, Joseph M.; Sahinidis, Nikolaos V.; Vouzis, Panagiotis
2009-01-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. PMID:20526446
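Kaczmarz's method sweeps over the rows of Ax = b, projecting the iterate onto each row's hyperplane; in a GPU version the inner products and row updates become the parallel kernels. A minimal host-side sketch of one sweep (dense storage for clarity, not the paper's banded GPU code):

#include <vector>
#include <cstddef>

// One Kaczmarz sweep over all rows of a dense m-by-n system Ax = b:
//   x <- x + (b_i - <a_i, x>) / ||a_i||^2 * a_i   for each row i.
// On the GPU, the dot product and the AXPY-style update each map to a kernel.
void kaczmarz_sweep(const std::vector<double>& A, const std::vector<double>& b,
                    std::vector<double>& x, std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i) {
        const double* ai = &A[i * n];
        double dot = 0.0, nrm2 = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            dot  += ai[j] * x[j];
            nrm2 += ai[j] * ai[j];
        }
        double scale = (b[i] - dot) / nrm2;     // signed projection distance
        for (std::size_t j = 0; j < n; ++j) x[j] += scale * ai[j];
    }
}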
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics-card-based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to a 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× once the thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerate DFC analyses significantly. The developed algorithms make DFC analyses more practical for multi-subject studies with more dynamic analyses. PMID:25805449
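The block-based approach assigns each sliding window to its own thread, which computes the Pearson correlation of two regional time courses; a minimal sketch of our own, assuming nwin = nt - wlen + 1 windows for time courses of length nt:

// One thread per sliding window: Pearson correlation of time courses x and y
// over the window of length wlen starting at this thread's window index.
__global__ void dfc_windows(const float* x, const float* y, float* r,
                            int wlen, int nwin) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nwin) return;
    float sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int t = w; t < w + wlen; ++t) {
        sx += x[t];  sy += y[t];
        sxx += x[t] * x[t];  syy += y[t] * y[t];  sxy += x[t] * y[t];
    }
    float n = (float)wlen;
    float cov = sxy - sx * sy / n;                  // scaled covariance
    float vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    r[w] = cov * rsqrtf(vx * vy);                   // Pearson r for this window
}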
The gputools package enables GPU computing in R
Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan
2010-01-01
Motivation: By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. Results: R users can take advantage of the better performance provided by an Nvidia GPU. Availability: The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu Contact: bucknerj@umich.edu PMID:19850754
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
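The FDTD update is a stencil that maps one thread to one grid cell; a minimal 2D TMz sketch of the E-field update (the MRI simulations above are 3D, with per-voxel material coefficients such as conductivity and permittivity folded into ca and cb):

// 2D TMz update of Ez from the curl of H, one thread per cell; ca and cb
// carry the per-cell material coefficients of the update equation.
__global__ void update_ez(float* ez, const float* hx, const float* hy,
                          const float* ca, const float* cb, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= nx || j <= 0 || j >= ny) return;   // skip boundary
    int c = j * nx + i;
    float curl = (hy[c] - hy[c - 1]) - (hx[c] - hx[c - nx]);
    ez[c] = ca[c] * ez[c] + cb[c] * curl;
}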
Brabec, Jiri; Pittner, Jiri; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol
2012-02-01
A novel algorithm for implementing general type of multireference coupled-cluster (MRCC) theory based on the Jeziorski-Monkhorst exponential Ansatz [B. Jeziorski, H.J. Monkhorst, Phys. Rev. A 24, 1668 (1981)] is introduced. The proposed algorithm utilizes processor groups to calculate the equations for the MRCC amplitudes. In the basic formulation each processor group constructs the equations related to a specific subset of references. By flexible choice of processor groups and subset of reference-specific sufficiency conditions designated to a given group one can assure optimum utilization of available computing resources. The performance of this algorithm is illustrated on the examples of the Brillouin-Wigner and Mukherjee MRCC methods with singles and doubles (BW-MRCCSD and Mk-MRCCSD). A significant improvement in scalability and in reduction of time to solution is reported with respect to recently reported parallel implementation of the BW-MRCCSD formalism [J.Brabec, H.J.J. van Dam, K. Kowalski, J. Pittner, Chem. Phys. Lett. 514, 347 (2011)].
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach
Tian, Z; Shi, F; Jia, X; Jiang, S; Peng, F
2014-06-01
Purpose: GPU has been employed to speed up VMAT optimizations from hours to minutes. However, its limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose multi-GPU-based VMAT optimization to solve this memory issue to make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred to as the pricing problem, PP) and optimizing aperture intensities (referred to as the master problem, MP). The PP requires access to the large DDC matrix, which is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of beam angles and is only responsible for calculation related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. MP is a relatively small-scale problem and is implemented on one GPU. One head-and-neck cancer case was used for testing. Three different strategies for VMAT optimization on a single GPU were also implemented for comparison: (S1) truncating the DDC matrix to ignore its small-value entries for optimization; (S2) transferring the DDC matrix part by part to the GPU during optimizations whenever needed; (S3) moving DDC-matrix-related calculation onto the CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality is worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as in our method. However, the computation time is longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPU-based ultra-fast VMAT planning practical for real clinical use.
The experience of GPU calculations at Lunarc
NASA Astrophysics Data System (ADS)
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever increasing demand for computational speed and use of ever larger datasets, multi-GPU installations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually led to the procurement of the GPU resource Timaeus, which is a four-node eight-GPU cluster with two Nvidia m2050 GPU cards per node. The resource is housed within the larger cluster Platon and shares disk, network and system resources with that cluster. The inauguration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised a two-day workshop on GPU computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Luleå in fields ranging from astrophysics to molecular chemistry. We are investigating the use of GPUs with commercial software packages and user-supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU computing. It is clear that the combination of GPUs/CPUs is becoming an important part of high performance computing, and here we describe what has been done at Lunarc regarding GPU computations and how we will continue to investigate the new and coming multi-GPU servers and how they can be utilized in our environment.
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees and automatically balances the level of parallelism against intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms could handle.
HOOMD-blue - scaling up from one desktop GPU to Titan
NASA Astrophysics Data System (ADS)
Glaser, Jens; Anderson, Joshua A.; Glotzer, Sharon C.
2014-03-01
Scaling molecular dynamics simulations from one to many GPUs presents unique challenges. Due to the high parallel efficiency of a single GPU, communication processes become a bottleneck when multiple GPUs are combined in parallel and limit scaling. We show how the fastest general-purpose molecular dynamics code currently available for single GPUs, HOOMD-blue, has been extended using spatial domain decomposition to run efficiently on tens or hundreds of GPUs. A key to parallel efficiency is a highly optimized communication pattern using locally load-balancing algorithms fully implemented on the GPU. We will discuss comparisons to other state-of-the-art codes (LAMMPS) and present preliminary benchmarks on the Titan supercomputer.
Ha, S.; Matej, S.; Ispiryan, M.; Mueller, K.
2013-01-01
We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis-aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit GPU instruction-level parallelism to efficiently hide the long latencies of the memory operations. Our experiments indicate that our GPU implementation of the projection operators is slightly faster than, or approximately comparable in runtime to, FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant kernels, neither of which an FFT-based approach can cope with. PMID:23531763
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R; Dongarra, Jack J
2015-01-01
As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5x speedup on the K40 GPU.
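As an illustration of the batched idea, the sketch below runs many small QR factorizations in a single call using the stock cuBLAS routine cublasSgeqrfBatched (this is the vendor primitive, not the authors' tuned implementation; dimensions assume m >= n, matrix filling and cleanup are omitted).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void batchedQR(int m, int n, int batch) {
    cublasHandle_t h;  cublasCreate(&h);
    std::vector<float*> hA(batch), hTau(batch);
    for (int i = 0; i < batch; ++i) {             // one small matrix per problem
        cudaMalloc(&hA[i],   sizeof(float) * m * n);
        cudaMalloc(&hTau[i], sizeof(float) * n);  // Householder scalars
        // ... fill hA[i] with the i-th matrix (omitted) ...
    }
    float **dA, **dTau;                           // device arrays of pointers
    cudaMalloc(&dA,   batch * sizeof(float*));
    cudaMalloc(&dTau, batch * sizeof(float*));
    cudaMemcpy(dA,   hA.data(),   batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dTau, hTau.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    int info = 0;                                 // host-side status flag
    cublasSgeqrfBatched(h, m, n, dA, m, dTau, &info, batch);
    cudaDeviceSynchronize();
    cublasDestroy(h);
}
```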
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing amount of datasets from different domains, such as social media, earth observation and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, with GPU-based technology maturing in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to the traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as specific graphics-targeted devices, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies. References: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics.
NASA Astrophysics Data System (ADS)
Wu, Q.; Xiong, F.; Wang, F.; Xiong, Y.
2016-10-01
In order to reduce the computational time, a fully parallel implementation of the particle swarm optimization (PSO) algorithm on a graphics processing unit (GPU) is presented. Instead of being executed on the central processing unit (CPU) sequentially, PSO is executed in parallel via the GPU on the compute unified device architecture (CUDA) platform. The processes of fitness evaluation, updating of velocity and position of all particles are all parallelized and introduced in detail. Comparative studies on the optimization of four benchmark functions and a trajectory optimization problem are conducted by running PSO on the GPU (GPU-PSO) and CPU (CPU-PSO). The impact of design dimension, number of particles and size of the thread-block in the GPU and their interactions on the computational time is investigated. The results show that the computational time of the developed GPU-PSO is much shorter than that of CPU-PSO, with comparable accuracy, which demonstrates the remarkable speed-up capability of GPU-PSO.
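A minimal sketch of the parallelized update step, assuming one thread per (particle, dimension) pair and pre-initialized cuRAND states; the inertia and acceleration constants w, c1, c2 are the standard PSO parameters, and all names are illustrative rather than taken from the paper.

```cuda
#include <curand_kernel.h>

// Velocity and position update for every (particle, dimension) pair.
// pbest is laid out like x (nParticles * dim); gbest has length dim.
__global__ void psoUpdate(float* x, float* v, const float* pbest,
                          const float* gbest, curandState* rng,
                          int nParticles, int dim,
                          float w, float c1, float c2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles * dim) return;
    int d = i % dim;                        // which dimension this thread owns
    float r1 = curand_uniform(&rng[i]);     // per-thread RNG state, assumed
    float r2 = curand_uniform(&rng[i]);     // initialized with curand_init
    v[i] = w * v[i]
         + c1 * r1 * (pbest[i] - x[i])      // pull toward personal best
         + c2 * r2 * (gbest[d] - x[i]);     // pull toward global best
    x[i] += v[i];
}
```

The fitness-evaluation kernel (one thread or block per particle) and the global-best reduction follow the same one-thread-per-work-item pattern the abstract describes.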
MASSIVELY PARALLEL LATENT SEMANTIC ANALYSES USING A GRAPHICS PROCESSING UNIT
Cavanagh, J.; Cui, S.
2009-01-01
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to a traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000 × 1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the overall speeds for the CPU version did not vary from the norm when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
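A hedged illustration of the divisible-by-16 observation: one way to exploit it is to zero-pad both operands up to the next multiple of 16 before a cuBLAS GEMM, as in the sketch below (an assumption-driven example, not the authors' code; matrices are treated as n x n and column-major, as cuBLAS expects).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

static int padTo16(int n) { return (n + 15) & ~15; }  // round up to 16

void paddedGemm(const float* A, const float* B, float* C, int n) {
    int np = padTo16(n);                           // padded dimension
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * np * np);
    cudaMalloc(&dB, sizeof(float) * np * np);
    cudaMalloc(&dC, sizeof(float) * np * np);
    cudaMemset(dA, 0, sizeof(float) * np * np);    // zero the padding
    cudaMemset(dB, 0, sizeof(float) * np * np);
    // copy the n x n payload into the np x np buffers, column by column
    cudaMemcpy2D(dA, np * sizeof(float), A, n * sizeof(float),
                 n * sizeof(float), n, cudaMemcpyHostToDevice);
    cudaMemcpy2D(dB, np * sizeof(float), B, n * sizeof(float),
                 n * sizeof(float), n, cudaMemcpyHostToDevice);
    cublasHandle_t h;  cublasCreate(&h);
    const float one = 1.f, zero = 0.f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, np, np, np,
                &one, dA, np, dB, np, &zero, dC, np);
    cudaMemcpy2D(C, n * sizeof(float), dC, np * sizeof(float),
                 n * sizeof(float), n, cudaMemcpyDeviceToHost);
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```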
NASA Technical Reports Server (NTRS)
Barnard, Stephen T.; Simon, Horst; Lasinski, T. A. (Technical Monitor)
1994-01-01
The design of a parallel implementation of multilevel recursive spectral bisection is described. The goal is to implement a code that is fast enough to enable dynamic repartitioning of adaptive meshes.
NASA Astrophysics Data System (ADS)
Sakurai, Kazuyuki; Kyo, Shorin; Okazaki, Shin'ichiro
This paper describes the real-time implementation of a vision-based overtaking vehicle detection method for driver assistance systems using IMAPCAR, a highly parallel SIMD linear array processor. The implemented overtaking vehicle detection method is based on optical flows detected by block matching using SAD and detection of the flows' vanishing point. The implementation is done efficiently by taking advantage of the parallel SIMD architecture of IMAPCAR. As a result, video-rate (33 frames/s) implementation could be achieved.
Parallel Implementation of a High Order Implicit Collocation Method for the Heat Equation
NASA Technical Reports Server (NTRS)
Kouatchou, Jules; Halem, Milton (Technical Monitor)
2000-01-01
We combine a high order compact finite difference approximation and collocation techniques to numerically solve the two dimensional heat equation. The resulting method is implicit and can be parallelized with a strategy that allows parallelization across both time and space. We compare the parallel implementation of the new method with a classical implicit method, namely the Crank-Nicolson method, where the parallelization is done across space only. Numerical experiments are carried out on the SGI Origin 2000.
Scalable Unix commands for parallel processors : a high-performance implementation.
Ong, E.; Lusk, E.; Gropp, W.
2001-06-22
We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
GPU-accelerated Monte-Carlo modeling for fluorescence propagation in turbid medium
NASA Astrophysics Data System (ADS)
Yi, Xi; Chen, Weiting; Wu, Linhui; Ma, Wenjuan; Zhang, Wei; Li, Jiao; Wang, Xin; Gao, Feng
2012-03-01
In biomedical optics, the Monte Carlo (MC) simulation is widely recognized as a gold standard for its high accuracy and versatility. However, in the fluorescence regime, due to the requirement of tracing a huge number of consecutive events of excitation photon migration, the excitation-to-emission conversion, and the resultant fluorescent photon migration in tissue, the MC method is prohibitively time-consuming, especially when the tissue has an optically heterogeneous structure. To overcome this difficulty, we present a parallel implementation of MC modeling for fluorescence propagation in tissue, on the basis of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform. By rationalizing the distribution of blocks and threads, a certain number of photon migration procedures can be processed synchronously and efficiently, with the single-instruction-multiple-thread execution mode of the GPU. We have evaluated the implementation for both homogeneous and heterogeneous scenarios by comparing with conventional CPU implementations, and shown that the GPU method can obtain a significant acceleration of about 20-30 times for fluorescence modeling in tissue, indicating that GPU-based fluorescence MC simulation can be a practically effective tool for methodological investigations of tissue fluorescence spectroscopy and imaging.
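A simplified sketch of the photon-per-thread pattern, assuming isotropic scattering in a homogeneous slab and pre-initialized cuRAND states; photon weights, anisotropy, and the excitation-to-emission conversion step of the paper are deliberately omitted.

```cuda
#include <curand_kernel.h>

// Each thread carries one photon through scattering steps, scoring
// visits into depth bins of width `cell` (gridN bins along z).
__global__ void photonWalk(curandState* rng, float* fluence,
                           int nPhotons, float mu_t, int gridN, float cell) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPhotons) return;
    curandState s = rng[i];
    float x = 0.f, y = 0.f, z = 0.f;           // launch at origin
    float ux = 0.f, uy = 0.f, uz = 1.f;        // initial direction +z
    for (int step = 0; step < 1000; ++step) {
        float len = -logf(curand_uniform(&s)) / mu_t;  // free path length
        x += ux * len;  y += uy * len;  z += uz * len;
        int iz = (int)(z / cell);
        if (iz < 0 || iz >= gridN) break;      // photon left the medium
        atomicAdd(&fluence[iz], 1.0f);         // score a visit in depth bin
        // isotropic scattering: draw a new random direction
        float cosT = 2.f * curand_uniform(&s) - 1.f;
        float phi  = 6.2831853f * curand_uniform(&s);
        float sinT = sqrtf(1.f - cosT * cosT);
        ux = sinT * cosf(phi);  uy = sinT * sinf(phi);  uz = cosT;
    }
    rng[i] = s;                                // persist RNG state
}
```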
Accelerating Satellite Image Based Large-Scale Settlement Detection with GPU
Patlolla, Dilip Reddy; Cheriyadat, Anil M; Weaver, Jeanette E; Bright, Eddie A
2012-01-01
Computer vision algorithms for image analysis are often computationally demanding. Application of such algorithms on large image databases, such as high-resolution satellite imagery covering the entire land surface, can easily saturate the computational capabilities of conventional CPUs. There is a great demand for vision algorithms running on high performance computing (HPC) architectures capable of processing petascale image data. We exploit the parallel processing capability of GPUs to present a GPU-friendly algorithm for robust and efficient detection of settlements from large-scale high-resolution satellite imagery. Feature descriptor generation is an expensive but key step in automated scene analysis. To address this challenge, we present GPU implementations for three different feature descriptors: multiscale Histogram of Oriented Gradients (HOG), Gray Level Co-Occurrence Matrix (GLCM) contrast, and local pixel intensity statistics. We perform extensive experimental evaluations of our implementation using diverse and large image datasets. Our GPU implementation of the feature descriptor algorithms results in speedups of 220 times compared to the CPU version. We present a highly efficient settlement detection system running on a multi-GPU architecture capable of extracting human settlement regions from city-scale sub-meter spatial resolution aerial imagery spanning roughly 1200 sq. kilometers in just 56 seconds, with detection accuracy close to 90%. This remarkable speedup, gained while maintaining high detection accuracy, clearly demonstrates that such computational advancements hold the solution for petascale image analysis challenges.
On Efficient Parallel Implementation of Moving Body Overset Grid Methods
NASA Technical Reports Server (NTRS)
Wissink, Andrew M.; Meakin, Robert L.; Warmbrodt, William (Technical Monitor)
1997-01-01
An investigation into the parallel performance of moving-body overset grid methods will be presented. Parallel versions of the OVERFLOW flow solver, the DCF3D domain connectivity software, and the SIXDOF six-degree-of-freedom routine are coupled with an automatic load balance routine and tested for 3D Navier-Stokes calculations on the IBM SP2. The primary source of parallel inefficiency in moving-body problems is the domain connectivity cost of DCF3D. Although this algorithm constitutes a relatively low fraction of the total solution cost (e.g. 10-20%) in calculations on serial machines, these costs consequently cause a significant degradation in the overall parallel performance. The paper will highlight some approaches for improving the scalability of DCF3D. The paper will present results of a proposed new load balancing scheme that seeks a more equal distribution of the inter-grid boundary points in order to more evenly balance the donor search costs associated with DCF3D. Some preliminary results will also be given from a new solution-adaption algorithm coupled with OVERFLOW which incorporates overset Cartesian grids with various levels of refinement. The measured parallel performance from a descending delta-wing configuration and a generic store-separation from a wing/pylon case will be presented.
A parallel implementation of kriging with a trend
Gajraj, A.; Joubert, W.; Jones, J.
1997-11-01
This paper describes the parallelization of the GSLIB ktb3dm code. The code is parallelized using the message passing paradigm, Parallel Virtual Machine (PVM), under a Multiple Instructions, Multiple Data (MIMD) architecture. The code performance is analyzed using different grid sizes of 5x5x1, 50x50x1, 100x100x1 and 500x500x1 with 1, 2, 4, 8 and in some cases 16 processors on the Cray T3D supercomputer. The parallelization effort focused on the main kriging do loop. The results confirm that there is a substantial benefit to be derived in terms of CPU time savings (or execution speed) by using the parallel version of the code, especially when considering larger grids. Additionally, speed-up and scalability analyses show that actual speed-up is close to theoretical, while the code scales appropriately within the 1 to 16 processor range tested. The kriging of a quarter-million grid cell system fell from over 9 CPU minutes on one Cray T3D processor to about 1.25 CPU minutes on 16 processors on the same machine.
GPU-accelerated 3D neutron diffusion code based on finite difference method
Xu, Q.; Yu, G.; Wang, K.
2012-07-01
The finite difference method, as a traditional numerical solution to the neutron diffusion equation, although considered simpler and more precise than coarse-mesh nodal methods, has a bottleneck to wide application caused by the huge memory and long computation time it requires. In recent years, the concept of general-purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Preconditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 times was obtained, using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and the Chebyshev extrapolation technique. (authors)
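For flavor, a hedged single-group sketch of one Jacobi sweep for a 7-point finite-difference discretization of a diffusion-type equation, (6/h^2 + sigma) * phi_c = S + (1/h^2) * (sum of the six neighbors); the multigroup coupling and boundary treatment of the actual code are omitted, and all names are illustrative.

```cuda
// One Jacobi update per interior grid point; h2inv = 1/h^2 and sigma
// plays the role of the removal cross section.
__global__ void jacobi3d(const float* phi, float* phiNew, const float* src,
                         int nx, int ny, int nz, float h2inv, float sigma) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || j <= 0 || k <= 0 || i >= nx-1 || j >= ny-1 || k >= nz-1)
        return;                                  // interior points only
    size_t c = ((size_t)k * ny + j) * nx + i;
    float nbr = phi[c-1] + phi[c+1]              // x neighbors
              + phi[c-nx] + phi[c+nx]            // y neighbors
              + phi[c-(size_t)nx*ny] + phi[c+(size_t)nx*ny];  // z neighbors
    phiNew[c] = (src[c] + h2inv * nbr) / (6.0f * h2inv + sigma);
}
```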
The OpenMP Implementation of NAS Parallel Benchmarks and its Performance
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Frumkin, Michael; Yan, Jerry
1999-01-01
As the new ccNUMA architecture became popular in recent years, parallel programming with compiler directives on these machines has evolved to accommodate new needs. In this study, we examine the effectiveness of OpenMP directives for parallelizing the NAS Parallel Benchmarks. Implementation details will be discussed and performance will be compared with the MPI implementation. We have demonstrated that OpenMP can achieve very good results for parallelization on a shared memory system, but effective use of memory and cache is very important.
Acceleration of 3D Finite Difference AWP-ODC for seismic simulation on GPU Fermi Architecture
NASA Astrophysics Data System (ADS)
Zhou, J.; Cui, Y.; Choi, D.
2011-12-01
AWP-ODC, a highly scalable parallel finite-difference application, enables petascale 3D earthquake calculations. This application generates realistic dynamic earthquake source descriptions and detailed physics-based anelastic ground motions at frequencies pertinent to safe building design. In 2010, the code achieved M8, a full dynamical simulation of a magnitude-8 earthquake on the southern San Andreas fault up to 2 Hz, the largest-ever earthquake simulation. Building on the success of the previous work, we have implemented CUDA in AWP-ODC to accelerate wave propagation on the GPU platform. Our CUDA development aims at aggressive parallel efficiency, with optimized global and shared memory access to make the best use of the GPU memory hierarchy. The benchmark on NVIDIA Tesla C2050 graphics cards demonstrated a speedup of many tens of times in single precision compared to the serial implementation at a test problem size, while an MPI-CUDA implementation is in progress to extend our solver to multi-GPU clusters. Our CUDA implementation has been carefully verified for accuracy.
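The shared-memory optimization mentioned above typically looks like the following hedged 2D sketch: a tile plus a one-cell halo is staged in shared memory so neighbor reads hit on-chip memory instead of global memory (launch with blockDim = (TILE, TILE); the real 3D velocity-stress stencils of AWP-ODC are far more involved).

```cuda
#define TILE 16

__global__ void tiledStencil(const float* in, float* out, int nx, int ny) {
    __shared__ float tile[TILE + 2][TILE + 2];   // tile plus halo
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    bool inb = (gx < nx && gy < ny);
    tile[ly][lx] = inb ? in[gy * nx + gx] : 0.0f;          // interior load
    if (threadIdx.x == 0)                                   // left halo
        tile[ly][0]      = (inb && gx > 0)      ? in[gy*nx + gx-1]   : 0.f;
    if (threadIdx.x == TILE-1)                              // right halo
        tile[ly][TILE+1] = (inb && gx < nx-1)   ? in[gy*nx + gx+1]   : 0.f;
    if (threadIdx.y == 0)                                   // top halo
        tile[0][lx]      = (inb && gy > 0)      ? in[(gy-1)*nx + gx] : 0.f;
    if (threadIdx.y == TILE-1)                              // bottom halo
        tile[TILE+1][lx] = (inb && gy < ny-1)   ? in[(gy+1)*nx + gx] : 0.f;
    __syncthreads();
    if (inb && gx > 0 && gx < nx-1 && gy > 0 && gy < ny-1)
        out[gy * nx + gx] = 0.25f * (tile[ly][lx-1] + tile[ly][lx+1]
                                   + tile[ly-1][lx] + tile[ly+1][lx]);
}
```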
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
GPU-based Integration with Application in Sensitivity Analysis
NASA Astrophysics Data System (ADS)
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies), whose aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of environmental modelling and environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned, qualitatively or quantitatively, to different sources of variation in the input of a model. One of the important classes of methods for sensitivity analysis is Monte Carlo based, first proposed by Sobol and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian Model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM, and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 10^9 equations at every time step, and the number of time steps can be more than a million), which motivates a grid implementation. The MCSAES grid implementation scheme includes two main tasks: (i) grid implementation of the DEM, (ii) grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of the Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but for parallel implementations. Good scrambling is
Parallel processing implementation for the coupled transport of photons and electrons using OpenMP
NASA Astrophysics Data System (ADS)
Doerner, Edgardo
2016-05-01
In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport of photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the development tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out on a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory, both in terms of scalability and parallelization efficiency.
NASA Astrophysics Data System (ADS)
Liu, Yen-Fu; Lo, Nai-Wei; Subbarao, Murali; Carlson, Bradley S.
1998-07-01
A unified approach to image focus and defocus analysis (UFDA) was proposed recently for three-dimensional shape and focused image recovery of objects. One version of this approach which yields very accurate results is highly computationally intensive. In this paper we present a parallel implementation of this version of UFDA on the Parallel Virtual Machine (PVM). One of the most computationally intensive parts of the UFDA approach is the estimation of the image data that would be recorded by a camera for a given solution for the 3D shape and focused image. This computational step has to be repeated once during each iteration of the optimization algorithm. Therefore this step has been sped up by using the Parallel Virtual Machine (PVM). PVM is a software package that allows a heterogeneous network of parallel and serial computers to appear as a single concurrent computational resource. In our experimental environment PVM is installed on four UNIX workstations communicating over Ethernet to exploit parallel processing capability. Experimental results show that the communication overhead in this case is relatively low. An average speedup of 1.92 is attained by the parallel UFDA algorithm running on 2 PVM-connected computers compared to the execution time of sequential processing. By applying the UFDA algorithm on 4 PVM-connected machines, an average speedup of 3.44 is reached. This demonstrates a practical application of PVM to 3D machine vision.
NASA Astrophysics Data System (ADS)
Wang, Jing-Hui; Kong, Guang-Qian; Liu, Cai-Hong
PVM (Parallel Virtual Machine) is a library used to process large data sets on networks of machines. This paper aims to achieve a high-performance solution that exploits the PVM library and parallel computers to solve the ICA (Independent Component Analysis) problem. The paper presents parallel power ICA implementations for decomposing data sets. Power iteration (PI) is an algorithm for independent component analysis which has some desirable features: it offers higher performance and data capacity than current sequential implementations. We show a power iteration algorithm whose learning update takes the form of a matrix transformation. From the power iteration algorithm, we develop a parallel power iteration algorithm and implement a parallel component decomposition solution. Finally, experimental results, analysis and future plans are presented.
NASA Astrophysics Data System (ADS)
Dardikman, Gili; Shaked, Natan T.
2016-03-01
We present highly parallel and efficient algorithms for real-time reconstruction of the quantitative three-dimensional (3-D) refractive-index maps of biological cells without labeling, as obtained from the interferometric projections acquired by tomographic phase microscopy (TPM). The new algorithms are implemented on the graphics processing unit (GPU) of the computer using the CUDA programming environment. The reconstruction process includes two main parts. First, we used parallel complex wave-front reconstruction of the TPM-based interferometric projections acquired at various angles. The complex wave-front reconstructions are done on the GPU in parallel, while minimizing the calculation time of the Fourier transforms and phase unwrapping needed. Next, we implemented on the GPU in parallel the 3-D refractive-index map retrieval using the TPM filtered back projection algorithm. The incorporation of algorithms that are inherently parallel with a programming environment such as Nvidia's CUDA makes it possible to obtain a real-time processing rate, and enables a high-throughput platform for label-free, 3-D cell visualization and diagnosis.
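A hedged sketch of the pixel-driven back-projection step: one thread per output pixel integrates the filtered projections over all acquisition angles. A 2D parallel-beam geometry with nearest-neighbor detector sampling is shown for brevity; the paper's 3-D refractive-index retrieval adds a dimension and the TPM-specific filtering, and all names here are illustrative.

```cuda
// sino: filtered sinogram (nAngles rows of nDet detector samples).
// img:  n x n output image. angles: acquisition angles in radians.
__global__ void backProject(const float* sino, float* img,
                            const float* angles, int nAngles,
                            int nDet, int n) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= n || py >= n) return;
    float x = px - n / 2.0f, y = py - n / 2.0f;   // centered coordinates
    float sum = 0.0f;
    for (int a = 0; a < nAngles; ++a) {
        float t = x * cosf(angles[a]) + y * sinf(angles[a]);  // detector coord
        int det = (int)(t + nDet / 2.0f);          // nearest-neighbor sample
        if (det >= 0 && det < nDet)
            sum += sino[a * nDet + det];
    }
    img[py * n + px] = sum * (3.14159265f / nAngles);  // angular weighting
}
```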
Naveros, Francisco; Luque, Niceto R; Garrido, Jesús A; Carrillo, Richard R; Anguita, Mancia; Ros, Eduardo
2015-07-01
Time-driven simulation methods in traditional CPU architectures perform well and precisely when simulating small-scale spiking neural networks. Nevertheless, they still have drawbacks when simulating large-scale systems. Conversely, event-driven simulation methods in CPUs and time-driven simulation methods in graphic processing units (GPUs) can outperform CPU time-driven methods under certain conditions. With this performance improvement in mind, we have developed an event-and-time-driven spiking neural network simulator suitable for a hybrid CPU-GPU platform. Our neural simulator is able to efficiently simulate bio-inspired spiking neural networks consisting of different neural models, which can be distributed heterogeneously in both small layers and large layers or subsystems. For the sake of efficiency, the low-activity parts of the neural network can be simulated in CPU using event-driven methods while the high-activity subsystems can be simulated in either CPU (a few neurons) or GPU (thousands or millions of neurons) using time-driven methods. In this brief, we have undertaken a comparative study of these different simulation methods. For benchmarking the different simulation methods and platforms, we have used a cerebellar-inspired neural-network model consisting of a very dense granular layer and a Purkinje layer with a smaller number of cells (according to biological ratios). Thus, this cerebellar-like network includes a dense diverging neural layer (increasing the dimensionality of its internal representation and sparse coding) and a converging neural layer (integration) similar to many other biologically inspired and also artificial neural networks.
GPU Lossless Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO- 42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
Massively parallel implementation of the multi-reference Brillouin-Wigner CCSD method
Brabec, Jiri; Krishnamoorthy, Sriram; van Dam, Hubertus JJ; Kowalski, Karol; Pittner, Jiri
2011-10-06
This paper reports the parallel implementation of the Brillouin-Wigner Multi-Reference Coupled Cluster method with Single and Double excitations (BW-MRCCSD). Preliminary tests for systems composed of 304 and 440 correlated orbitals demonstrate the performance of our implementation across 1000 cores and clearly indicate the advantages of using improved task scheduling. Possible ways of further improving the parallel performance are also delineated.
Locality-Driven Dynamic GPU Cache Bypassing
Li, Chao; Song, Shuaiwen; Dai, Hongwen; Sidelnik, A.; Hari, Siva; Zhou, Huiyang
2015-06-07
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. Based on the reuse characteristics of GPU workloads, we propose a design that integrates such efficient locality filtering capability into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
Implementing hybrid MPI/OpenMP parallelism in Fluidity
NASA Astrophysics Data System (ADS)
Gorman, Gerard; Lange, Michael; Avdis, Alexandros; Guo, Xiaohu; Mitchell, Lawrence; Weiland, Michele
2014-05-01
Parallelising finite element codes using domain decomposition methods and MPI has nearly become routine at the application code level. This has been helped in no small part by the development of an eco-system of open source libraries providing key functionality, for example SCOTCH for graph partitioning or PETSc for sparse iterative solvers. As we move to an era where pure MPI no longer suffices, application developers can no longer focus only on the application code, but must consider the full software stack. In the case of Fluidity (an open source control volume/finite element general purpose fluid dynamics code), the decision to improve parallel efficiency by moving to a hybrid MPI/OpenMP programming model made it necessary to get involved in extending third-party open source libraries, specifically PETSc, in addition to the application code itself. The effort involved in re-engineering a large application code highlights the fact that as computing platforms continue their advance towards low power many-core processors, the software stack must also develop at a similar pace or application codes will suffer. In this presentation we will illustrate the steps required to re-engineer Fluidity to achieve good parallel efficiency when using MPI/OpenMP. We identify performance pitfalls when using Fortran features such as automatic arrays in a multi-threaded context, as well as poor data locality on NUMA platforms. A significant proportion of the computational cost is in the sparse iterative solvers. For this we collaborated with the development team at Argonne National Laboratory to add OpenMP support to PETSc. We will present performance results for both the application as a whole, as well as for key individual components such as matrix assembly and the solvers. We also show that while we did not explicitly target I/O for optimisation here, its performance is nonetheless greatly improved because of fewer processes accessing the file system. One of the main remaining
GPUPEGAS: A NEW GPU-ACCELERATED HYDRODYNAMIC CODE FOR NUMERICAL SIMULATIONS OF INTERACTING GALAXIES
Kulikov, Igor
2014-09-01
In this paper, a new scalable hydrodynamic code, GPUPEGAS (GPU-accelerated Performance Gas Astrophysical Simulation), for the simulation of interacting galaxies is proposed. The details of a parallel numerical method co-design are described. A speed-up of 55 times was obtained within a single GPU accelerator. The use of 60 GPU accelerators resulted in 96% parallel efficiency. A collisionless hydrodynamic approach has been used for modeling of stars and dark matter. The scalability of the GPUPEGAS code is shown.
NASA Astrophysics Data System (ADS)
Owusu-Banson, Derek
In recent times, a variety of industries, applications and numerical methods, including the meshless method, have enjoyed a great deal of success by utilizing the graphical processing unit (GPU) as a parallel coprocessor. These benefits often include performance improvements over previous implementations. Furthermore, applications running on graphics processors enjoy superior performance per dollar and performance per watt compared to implementations built exclusively on traditional central processing technologies. The GPU was originally designed for graphics acceleration, but the modern GPU, known as the General Purpose Graphical Processing Unit (GPGPU), can be used for scientific and engineering calculations. The GPGPU consists of a massively parallel array of integer and floating point processors. There are typically hundreds of processors per graphics card with dedicated high-speed memory. This work describes an application written by the author, titled GaussianRBF, to show the implementation and results of a novel meshless method that incorporates collocation of the Gaussian radial basis function by utilizing the GPU as a parallel co-processor. Key phases of the proposed meshless method have been executed on the GPU using the NVIDIA CUDA software development kit. In particular, the matrix fill and solution phases have been carried out on the GPU, along with some post-processing. This approach resulted in a decreased processing time compared to a similar algorithm implemented on the CPU, while maintaining the same accuracy.
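A minimal sketch of the GPU matrix-fill phase for Gaussian RBF collocation, assuming one thread per matrix entry evaluating phi(r) = exp(-(eps*r)^2) between collocation point i and center j; the names (rbfFill, eps, pts) are illustrative, not taken from GaussianRBF.

```cuda
// Fill the n x n collocation matrix A with Gaussian RBF values.
__global__ void rbfFill(const float3* pts, float* A, int n, float eps) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // collocation point
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // RBF center
    if (i >= n || j >= n) return;
    float dx = pts[i].x - pts[j].x;
    float dy = pts[i].y - pts[j].y;
    float dz = pts[i].z - pts[j].z;
    float r2 = dx * dx + dy * dy + dz * dz;          // squared distance
    A[(size_t)i * n + j] = expf(-eps * eps * r2);    // Gaussian kernel entry
}
```

Because every entry is independent, this phase maps naturally onto a 2D grid of threads, which is why the matrix fill is a good candidate for the GPU offload the abstract describes.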
Object-Oriented Implementation of the NAS Parallel Benchmarks using Charm++
NASA Technical Reports Server (NTRS)
Krishnan, Sanjeev; Bhandarkar, Milind; Kale, Laxmikant V.
1996-01-01
This report describes experiences with implementing the NAS Computational Fluid Dynamics benchmarks using a parallel object-oriented language, Charm++. Our main objective in implementing the NAS CFD kernel benchmarks was to develop a code that could be used to easily experiment with different domain decomposition strategies and dynamic load balancing. We also wished to leverage the object-orientation provided by the Charm++ parallel object-oriented language, to develop reusable abstractions that would simplify the process of developing parallel applications. We first describe the Charm++ parallel programming model and the parallel object array abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Triangular (LU) benchmarks, along with performance results. Finally we conclude with an evaluation of the methodology used.
GPU Accelerated Vector Median Filter
NASA Technical Reports Server (NTRS)
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n² vectors has to be compared with the other n² - 1 vectors in terms of distance. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has, to the best of our knowledge, never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x performance improvement of the vector median filter implementation on GPUs over CPU implementations, and further speedup is expected after more extensive optimization of the GPU algorithm.
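A hedged sketch of the core kernel for a 3 x 3 window: each thread gathers its window, computes each candidate's summed L2 distance to the other eight vectors, and writes the minimizer. This is the textbook vector median definition; the paper's optimized CUDA kernels are not reproduced here.

```cuda
// in/out: RGB images of size w x h stored as CUDA uchar3 vectors.
__global__ void vectorMedian3x3(const uchar3* in, uchar3* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
    uchar3 win[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)           // gather the 3x3 window
        for (int dx = -1; dx <= 1; ++dx)
            win[k++] = in[(y + dy) * w + (x + dx)];
    float best = 1e30f; int bestIdx = 0;
    for (int i = 0; i < 9; ++i) {              // candidate vector i
        float sum = 0.0f;
        for (int j = 0; j < 9; ++j) {          // distance to every other vector
            float dr = (float)win[i].x - win[j].x;
            float dg = (float)win[i].y - win[j].y;
            float db = (float)win[i].z - win[j].z;
            sum += sqrtf(dr * dr + dg * dg + db * db);
        }
        if (sum < best) { best = sum; bestIdx = i; }
    }
    out[y * w + x] = win[bestIdx];             // vector with minimal total distance
}
```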
Parallel asynchronous hardware implementation of image processing algorithms
NASA Technical Reports Server (NTRS)
Coon, Darryl D.; Perera, A. G. U.
1990-01-01
Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing stage. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.
Implementation of Parallel Computing Technology to Vortex Flow
NASA Technical Reports Server (NTRS)
Dacles-Mariani, Jennifer
1999-01-01
Mainframe supercomputers such as the Cray C90 were invaluable in obtaining large-scale computations using several million grid points to resolve salient features of a tip vortex flow over a lifting wing. However, real flight configurations require tracking not only the flow over several lifting wings but also its growth and decay in the near- and intermediate-wake regions, not to mention the interaction of these vortices with each other. Resolving and tracking the evolution and interaction of these vortices shed from complex bodies is computationally intensive. Parallel computing technology is an attractive option for solving these flows. In planetary science, vortical flows are also important in studying how planets and protoplanets form when cosmic dust and gases become gravitationally unstable and eventually form planets or protoplanets. The current paradigm for the formation of planetary systems maintains that the planets accreted from the nebula of gas and dust left over from the formation of the Sun. Traditional theory also indicates that such a preplanetary nebula took the form of a flattened disk. The coagulation of dust led to the settling of aggregates toward the midplane of the disk, where they grew further into asteroid-like planetesimals. Some of the issues still remaining in this process are the onset of gravitational instability, the role of turbulence in the damping of particles, and radial effects. In this study the focus will be on the role of turbulence and radial effects.
Fast quantum Monte Carlo on a GPU
NASA Astrophysics Data System (ADS)
Lutsyshyn, Y.
2015-02-01
We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
Implementing the lattice Boltzmann model on commodity graphics hardware
NASA Astrophysics Data System (ADS)
Kaufman, Arie; Fan, Zhe; Petkov, Kaloian
2009-06-01
Modern graphics processing units (GPUs) can perform general-purpose computations in addition to the native specialized graphics operations. Due to the highly parallel nature of graphics processing, the GPU has evolved into a many-core coprocessor that supports high data parallelism. Its performance has been growing at a rate of squared Moore's law, and its peak floating point performance exceeds that of the CPU by an order of magnitude. Therefore, it is a viable platform for time-sensitive and computationally intensive applications. The lattice Boltzmann model (LBM) computations are carried out via linear operations at discrete lattice sites, which can be implemented efficiently using a GPU-based architecture. Our simulations produce results comparable to the CPU version while improving performance by an order of magnitude. We have demonstrated that the GPU is well suited for interactive simulations in many applications, including simulating fire, smoke, lightweight objects in wind, jellyfish swimming in water, and heat shimmering and mirage (using the hybrid thermal LBM). We further advocate the use of a GPU cluster for large scale LBM simulations and for high performance computing. The Stony Brook Visual Computing Cluster has been the platform for several applications, including simulations of real-time plume dispersion in complex urban environments and thermal fluid dynamics in a pressurized water reactor. Major GPU vendors have been targeting the high performance computing market with GPU hardware implementations. Software toolkits such as NVIDIA CUDA provide a convenient development platform that abstracts the GPU and allows access to its underlying stream computing architecture. However, software programming for a GPU cluster remains a challenging task. We have therefore developed the Zippy framework to simplify GPU cluster programming. Zippy is based on global arrays combined with the stream programming model and it hides the low-level details of the
Parallel implementation of the biorthogonal multiresolution time-domain method
NASA Astrophysics Data System (ADS)
Zhu, Xianyang; Carin, Lawrence; Dogaru, Traian
2003-05-01
The three-dimensional biorthogonal multiresolution time-domain (Bi-MRTD) method is presented for both free-space and half-space scattering problems. The perfectly matched layer (PML) is used as an absorbing boundary condition. It has been shown that improved numerical-dispersion properties can be obtained with the use of smooth, compactly supported wavelet functions as the basis; here we employ the Cohen-Daubechies-Feauveau (CDF) biorthogonal wavelets. When a CDF-wavelet expansion is used, the spatial sampling rate can be reduced considerably compared with that of the conventional finite-difference time-domain (FDTD) method, implying that larger targets can be simulated without sacrificing accuracy. We implement the Bi-MRTD on a cluster of allocated-memory machines, using the message-passing interface (MPI), such that very large targets can be modeled. Numerical results are compared with analytical ones and with those obtained by use of the traditional FDTD method.
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation.
Mei, Gang; Tian, Hong
2016-01-01
This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First we redesign and improve our previous GPU implementation that was performed by exploiting the CUDA dynamic parallelism (CDP) feature. Then we implement three versions of GPU implementations, i.e., the naive version, the tiled version, and the improved CDP version, based upon five data layouts, including the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. We also carry out several groups of experimental tests to evaluate the impact. Experimental results show that the layouts AoS and AoaS achieve better performance than the layout SoA for both the naive version and the tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that for the two combined data layouts (the SoAoS and the Hybrid), there are no notable performance gains when compared to the other three basic layouts. We recommend that in practical applications, the layout AoaS is the best choice, since the tiled version is the fastest one among the three versions. The source code of all implementations is publicly available.
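The AoS-versus-SoA distinction the paper evaluates can be made concrete with a small sketch: under SoA, consecutive threads touch consecutive floats of one field, so accesses coalesce; under AoS they stride by the struct size. (Illustrative types only; the paper's AoaS variant additionally aligns the struct, e.g. with __align__(16).)

```cuda
struct PointAoS { float x, y, z, value; };     // Array of Structures

struct PointsSoA {                              // Structure of Arrays:
    float *x, *y, *z, *value;                   // one device array per field
};

__global__ void scaleAoS(PointAoS* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].value *= s;                // strided access (16-byte step)
}

__global__ void scaleSoA(PointsSoA p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.value[i] *= s;                // fully coalesced access
}
```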
NASA Astrophysics Data System (ADS)
Orjuela-Vargas, S. A.; Triana-Martinez, J.; Yañez, J. P.; Philips, W.
2014-03-01
Video analysis in real time requires fast and efficient algorithms to extract relevant information from a considerable number of frames per second, commonly 25. Furthermore, robust algorithms for outdoor visual scenes must retrieve corresponding features throughout the day, where a challenge is to deal with lighting changes. Currently, Local Binary Pattern (LBP) techniques are widely used for extracting features due to their robustness to illumination changes and the low requirements for implementation. We propose to compute an automatic threshold based on the distribution of the intensity residuals resulting from the pairwise comparisons when using LBP techniques. The intensity residuals distribution can be modelled by a Generalized Gaussian Distribution (GGD). In this paper we compute the adaptive threshold using the parameters of the GGD. We present a CUDA implementation of our proposed algorithm. We use the LBPSYM technique. Our approach is tested on videos of four different urban scenes with mobilities, captured during day and night. The extracted features can be used in a further step to determine patterns, identify objects or detect background. However, further research must be conducted for blurring correction, since the scenes at night are commonly blurred due to artificial lighting.
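A hedged sketch of the thresholded LBP pass: a neighbor sets its bit only when it exceeds the center pixel by more than thresh, the adaptive threshold that the paper derives from a Generalized Gaussian fit to the intensity residuals (that fitting step, and the LBPSYM variant, are omitted here; names are illustrative).

```cuda
// img: grayscale w x h image; codes: one 8-bit LBP code per pixel.
__global__ void lbp8(const unsigned char* img, unsigned char* codes,
                     int w, int h, float thresh) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};   // 8-neighborhood offsets
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    float c = img[y * w + x];
    unsigned char code = 0;
    for (int k = 0; k < 8; ++k) {
        float nb = img[(y + dy[k]) * w + (x + dx[k])];
        if (nb - c > thresh) code |= (1u << k);      // thresholded comparison
    }
    codes[y * w + x] = code;
}
```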
Cawkwell, M J; Wood, M A; Niklasson, Anders M N; Mniszewski, S M
2014-12-01
The algorithm developed in Cawkwell, M. J. et al. J. Chem. Theory Comput. 2012, 8, 4094 for the computation of the density matrix in electronic structure theory on a graphics processing unit (GPU) using the second-order spectral projection (SP2) method [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] has been efficiently parallelized over multiple GPUs on a single compute node. The parallel implementation provides significant speed-ups with respect to the single GPU version with no loss of accuracy. The performance and accuracy of the parallel GPU-based algorithm is compared with the performance of the SP2 algorithm and traditional matrix diagonalization methods on a multicore central processing unit (CPU).
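A hedged single-GPU sketch of the SP2 recursion using stock cuBLAS calls: starting from a Hamiltonian linearly mapped so its spectrum lies in [0,1], each step computes X^2 with a GEMM and takes either X <- X^2 or X <- 2X - X^2 so the trace is driven toward the electron count Ne. The branch test below is one common variant, and the trace is read back on the host for clarity; this is not the authors' multi-GPU code.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Trace computed on the host for clarity; a device reduction would be faster.
static float traceOf(const float* dX, int n) {
    float t = 0.0f, v;
    for (int i = 0; i < n; ++i) {
        cudaMemcpy(&v, dX + (size_t)i * n + i, sizeof(float),
                   cudaMemcpyDeviceToHost);
        t += v;
    }
    return t;
}

void sp2(cublasHandle_t h, float* dX, float* dX2, int n, float ne, int iters) {
    const float one = 1.f, zero = 0.f, two = 2.f, minusOne = -1.f;
    for (int it = 0; it < iters; ++it) {
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dX, n, dX, n, &zero, dX2, n);       // X2 = X * X
        if (traceOf(dX2, n) > ne) {
            cublasScopy(h, n * n, dX2, 1, dX, 1);             // X = X^2
        } else {
            cublasSscal(h, n * n, &two, dX, 1);               // X = 2X
            cublasSaxpy(h, n * n, &minusOne, dX2, 1, dX, 1);  // X = 2X - X^2
        }
    }
}
```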
SPECT reconstruction on the GPU
NASA Astrophysics Data System (ADS)
Vetter, Christoph; Westermann, Rüdiger
2008-03-01
With doctors relying increasingly on imaging procedures, it is not only visualization that needs to be optimized; the reconstruction of volumes from scanner output is another bottleneck. Accelerating the computationally intensive reconstruction process improves the medical workflow, matches the reconstruction speed to the acquisition speed, and allows fast batch processing and interactive or near-interactive parameter tuning. Recently, much effort has been focused on using the computational power of graphics processing units (GPUs) for general-purpose computations. This paper presents a GPU-accelerated implementation of single photon emission computed tomography (SPECT) reconstruction based on an ordered-subset expectation maximization algorithm. The algorithm uses models of the point-spread function (PSF) to improve spatial resolution in the reconstructed images. Instead of computing the PSF directly, it is modeled as an efficient blurring of slabs on the GPU to accelerate the process. The calculation of the accumulated attenuation factors, which allow the generated volume to be corrected for the attenuation properties of the volume, is likewise optimized for the GPU. Since these factors can be reused between iterations, a cache adapted to the size of the video memory is used, so that only the factors that do not fit in graphics memory must be recomputed. These improvements make the reconstruction of typical SPECT volumes near-interactive.
A two-level parallel direct search implementation for arbitrarily sized objective functions
Hutchinson, S.A.; Shadid, N.; Moffat, H.K.
1994-12-31
In the past, many optimization schemes for massively parallel computers have attempted to achieve parallel efficiency using one of two methods. In the case of large and expensive objective function calculations, the optimization itself may be run in serial and the objective function calculations parallelized. In contrast, if the objective function calculations are relatively inexpensive and can be performed on a single processor, then the optimization routine itself may be parallelized. In this paper, a scheme based upon the Parallel Direct Search (PDS) technique is presented which allows the objective function calculations to be done on an arbitrarily large number (p_2) of processors. If p, the number of processors available, is greater than or equal to 2p_2, then the optimization may be parallelized as well. This allows for efficient use of computational resources, since the objective function calculations can be performed on the number of processors that gives peak parallel efficiency, and further speedup can then be achieved by parallelizing the optimization. Results are presented for an optimization problem in which the objective function calculation involves solving a PDE with a finite-element algorithm. The optimum number of processors for the finite-element calculations is less than p/2, so the PDS method is also parallelized. Performance comparisons are given for an nCUBE 2 implementation.
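Recast in modern MPI terms (the original targeted the nCUBE 2; the value of p_2 and the communicator layout below are illustrative), the two-level structure amounts to splitting the world communicator into evaluation groups plus a leader communicator for the optimizer:

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, p, p2 = 4;                 // p2: processors per objective evaluation (assumed)
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm eval;                        // intra-evaluation communicator, p2 ranks each
        MPI_Comm_split(MPI_COMM_WORLD, rank / p2, rank, &eval);
        MPI_Comm opt;                         // optimizer communicator: one leader per group
        MPI_Comm_split(MPI_COMM_WORLD,
                       (rank % p2 == 0) ? 0 : MPI_UNDEFINED, rank, &opt);
        // Leaders would exchange PDS trial points over opt, broadcast them over
        // eval, and all ranks in eval would cooperate on one finite-element
        // objective evaluation.
        MPI_Finalize();
        return 0;
    }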
Parallel Latent Semantic Analysis using a Graphics Processing Unit
Cui, Xiaohui; Potok, Thomas E; Cavanagh, Joseph M
2009-01-01
Latent Semantic Analysis (LSA) can be used to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to compute the results quickly and easily on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor (CPU), so a deployable system using a GPU to speed up large-scale LSA processes is a much more effective choice (in terms of cost/performance ratio) than a computer cluster. In this paper, we present a parallel LSA implementation on the GPU, using NVIDIA's Compute Unified Device Architecture (CUDA) and the CUBLAS linear algebra library. Its performance is compared with a traditional CPU-based LSA implementation that uses an optimized Basic Linear Algebra Subprograms library. For large matrices whose dimensions are divisible by 16, the GPU algorithm ran five to six times faster than the CPU version.
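The heart of such a port is delegating dense products to CUBLAS. A minimal sketch (our example, not the paper's code) forming a document-document Gram matrix from a term-document matrix A, one building block of an SVD pipeline:

    #include <cublas_v2.h>

    // A is terms x docs, column-major on the device; computes C = A^T * A (docs x docs)
    void gram_matrix(cublasHandle_t h, const float* dA, int terms, int docs,
                     float* dC) {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, docs, docs, terms,
                    &one, dA, terms, dA, terms, &zero, dC, docs);
    }

The divisible-by-16 caveat in the results is consistent with CUBLAS tiling: dimensions that are multiples of the tile width keep every memory transaction and thread block fully utilized.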
NASA Astrophysics Data System (ADS)
Xiaoyong, Bai; Yingbo, He; Chengjun, Chen
2010-06-01
To make it easier to extend a finite element software framework with a contact implementation for transient solid dynamics analysis, we design a general-purpose, framework-oriented parallel contact class in this article. A parallel contact computation algorithm model is built from the contact schemes reported over the last two decades. The class integrates easily into an open-source platform without affecting the rest of the platform's code.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the principal techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD), an in silico method for simulating the physical motions of atoms and molecules under given conditions. MD has become a strategically important technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Owing to their complexity, MD calculations can require enormous amounts of computer memory and time, and their execution has therefore been a major problem. Despite the huge computational cost, molecular dynamics has traditionally been run on computers built around the central processing unit (CPU). Graphics processing unit (GPU) computing technology was originally designed to improve video games, by rapidly creating and displaying images in a frame buffer for output to a screen. The hybrid GPU-CPU implementation, combined with parallel computing, is a newer approach that supports a wide range of calculations, and GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe these methodologies, developed initially for video games, and how they are now applied in MD simulations. PMID:27525251
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
NASA Technical Reports Server (NTRS)
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024 x 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating-point performance, memory throughput and parallel processing capacity of the GTX 480.
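In kernel form, the storage strategy reads something like the following sketch (a toy first-order continuity update of ours, not the paper's second-order MUSCL-Hancock scheme): each conserved variable lives in its own global array and the x index varies fastest, so a warp's loads coalesce without any shared-memory staging.

    // rho: density, mx: x-momentum; one array per MHD variable in global memory
    __global__ void update_density(const float* rho, const float* mx,
                                   float* rho_out, int nx, int ny,
                                   float dt, float dx) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;  // x fastest -> coalesced
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx-1 || j >= ny-1) return;
        int id = j*nx + i;
        // toy centered update of d(rho)/dt = -d(mx)/dx; the real solver applies
        // a second-order MUSCL-Hancock reconstruction and Riemann solve
        rho_out[id] = rho[id] - dt/(2.0f*dx) * (mx[id+1] - mx[id-1]);
    }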
A GPU Accelerated Discontinuous Galerkin Conservative Level Set Method for Simulating Atomization
NASA Astrophysics Data System (ADS)
Jibben, Zechariah J.
This dissertation describes a process for interface capturing via an arbitrary-order, nearly quadrature-free, discontinuous Galerkin (DG) scheme for the conservative level set method (Olsson et al., 2005, 2008). The DG numerical method is used to solve both advection and reinitialization, and is executed on a refined level set grid (Herrmann, 2008) for effective use of processing power. Computation is executed in parallel on both CPU and GPU architectures to make the method feasible at high order. Finally, a sparse data structure is implemented to take full advantage of parallelism on the GPU, where performance relies on well-managed memory operations. With solution variables projected into a kth-order polynomial basis, a (k+1)-order convergence rate is found for both advection and reinitialization tests using the method of manufactured solutions. Other standard test cases, such as Zalesak's disk and the deformation of columns and spheres in periodic vortices, are also performed, showing several orders of magnitude improvement over traditional WENO level set methods. These tests also show the impact of reinitialization, which often increases shape and volume errors as a result of level set scalar trapping by normal vectors calculated from the local level set field. Accelerating advection via GPU hardware is found to provide a 30x speedup for an Nvidia Tesla K20 GPU relative to serial execution on a 2.0 GHz Intel Xeon E5-2620 CPU, with speedup factors increasing with polynomial degree until shared memory is filled. A similar algorithm is implemented for reinitialization; it relies more heavily on shared and global memory, fills them more quickly, and consequently produces a smaller speedup of 18x.
STEM image simulation with hybrid CPU/GPU programming.
Yao, Y; Ge, B H; Shen, X; Wang, Y G; Yu, R C
2016-07-01
STEM image simulation is performed via hybrid CPU/GPU programming under a parallel algorithmic architecture to speed up calculations on a personal computer (PC). To utilize the computing power of a PC fully, the simulation uses the GPU core and multiple CPU cores at the same time, significantly improving efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation.
GPU-accelerated micromagnetic simulations using cloud computing
NASA Astrophysics Data System (ADS)
Jermain, C. L.; Rowlands, G. E.; Buhrman, R. A.; Ralph, D. C.
2016-03-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
Giantsoudi, D; Schuemann, J; Dowdell, S; Paganetti, H; Jia, X; Jiang, S
2014-06-15
Purpose: For proton radiation therapy, Monte Carlo simulation (MCS) methods are recognized as the gold-standard dose calculation approach. Although previously unrealistic due to limitations in available computing power, GPU-based applications now allow MCS of proton treatment fields to be performed in routine clinical use, on time scales comparable to those of conventional pencil-beam algorithms. This study focuses on validating the results of our GPU-based code (gPMC) against a fully featured proton therapy MCS code (TOPAS) for clinical patient cases. Methods: Two treatment sites were selected to provide clinical cases for this study: head-and-neck cases, whose anatomical and geometrical complexity (air cavities and density heterogeneities) makes dose calculation very challenging, and prostate cases, due to the higher proton energies used and the close proximity of the treatment target to sensitive organs at risk. Both gPMC and TOPAS were used to calculate three-dimensional dose distributions for all patients in this study. Comparisons were performed based on target coverage indices (mean dose, V90 and D90) and gamma index distributions for 2% of the prescription dose and 2 mm. Results: For seven of the eight studied cases, mean target dose, V90 and D90 differed by less than 2% between the TOPAS and gPMC dose distributions. Gamma index analysis for all prostate patients resulted in a passing rate of more than 99% of voxels in the target. Four of the five head-and-neck cases showed target gamma index passing rates above 99%, with the fifth at 93%. Conclusion: Our current work showed excellent agreement between our GPU-based MCS code and a fully featured proton therapy MC code for a group of dosimetrically challenging patient cases.
The GENGA Code: Gravitational Encounters in N-body simulations with GPU Acceleration.
NASA Astrophysics Data System (ADS)
Grimm, Simon; Stadel, Joachim
2013-07-01
We present a GPU (Graphics Processing Unit) implementation of a hybrid symplectic N-body integrator based on the Mercury code (Chambers 1999), which handles close encounters with very good energy conservation. It uses a combination of mixed-variable integration (Wisdom & Holman 1991) and a direct N-body Bulirsch-Stoer method. GENGA is written in CUDA C and runs on NVIDIA GPUs. The code supports three simulation modes: integration of up to 2048 massive bodies, integration with up to a million test particles, or parallel integration of a large number of individual planetary systems. To achieve the best performance, GENGA runs completely on the GPU, where it can take advantage of the very fast, but limited, memory that exists there. All operations are performed in parallel, including close-encounter detection and the grouping of independent close-encounter pairs. Compared to Mercury, GENGA runs up to 30 times faster. Two applications of GENGA are presented: first, the dynamics of planetesimals and the late stage of rocky planet formation due to planetesimal collisions; second, a dynamical stability analysis of an exoplanetary system with an additional hypothetical super-Earth, which shows that in some multiple planetary systems additional super-Earths could exist without perturbing the dynamical stability of the other planets (Elser et al. 2013).
Analysis and selection of optimal function implementations in massively parallel computer
Archer, Charles Jens; Peters, Amanda; Ratterman, Joseph D.
2011-05-31
An apparatus, program product and method optimize the operation of a parallel computer system by, in part, collecting performance data for a set of implementations of a function capable of being executed on the parallel computer system based upon the execution of the set of implementations under varying input parameters in a plurality of input dimensions. The collected performance data may be used to generate selection program code that is configured to call selected implementations of the function in response to a call to the function under varying input parameters. The collected performance data may be used to perform more detailed analysis to ascertain the comparative performance of the set of implementations of the function under the varying input parameters.
NASA Astrophysics Data System (ADS)
Poghosyan, G.; Matta, S.; Streit, A.; Bejger, M.; Królak, A.
2015-03-01
The parallelization, design and scalability of the PolGrawAllSky code, which searches for periodic gravitational waves from rotating neutron stars, are discussed. The code is based on an efficient implementation of the F-statistic using the Fast Fourier Transform algorithm. To analyze data from the advanced LIGO and Virgo gravitational-wave detector network, which will start operating in 2015, hundreds of millions of CPU hours will be required; a code that exploits massively parallel supercomputers is therefore mandatory. We have parallelized the code using the Message Passing Interface standard and implemented a mechanism for combining the searches at different sky positions and frequency bands into one extremely scalable program. A parallel I/O interface is used to avoid bottlenecks when writing the generated data to the file system. This allowed us to develop a highly scalable code that enables data analysis at large scales on acceptable time scales. Benchmarking on a Cray XE6 system demonstrates the efficiency of our parallelization concept and scaling up to 50 thousand cores.
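The parallel-I/O pattern described amounts to each rank writing its slice of results at a disjoint file offset with collective MPI-IO. A minimal sketch (the file name and fixed-size records are our assumptions):

    #include <mpi.h>

    // each rank writes `count` floats of results at its own offset, collectively
    void write_band(const float* buf, int count, int rank) {
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "fstat_results.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(float);
        MPI_File_write_at_all(fh, off, buf, count, MPI_FLOAT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

Collective writes let the MPI library aggregate requests instead of funneling tens of thousands of independent writes through the file system, which is the bottleneck the abstract refers to.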
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit
NASA Astrophysics Data System (ADS)
Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.
2013-12-01
Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in its analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral thermal infrared imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI, in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near-line-center molecular absorption, which constitute the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30x were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch-switching operations. It involves a large set of differential and algebraic equations that is computationally intensive and challenging to solve with single-processor dynamic simulation solutions. High-performance computing (HPC) based parallelism is a promising way to speed up the computation and facilitate the simulation process. This paper presents two parallel implementations of power grid dynamic simulation: one using Open Multi-processing (OpenMP) on a shared-memory platform, and one using the Message Passing Interface (MPI) on distributed-memory clusters. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performance for running parallel dynamic simulation is compared and demonstrated.
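The contrast between the two models can be seen in miniature below (our toy explicit update step, not the paper's solver): OpenMP splits a loop over shared state, while MPI updates local state and exchanges it explicitly each step.

    #include <mpi.h>
    #include <omp.h>

    // shared memory: one loop over all n states, split across threads
    void step_openmp(double* x, const double* f, int n, double h) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) x[i] += h * f[i];
    }

    // distributed memory: update the local block, then exchange the global state
    void step_mpi(double* xloc, const double* floc, int nloc, double h,
                  double* xall) {
        for (int i = 0; i < nloc; ++i) xloc[i] += h * floc[i];
        MPI_Allgather(xloc, nloc, MPI_DOUBLE, xall, nloc, MPI_DOUBLE,
                      MPI_COMM_WORLD);
    }

The explicit exchange is precisely the overhead that separates the two architectures in the paper's performance comparison.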
Same-source parallel implementation of the PSU/NCAR MM5
Michalakes, J.
1997-12-31
The Pennsylvania State/National Center for Atmospheric Research Mesoscale Model is a limited-area model of atmospheric systems, now in its fifth generation, MM5. Designed and maintained for vector and shared-memory parallel architectures, the official version of MM5 does not run on message-passing distributed memory (DM) parallel computers. The authors describe a same-source parallel implementation of the PSU/NCAR MM5 using FLIC, the Fortran Loop and Index Converter. The resulting source is nearly line-for-line identical with the original source code. The result is an efficient distributed memory parallel option to MM5 that can be seamlessly integrated into the official version.
A parallel implementation of an EBE solver for the finite element method
Silva, R.P.; Las Casas, E.B.; Carvalho, M.L.B.
1994-12-31
A parallel implementation, using PVM on a cluster of workstations, of an element-by-element (EBE) solver based on the preconditioned conjugate gradient (PCG) method is described, along with an application to the solution of the linear systems generated by finite element analysis of a problem in three-dimensional linear elasticity. The PVM (Parallel Virtual Machine) system, developed at Oak Ridge National Laboratory, allows the construction of a parallel MIMD machine by connecting heterogeneous computers through a network. In this implementation, version 3.1 of PVM is used, with 11 Sun SLC workstations and a Sun SPARC-2 connected through Ethernet. The finite element program is based on SDP, the System for Finite Element Based Software Development, created at the Brazilian National Laboratory for Scientific Computation (LNCC). SDP provides the basic routines for a finite element application program, as well as a standard for programming and documentation, intended to allow exchanges between research groups in different centers.
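The defining trait of an EBE solver is that the global stiffness matrix is never assembled: the PCG matrix-vector product is accumulated element by element. A serial sketch of that product (our illustration; the 24-dof element type is hypothetical, and in the paper each PVM worker would sum its own subset of elements before a global reduction):

    #include <string.h>

    typedef struct { int dof[24]; double ke[24][24]; } Elem;  // hypothetical element

    // Ap = A*p computed without assembling A: loop over element matrices ke
    void ebe_matvec(const Elem* els, int ne, const double* p, double* Ap, int n) {
        memset(Ap, 0, n * sizeof(double));
        for (int e = 0; e < ne; ++e)
            for (int i = 0; i < 24; ++i)
                for (int j = 0; j < 24; ++j)
                    Ap[els[e].dof[i]] += els[e].ke[i][j] * p[els[e].dof[j]];
    }

The PCG iteration then calls ebe_matvec wherever it would otherwise multiply by the assembled matrix, which keeps memory per worker proportional to its share of elements.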
A parallel implementation of a multisensor feature-based range-estimation method
NASA Technical Reports Server (NTRS)
Suorsa, Raymond E.; Sridhar, Banavar
1993-01-01
There are many proposed vision-based methods to perform obstacle detection and avoidance for autonomous or semi-autonomous vehicles. All methods, however, require very high processing rates to achieve real-time performance. A system capable of supporting autonomous helicopter navigation will need to extract obstacle information from imagery at rates varying from ten frames per second to thirty or more, depending on the vehicle speed, and will need to sustain billions of operations per second. To reach such high processing rates with current technology, a parallel implementation of the obstacle detection/ranging method is required. This paper describes an efficient and flexible parallel implementation of a multisensor feature-based range-estimation algorithm, targeted at helicopter flight and realized on both a distributed-memory and a shared-memory parallel computer.
NASA Astrophysics Data System (ADS)
Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio
2011-11-01
Anomaly detection is an important task in remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance, and although it is computationally expensive when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in a global or local fashion. This aspect is crucial because the choice of a local or global strategy for computing the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of RX for multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability on different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristic (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible/Infrared Imaging Spectrometer over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.
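For the global strategy, the dominant cost is the sample covariance over all pixels. A minimal OpenMP sketch of our own (not the paper's code) with per-thread partial sums merged at the end:

    #include <omp.h>
    #include <stdlib.h>
    #include <string.h>

    // X: npix x nb image, one row per pixel spectrum; outputs mean and covariance
    void global_cov(const double* X, int npix, int nb, double* mean, double* cov) {
        memset(mean, 0, nb * sizeof(double));
        memset(cov, 0, (size_t)nb * nb * sizeof(double));
        #pragma omp parallel
        {
            double* m = (double*)calloc(nb, sizeof(double));
            double* c = (double*)calloc((size_t)nb * nb, sizeof(double));
            #pragma omp for nowait
            for (int p = 0; p < npix; ++p)
                for (int i = 0; i < nb; ++i) {
                    m[i] += X[(size_t)p*nb + i];
                    for (int j = 0; j < nb; ++j)
                        c[(size_t)i*nb + j] += X[(size_t)p*nb + i] * X[(size_t)p*nb + j];
                }
            #pragma omp critical
            {
                for (int i = 0; i < nb; ++i) mean[i] += m[i];
                for (size_t k = 0; k < (size_t)nb*nb; ++k) cov[k] += c[k];
            }
            free(m); free(c);
        }
        for (int i = 0; i < nb; ++i) mean[i] /= npix;
        for (int i = 0; i < nb; ++i)
            for (int j = 0; j < nb; ++j)
                cov[(size_t)i*nb + j] = cov[(size_t)i*nb + j]/npix - mean[i]*mean[j];
    }

The per-pixel RX statistic, delta(x) = (x - mu)^T K^{-1} (x - mu), then parallelizes trivially once K has been factored.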
GPU Accelerated Numerical Simulation of Viscous Flow Down a Slope
NASA Astrophysics Data System (ADS)
Gygax, Remo; Räss, Ludovic; Omlin, Samuel; Podladchikov, Yuri; Jaboyedoff, Michel
2014-05-01
Numerical simulations are an effective tool in natural risk analysis. They are useful for determining the propagation and runout distance of gravity-driven movements such as debris flows or landslides. To evaluate these processes, we consider analogue laboratory experiments together with a GPU-accelerated numerical simulation of a viscous liquid flowing down an inclined slope. The physical processes underlying large gravity-driven flows share certain aspects with the propagation of debris mass in a rockslide and the spreading of water waves. Several studies have shown that numerical implementations of the physics of viscous flow fit laboratory observations well, both quantitatively and qualitatively. For a process this well explored, we can concentrate on its numerical transcription and on running the code in a GPU-accelerated environment to obtain a 3D simulation. The objective of providing a high-resolution numerical solution through NVIDIA CUDA parallel processing is to increase both the speed of the simulation and the accuracy of the prediction. The main goal is to write an easily adaptable and compact code on the widely used MATLAB platform, to be translated to C-CUDA for higher resolution and processing speed on an NVIDIA graphics card cluster. The numerical model, based on a finite difference scheme, is compared to analogue laboratory experiments; in this way the model parameters are adjusted to reproduce the movements observed in high-speed camera acquisitions during the experiments.
Fast computer simulation of reconstructed image from rainbow hologram based on GPU
NASA Astrophysics Data System (ADS)
Shuming, Jiao; Yoshikawa, Hiroshi
2015-10-01
A fast GPU-based computer simulation solution for rainbow hologram reconstruction is proposed. In the commonly used segment Fourier transform method for rainbow hologram reconstruction, computing a 2D Fourier transform on each hologram segment is very time-consuming. GPU-based parallel computing can be applied to improve the computing speed. Simulation results indicate that, compared with CPU computing, the proposed GPU implementation reduces the computation time by as much as a factor of eight.
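The natural GPU formulation of the bottleneck step is a batched 2D FFT over all segments at once. A minimal cuFFT sketch (our illustration; the segment size and count are assumptions, not values from the paper):

    #include <cufft.h>

    // one plan covering all hologram segments, laid out contiguously in memory
    cufftHandle make_segment_plan(int seg_h, int seg_w, int nsegs) {
        cufftHandle plan;
        int n[2] = { seg_h, seg_w };
        cufftPlanMany(&plan, 2, n, NULL, 1, seg_h*seg_w,
                      NULL, 1, seg_h*seg_w, CUFFT_C2C, nsegs);
        return plan;
    }
    // usage: cufftExecC2C(plan, d_segments, d_spectra, CUFFT_FORWARD);

Batching amortizes kernel-launch overhead and lets the library schedule all segment transforms concurrently, which is where the reported speedup over a segment-by-segment CPU loop comes from.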
Farber, R.M.; Lapedes, A.S.; Rico-Martinez, R.; Kevrekidis, I.G.
1993-12-31
Time-delay mappings constructed using neural networks have proven successful in performing nonlinear system identification; however, because of their discrete nature, their use in bifurcation analysis of continuous-time systems is limited. This shortcoming can be avoided by embedding the neural networks in a training algorithm that mimics a numerical integrator. Both explicit and implicit integrators can be used. The former case is based on repeated evaluations of the network in a feedforward implementation; the latter relies on a recurrent network implementation. Here the algorithms and their implementation on parallel machines (SIMD and MIMD architectures) are discussed.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-21
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
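A hedged sketch of a scatter-style kernel superposition of the kind described: each thread takes one pencil-beam deposition point and scatters its dose to nearby voxels through a radial kernel, with atomics resolving write collisions. The Gaussian kernel shape and truncation radius are our toy stand-ins, not the engine's clinical kernel.

    // deps: x,y,z voxel coordinates in .x/.y/.z, deposited energy in .w
    __global__ void scatter_dose(const float4* deps, int ndep, float* dose,
                                 int nx, int ny, int nz, float sigma) {
        int t = blockIdx.x*blockDim.x + threadIdx.x;
        if (t >= ndep) return;
        float4 d = deps[t];
        const int R = 3;                        // truncation radius in voxels
        for (int dz = -R; dz <= R; ++dz)
          for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx) {
                int x = (int)d.x + dx, y = (int)d.y + dy, z = (int)d.z + dz;
                if (x < 0 || y < 0 || z < 0 || x >= nx || y >= ny || z >= nz)
                    continue;
                float r2 = (float)(dx*dx + dy*dy + dz*dz);
                float w = expf(-r2 / (2.f*sigma*sigma));  // unnormalized toy kernel
                atomicAdd(&dose[(z*ny + y)*nx + x], d.w * w);
            }
    }

A scatter formulation loops over sources rather than over destination voxels; it trades atomic contention for skipping the empty space a gather would have to scan, consistent with the fast scatter-based superposition the abstract reports.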
Efficient GPU Acceleration for Integrating Large Thermonuclear Networks in Astrophysics
NASA Astrophysics Data System (ADS)
Guidry, Mike
2016-02-01
We demonstrate the systematic implementation of recently developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. We take as representative test cases Type Ia supernova explosions with extremely stiff thermonuclear reaction networks having 150-365 isotopic species and 1600-4400 reactions, assumed coupled to hydrodynamics using operator splitting. In such examples we demonstrate the capability to integrate independent thermonuclear networks from ~250-500 hydro zones (assumed to be deployed on CPU cores) in parallel on a single GPU in the same wall-clock time that standard implicit methods require to integrate the network for a single zone. This increase in efficiency of two or more orders of magnitude for solving systems of realistic thermonuclear networks coupled to fluid dynamics implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications, we discuss our ongoing deployment of these new methods for Type Ia supernova explosions in astrophysics and for simulation of the complex atmospheric chemistry entering into weather and climate problems.
Kerr, J.P.
1992-12-31
The objective of this study was to determine the feasibility of using an artificial neural network (ANN), in particular a backpropagation ANN, to improve the speed and quality of the reconstruction of three-dimensional SPECT (single photon emission computed tomography) images. In addition, since the processing elements (PEs) in each layer of an ANN are independent of each other, the speed and efficiency of the neural network architecture could be better optimized by implementing the ANN on a massively parallel computer. The specific goals of this research were: to implement a fully interconnected backpropagation neural network on a serial computer and a SIMD parallel computer; to identify any reduction in the time required to train these networks on the parallel machine versus the serial machine; to determine whether these neural networks can learn to recognize SPECT data by training them on a section of an actual SPECT image; and to determine from the knowledge gained in this research whether full SPECT image reconstruction by an ANN implemented on a parallel computer is feasible, both in the time required to train the network and in the quality of the images reconstructed.
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike W.
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
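The deployment pattern is one network per thread block, many zones per GPU. A hedged CUDA sketch of our own (the rates() body is a placeholder, and the naive Euler step stands in for the stabilized explicit updates with adaptive timesteps that the fast algorithms actually use):

    #define NSPEC 150   // isotopic species per network, as in the test case

    __device__ void rates(const float* Y, float* F) {
        // placeholder: the real code sums contributions from ~1600 reactions
        for (int i = threadIdx.x; i < NSPEC; i += blockDim.x) F[i] = -Y[i];
    }

    __global__ void integrate_zones(float* Yall, int nsteps, float dt) {
        __shared__ float Y[NSPEC], F[NSPEC];
        float* myY = Yall + blockIdx.x * NSPEC;        // one network per block
        for (int i = threadIdx.x; i < NSPEC; i += blockDim.x) Y[i] = myY[i];
        __syncthreads();
        for (int s = 0; s < nsteps; ++s) {
            rates(Y, F);
            __syncthreads();
            for (int i = threadIdx.x; i < NSPEC; i += blockDim.x) Y[i] += dt * F[i];
            __syncthreads();
        }
        for (int i = threadIdx.x; i < NSPEC; i += blockDim.x) myY[i] = Y[i];
    }

Launching with one block per hydro zone is what lets of order 100 networks integrate in the wall-clock time a CPU spends on one.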
Ma, S.; Chronopoulos, A.T. )
1990-01-01
This paper reports on the restructuring of three leading iterative methods for large sparse nonsymmetric linear systems: CGS (conjugate gradient squared), CRS (conjugate residual squared), and Orthomin(k). The restructured methods are better suited to vector and parallel processing. The authors implemented these methods on a parallel vector system. The linear systems for the numerical tests are obtained by discretizing four two-dimensional elliptic partial differential equations with finite difference and finite element methods. A vectorizable and parallelizable version of incomplete LU preconditioning is used. The subroutines were restructured to enhance data locality on vector machines with a storage hierarchy. Speedup was measured for multitasking on four processors.
Parallel implementation of the time-evolving block decimation algorithm for the Bose-Hubbard model
NASA Astrophysics Data System (ADS)
Urbanek, Miroslav; Soldán, Pavel
2016-02-01
A system of ultracold atoms in an optical lattice represents a powerful experimental setup for testing the fundamentals of quantum mechanics. While its microscopic interaction mechanisms are well understood, the system's behavior for a moderate number of particles is difficult to simulate due to the high dimension of its many-body space. This article presents TEBDOL, a parallel implementation of the time-evolving block decimation (TEBD) algorithm that can efficiently simulate the time evolution of a one-dimensional chain of atoms in optical lattices. We investigate the parallelization strategy and the strong and weak scaling with the number of processes.
A Massively Parallel Adaptive Fast Multipole Method on Heterogeneous Architectures
Lashuk, Ilya; Chandramowlishwaran, Aparna; Langston, Harper; Nguyen, Tuan-Anh; Sampath, Rahul S; Shringarpure, Aashay; Vuduc, Richard; Ying, Lexing; Zorin, Denis; Biros, George
2012-01-01
We describe a parallel fast multipole method (FMM) for highly nonuniform distributions of particles. We employ both distributed memory parallelism (via MPI) and shared memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores on the AMD/CRAY-based Jaguar system at ORNL. On a GPU-enabled system (NSF's Keeneland at Georgia Tech/ORNL), we observed 30x speedup over a single core CPU and 7x speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
GPU-optimized Code for Long-term Simulations of Beam-beam Effects in Colliders
Roblin, Yves; Morozov, Vasiliy; Terzic, Balsa; Aturban, Mohamed A.; Ranjan, D.; Zubair, Mohammed
2013-06-01
We report on the development of a new code for long-term simulation of beam-beam effects in particle colliders. The underlying physical model relies on matrix-based arbitrary-order symplectic particle tracking for beam transport and the Bassetti-Erskine approximation for the beam-beam interaction. The computations are accelerated through a parallel implementation on a hybrid GPU/CPU platform. With the new code, previously computationally prohibitive long-term simulations become tractable. We use the new code to model the proposed medium-energy electron-ion collider (MEIC) at Jefferson Lab.
A fast and memory-sparing probabilistic selection algorithm for the GPU
Monroe, Laura M; Wendelberger, Joanne; Michalak, Sarah
2010-09-29
A fast and memory-sparing probabilistic top-N selection algorithm is implemented on the GPU. This probabilistic algorithm gives a deterministic result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces both the memory requirements and the average time required for the algorithm. This algorithm is well-suited to more general parallel processors with multiple layers of memory hierarchy. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be especially useful for processors having a limited amount of fast memory available.
Massively parallel implementation of the Penn State/NCAR Mesoscale Model
Foster, I.; Michalakes, J.
1992-12-01
Parallel computing promises significant improvements in both the raw speed and the cost performance of mesoscale atmospheric models. On the distributed-memory massively parallel computers available today, the performance of a mesoscale model will exceed that of conventional supercomputers; on the teraflops machines expected within the next five years, performance will increase by several orders of magnitude. As a result, scientists will be able to consider larger problems, more complex model processes, and finer resolutions. In this paper, we report on a project at Argonne National Laboratory that will allow scientists to take advantage of parallel computing technology. This Massively Parallel Mesoscale Model (MPMM) will be functionally equivalent to the Penn State/NCAR Mesoscale Model (MM). In a prototype study, we produced a parallel version of MM4 using a static (compile-time) coarse-grained "patch" decomposition. This code achieves one-third the performance of a one-processor CRAY Y-MP on twelve Intel i860 microprocessors. The current version of MPMM is based on MM5 and uses a more fine-grained approach, decomposing the grid as finely as the mesh itself allows, so that each horizontal grid cell is a parallel process. This will allow the code to utilize many hundreds of processors. A high-level language for expressing parallel programs is used to implement communication streams between the processes in a way that permits dynamic remapping to the physical processors of a particular parallel computer. This facilitates load balancing, grid nesting, and coupling with graphical systems and other models.
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.
Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart
2011-06-01
The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework that delivers a two-orders-of-magnitude speedup at PC-level cost. Specific details of implementing the FDTD method on the GPU architecture are presented, and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging a 25% improvement in sensitivity and a 20% reduction in loop coupling compared with conventional array structures of the same size) for small-animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied to both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
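Such ports center on kernels of the following kind; a hedged sketch (ours, not the paper's code) of the 2D TMz Yee update for Ez from the curl of H, with one kernel per field-update phase:

    // Ez update on the staggered Yee grid; cez folds dt and material constants
    __global__ void update_ez(float* Ez, const float* Hx, const float* Hy,
                              int nx, int ny, float cez) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx || j >= ny) return;
        int id = j*nx + i;
        Ez[id] += cez * ((Hy[id] - Hy[id-1]) - (Hx[id] - Hx[id-nx]));
    }

Every grid point updates independently from its neighbors' previous-half-step values, which is why FDTD maps so cleanly onto thousands of GPU threads.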
A parallel implementation of the Cellular Potts Model for simulation of cell-based morphogenesis
Chen, Nan; Glazier, James A.; Izaguirre, Jesús A.; Alber, Mark S.
2007-01-01
The Cellular Potts Model (CPM) has been used in a wide variety of biological simulations. However, most current CPM implementations use a sequential modified Metropolis algorithm which restricts the size of simulations. In this paper we present a parallel CPM algorithm for simulations of morphogenesis, which includes cell–cell adhesion, a cell volume constraint, and cell haptotaxis. The algorithm uses appropriate data structures and checkerboard subgrids for parallelization. Communication and updating algorithms synchronize properties of cells simulated on different processor nodes. Tests show that the parallel algorithm has good scalability, permitting large-scale simulations of cell morphogenesis (10^7 or more cells) and broadening the scope of CPM applications. The new algorithm satisfies the balance condition, which is sufficient for convergence of the underlying Markov chain. PMID:18084624
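The checkerboard idea, transplanted to a GPU for illustration (our sketch, not the paper's multiprocessor code): sites of one parity are updated together, so no two concurrent Metropolis attempts touch adjacent sites. The flip-acceptance energy here is a placeholder for the CPM Hamiltonian.

    #include <curand_kernel.h>

    __global__ void cpm_sweep(int* spin, int nx, int ny, int parity,
                              curandState* rng) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        if (i >= nx || j >= ny || ((i + j) & 1) != parity) return;
        int id = j*nx + i;
        curandState st = rng[id];
        int src = spin[((j + 1) % ny)*nx + i];  // toy copy attempt from a neighbor
        float dE = 0.f;                          // placeholder: CPM energy change
        if (dE <= 0.f || curand_uniform(&st) < expf(-dE)) spin[id] = src;
        rng[id] = st;
    }

    // usage: launch once with parity 0, synchronize, then with parity 1

Alternating parities preserves the detailed-balance-style correctness property (the balance condition) that the abstract highlights.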
A parallel real-time computing-cluster implementation of spotlight SAR processing
NASA Astrophysics Data System (ADS)
Mathew, Bipin; Rabinkin, Daniel
2005-05-01
The high-resolution imaging capability of Synthetic Aperture Radar (SAR) is largely unaffected by atmospheric conditions and has proven to be an indispensable asset in a variety of military and civilian applications. Applying SAR methodology to real-time imaging, however, carries with it the large computational complexity and storage requirements of the image-forming algorithms. Recently, the rapidly diminishing cost of computing hardware and the related ascent of cluster-based computing have made parallelization of these algorithms an appealing area of investigation. This paper describes a parallel SAR processor developed at MIT Lincoln Laboratory. Several novel technologies were employed in its implementation, including pMatlab, a parallel extension of standard Matlab that is also being developed at MIT Lincoln Laboratory. These technologies are described later in the document. We begin with a brief description of the basic SAR algorithm.
NASA Astrophysics Data System (ADS)
Martin-Bragado, Ignacio; Abujas, J.; Galindo, P. L.; Pizarro, J.
2015-06-01
An adaptation of the synchronous parallel Kinetic Monte Carlo (spKMC) algorithm developed by Martinez et al. (2008) to the existing KMC code MMonCa (Martin-Bragado et al. 2013) is presented in this work. Two cases, general enough to provide an idea of the current state-of-the-art in parallel KMC, are presented: Object KMC simulations of the evolution of damage in irradiated iron, and Lattice KMC simulations of epitaxial regrowth of amorphized silicon. The results allow us to state that (a) the parallel overhead is critical, and severely degrades the performance of the simulator when it is comparable to the CPU time consumed per event, (b) the balance between domains is important, but not critical, (c) the algorithm and its implementation are correct and (d) further improvements are needed for spKMC to become a general, all-working solution for KMC simulations.
NASA Astrophysics Data System (ADS)
Xu, Dexiang
This dissertation presents a novel method of designing finite word length Finite Impulse Response (FIR) digital filters using a Real Parameter Parallel Genetic Algorithm (RPPGA). This algorithm is derived from basic Genetic Algorithms which are inspired by natural genetics principles. Both experimental results and theoretical studies in this work reveal that the RPPGA is a suitable method for determining the optimal or near optimal discrete coefficients of finite word length FIR digital filters. Performance of RPPGA is evaluated by comparing specifications of filters designed by other methods with filters designed by RPPGA. The parallel and spatial structures of the algorithm result in faster and more robust optimization than basic genetic algorithms. A filter designed by RPPGA is implemented in hardware to attenuate high frequency noise in a data acquisition system for collecting seismic signals. These studies may lead to more applications of the Real Parameter Parallel Genetic Algorithms in Electrical Engineering.
GPU Linear algebra extensions for GNU/Octave
NASA Astrophysics Data System (ADS)
Bosi, L. B.; Mariotti, M.; Santocchia, A.
2012-06-01
Octave is one of the most widely used open-source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program the GPU kernels, which are bound into the GNU/Octave framework. We report the project's implementation design and some preliminary performance results.
Design and implementation of the parallel processing system of multi-channel polarization images
NASA Astrophysics Data System (ADS)
Li, Zhi-yong; Huang, Qin-chao
2013-08-01
Compared with traditional optical intensity image processing, polarization image processing has two main problems: the amount of data is larger, and the processing tasks are more complex. To resolve these problems, a parallel processing system for multi-channel polarization images is designed using a multi-DSP technique. It contains a communication control unit (CCU) and a data processing array (DPA). The CCU controls communications inside and outside the system; its logic is implemented in an FPGA chip. The DPA is made up of four Digital Signal Processor (DSP) chips interlinked by a loose-coupling method. The DPA implements processing tasks, including image registration and image synthesis, by parallel processing methods. The polarization-image parallel processing model is designed on multiple levels, including the system task, the algorithm, and the operation; its program is written in assembly language. In the experiment, the polarization image resolution is 782x582 pixels and the pixel data length is 12 bits. After receiving 3 channels of polarization images simultaneously, the system performs the parallel tasks needed to acquire the target polarization characteristics. Experimental results show that the system has good real-time performance and reliability: the processing time for image registration is 293.343 ms with a registration accuracy of 0.5 pixel, and the processing time for image synthesis is 3.199 ms.
NASA Technical Reports Server (NTRS)
Tilton, James C.; Plaza, Antonio J. (Editor); Chang, Chein-I. (Editor)
2008-01-01
The hierarchical image segmentation algorithm (referred to as HSEG) is a hybrid of hierarchical step-wise optimization (HSWO) and constrained spectral clustering that produces a hierarchical set of image segmentations. HSWO is an iterative approach to region-growing segmentation in which the optimal image segmentation is found at N_R regions, given a segmentation at N_(R+1) regions. HSEG's addition of constrained spectral clustering makes it a computationally intensive algorithm for all but the smallest of images. To counteract this, a computationally efficient recursive approximation of HSEG (called RHSEG) has been devised. Further improvements in processing speed are obtained through a parallel implementation of RHSEG. This chapter describes this parallel implementation and demonstrates its computational efficiency on a Landsat Thematic Mapper test scene.
NASA Astrophysics Data System (ADS)
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2016-03-01
This work discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package authored at Oak Ridge National Laboratory. Shift has been developed to scale well from laptops to small computing clusters to advanced supercomputers and includes features such as support for multiple geometry and physics engines, hybrid capabilities for variance reduction methods such as the Consistent Adjoint-Driven Importance Sampling methodology, advanced parallel decompositions, and tally methods optimized for scalability on supercomputing architectures. The scaling studies presented in this paper demonstrate good weak and strong scaling behavior for the implemented algorithms. Shift has also been validated and verified against various reactor physics benchmarks, including the Consortium for Advanced Simulation of Light Water Reactors' Virtual Environment for Reactor Analysis criticality test suite and several Westinghouse AP1000® problems presented in this paper. These benchmark results compare well to those from other contemporary Monte Carlo codes such as MCNP5 and KENO.
Constant-time parallel sorting algorithm and its optical implementation using smart pixels
NASA Astrophysics Data System (ADS)
Louri, Ahmed; Hatch, James A., Jr.; Na, Jongwhoa
1995-06-01
Sorting is a fundamental operation that has important implications in a vast number of areas. For instance, sorting is heavily utilized in applications such as database machines, in which hashing techniques are used to accelerate data-processing algorithms. It is also the basis for interprocessor message routing and has strong implications in video telecommunications. However, high-speed electronic sorting networks are difficult to implement with VLSI technology because of the dense, global connectivity required. Optics eliminates this bottleneck by offering global interconnects, massive parallelism, and noninterfering communications. We present a parallel sorting algorithm and its efficient optical implementation. The algorithm sorts n data elements in a few steps, independent of the number of elements to be sorted. Thus it is a constant-time sorting algorithm [i.e., O(1) time]. We also estimate the system's performance to show that the proposed sorting algorithm can provide at least 2 orders of magnitude improvement in execution time over conventional electronic algorithms.
FLIC: A translator for same-source parallel implementation of regular grid applications
Michalakes, J.
1997-02-01
FLIC, a Fortran loop and index converter, is a parser-based source translation tool that automates the conversion of program loops and array indices for distributed-memory parallel computers. This conversion is important in the implementation of gridded models on distributed memory because it allows for decomposition and shrinking of model data structures. FLIC does not provide the parallel services itself, but rather provides an automated and transparent mapping of the source code to calls or directives of the user's choice of run-time systems or parallel libraries. The amount of user-supplied input required by FLIC to direct the conversion is small enough to fit as command line arguments for the tool. The tool requires no additional statements, comments, or directives in the source code, thus avoiding the pervasiveness and intrusiveness imposed by directives-based preprocessors and parallelizing compilers. FLIC is lightweight and suitable for use as a precompiler and facilitates a same-source approach to operability on diverse computer architectures. FLIC is targeted to new or existing applications that employ regular gridded domains, such as weather models, that will be parallelized by data-domain decomposition.
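The essence of the conversion is rewriting global loop bounds into locally owned sub-ranges. The before/after pair below is illustrative only: FLIC itself rewrites Fortran, and the local range would come from the user's chosen run-time library rather than the hand-written parameters shown here.

```cuda
/* Original, serial form: iterate over the full global index range. */
void relaxGlobal(float* u, const float* f, int n)
{
    for (int i = 1; i < n - 1; ++i)
        u[i] = 0.5f * (u[i - 1] + u[i + 1]) - f[i];
}

/* Converted form: each process iterates only over its decomposed
 * range [lo, hi), supplied by the run-time system or parallel library. */
void relaxLocal(float* u, const float* f, int lo, int hi)
{
    for (int i = lo; i < hi; ++i)
        u[i] = 0.5f * (u[i - 1] + u[i + 1]) - f[i];
}
```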
Holkundkar, Amol R.
2013-11-15
The objective of this article is to report the parallel implementation of a 3D molecular dynamics simulation code for laser-cluster interactions. The code has been benchmarked by comparing the simulation results with some of the experiments reported in the literature. Scaling laws for the computational time are established by varying the number of processor cores and the number of macroparticles used. The capabilities of the code are highlighted by implementing various diagnostic tools. An executable version of the code is available from the author for studying the dynamics of laser-cluster interactions.
Parallel fixed point implementation of a radial basis function network in an FPGA.
de Souza, Alisson C D; Fernandes, Marcelo A C
2014-01-01
This paper proposes a parallel fixed point radial basis function (RBF) artificial neural network (ANN), implemented in a field programmable gate array (FPGA) trained online with a least mean square (LMS) algorithm. The processing time and occupied area were analyzed for various fixed point formats. The problems of precision of the ANN response for nonlinear classification using the XOR gate and interpolation using the sine function were also analyzed in a hardware implementation. The entire project was developed using the System Generator platform (Xilinx), with a Virtex-6 xc6vcx240t-1ff1156 as the target FPGA.
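For readers unfamiliar with the training scheme, the floating-point sketch below shows the two operations the hardware parallelizes: evaluating the Gaussian basis responses and applying the online LMS weight update. The paper's design uses fixed-point arithmetic on the FPGA instead; the center count, the scalar input, and all names here are illustrative assumptions.

```cuda
#include <math.h>

#define NC 8   /* number of RBF centers -- an assumed, illustrative value */

/* Evaluate the RBF layer for a scalar input x: Gaussian responses phi[]
 * around fixed centers c[], combined by the linear output weights w[]. */
float rbfPredict(float x, const float c[NC], const float w[NC],
                 float gamma, float phi[NC])
{
    float y = 0.0f;
    for (int j = 0; j < NC; ++j) {
        float d = x - c[j];
        phi[j] = expf(-gamma * d * d);   /* Gaussian basis response */
        y += w[j] * phi[j];
    }
    return y;
}

/* One online LMS step on the output weights: w <- w + mu * e * phi. */
void lmsUpdate(float w[NC], const float phi[NC], float err, float mu)
{
    for (int j = 0; j < NC; ++j)
        w[j] += mu * err * phi[j];
}
```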
A MIMD implementation of a parallel Euler solver for unstructured grids
NASA Technical Reports Server (NTRS)
Venkatakrishnan, V.; Simon, Horst D.; Barth, Timothy J.
1992-01-01
A mesh-vertex finite volume scheme for solving the Euler equations on triangular unstructured meshes is implemented on a MIMD (multiple instruction/multiple data stream) parallel computer. Three partitioning strategies for distributing the work load onto the processors are discussed. Issues pertaining to the communication costs are also addressed. We find that the spectral bisection strategy yields the best performance. The performance of this unstructured computation on the Intel iPSC/860 compares very favorably with that on a one-processor CRAY Y-MP/1 and an earlier implementation on the Connection Machine.
Butterfly interconnection implementation for an n-bit parallel full adder/subtractor
NASA Astrophysics Data System (ADS)
Sun, De-Gui; Xiang, Qian; Wang, Na-Xin; Weng, Zhao-Heng
1992-07-01
Free-space optical interconnections are important in both massive digital optical computing and communication systems. The architectural features of three interconnection networks are analyzed and compared, and the optical butterfly interconnection is shown to have many advantages over other interconnections in implementing various basic logic functions such as addition, subtraction, multiplication, and fast Fourier transforms. Starting with conventional Karnaugh maps and Boolean algebra, the characteristics of full addition and full subtraction are analyzed and compared. An n-bit parallel calculator that can implement both ripple carry full additions and ripple borrow full subtractions using multilayer butterfly interconnection networks is designed. Then the schematic and architecture of the full adder/subtractor, the interconnection networks, and the patterns of key devices such as masks to implement AND and OR operations in this calculation are described in detail. Correct simulation results for several groups of multibit digits are provided. Finally, the development of the interconnections in implementing logic operations is discussed.
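As a reference for the addition and subtraction characteristics derived from the Karnaugh maps, here is a software rendering of the n-bit ripple carry/borrow operation the optical network realizes. This is a plain Boolean model, not the optical implementation.

```cuda
/* Software model of the n-bit ripple operation: full addition (mode = 0)
 * or full subtraction via borrow (mode = 1). One bit position per
 * iteration; the carry/borrow ripples through. */
unsigned rippleAddSub(unsigned a, unsigned b, int mode, int nbits)
{
    unsigned result = 0, c = 0;                 /* carry or borrow */
    for (int i = 0; i < nbits; ++i) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u, na = ai ^ 1u;
        result |= (ai ^ bi ^ c) << i;           /* sum/difference bit */
        c = mode ? (na & bi) | ((na | bi) & c)  /* borrow out */
                 : (ai & bi) | ((ai | bi) & c); /* carry out  */
    }
    return result;
}
```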
Parallel implementation of geometrical shock dynamics for two dimensional converging shock waves
NASA Astrophysics Data System (ADS)
Qiu, Shi; Liu, Kuang; Eliasson, Veronica
2016-10-01
Geometrical shock dynamics (GSD) theory is an appealing method to predict shock motion in the sense that it is more computationally efficient than solving the traditional Euler equations, especially for converging shock waves. However, to solve and optimize large scale configurations, the main bottleneck is the computational cost. Among the existing numerical GSD schemes, only one has been implemented on parallel computers, with the purpose of analyzing detonation waves. To extend the computational advantage of the GSD theory to more general applications such as converging shock waves, a numerical implementation using a spatial decomposition method has been coupled with a front tracking approach on parallel computers. In addition, an efficient tridiagonal system solver for massively parallel computers has been applied to resolve the most expensive function in this implementation, resulting in an efficiency of 0.93 while using 32 HPCC cores. Moreover, symmetric boundary conditions have been developed to further reduce the computational cost, achieving a speedup of 19.26 for a 12-sided polygonal converging shock.
A Novel Implementation of Massively Parallel Three Dimensional Monte Carlo Radiation Transport
NASA Astrophysics Data System (ADS)
Robinson, P. B.; Peterson, J. D. L.
2005-12-01
The goal of our summer project was to implement the difference formulation for radiation transport into Cosmos++, a multidimensional, massively parallel, magnetohydrodynamics code for astrophysical applications (Peter Anninos - AX). The difference formulation is a new method for Symbolic Implicit Monte Carlo thermal transport (Brooks and Szöke - PAT). Formerly, simultaneous implementation of fully implicit Monte Carlo radiation transport in multiple dimensions on multiple processors had not been convincingly demonstrated. We found that a combination of the difference formulation and the inherent structure of Cosmos++ makes such an implementation both accurate and straightforward. We developed a "nearly nearest neighbor physics" technique to allow each processor to work independently, even with a fully implicit code. This technique, coupled with the increased accuracy of an implicit Monte Carlo solution and the efficiency of parallel computing systems, allows us to demonstrate the possibility of massively parallel thermal transport. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
Task-parallel implementation of 3D shortest path raytracing for geophysical applications
NASA Astrophysics Data System (ADS)
Giroux, Bernard; Larouche, Benoît
2013-04-01
This paper discusses two variants of the shortest path method and their parallel implementation on a shared-memory system. One variant is designed to perform raytracing in models with stepwise distributions of interval velocity while the other is better suited for continuous velocity models. Both rely on a discretization scheme where primary nodes are located at the corners of cuboid cells and where secondary nodes are found on the edges and sides of the cells. The parallel implementations allow raytracing concurrently for different sources, providing an attractive framework for ray-based tomography. The accuracy and performance of the implementations were measured by comparison with the analytic solution for a layered model and for a vertical gradient model. Mean relative error less than 0.2% was obtained with 5 secondary nodes for the layered model and 9 secondary nodes for the gradient model. Parallel performance depends on the level of discretization refinement, on the number of threads, and on the problem size, with the most determinant variable being the level of discretization refinement (number of secondary nodes). The results indicate that a good trade-off between speed and accuracy is achieved with the number of secondary nodes equal to 5. The programs are written in C++ and rely on the Standard Template Library and OpenMP.
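The concurrency model here is coarse-grained: sources are independent, so each thread traces a different source against the shared model. A minimal sketch of that OpenMP pattern follows; the types and the per-source solver are hypothetical stand-ins for the paper's C++ classes.

```cuda
#include <omp.h>

typedef struct { int nx, ny, nz; /* velocities, secondary nodes ... */ } Model;
typedef struct { double x, y, z; } Source;

/* Hypothetical stand-in for the per-source shortest path solver. */
static double traceOneSource(const Model* m, const Source* s)
{
    (void)m; (void)s;
    return 0.0;   /* placeholder travel-time result */
}

/* Source-level parallelism: each thread raytraces a different source. */
void traceAllSources(const Model* model, const Source* src, int nsrc,
                     double* tmin)
{
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < nsrc; ++s)
        tmin[s] = traceOneSource(model, &src[s]);
}
```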
Discrete shearlet transform on GPU with applications in anomaly detection and denoising
NASA Astrophysics Data System (ADS)
Gibert, Xavier; Patel, Vishal M.; Labate, Demetrio; Chellappa, Rama
2014-12-01
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this has been exploited in a wide range of image and signal processing applications. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
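One concrete piece of such a GPU implementation is the shrinkage step applied to the transform coefficients, which is embarrassingly parallel. The kernel below is an illustrative hard-thresholding sketch, not code from the released library; the threshold factor k and the per-subband noise estimate sigma are assumed parameters.

```cuda
/* Hard-threshold the shearlet coefficients of one subband: coefficients
 * with magnitude below k*sigma are set to zero. One thread per coefficient. */
__global__ void hardThreshold(float* coeff, int n, float sigma, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(coeff[i]) < k * sigma)
        coeff[i] = 0.0f;
}
```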
Li, Pengcheng; Liu, Celong; Li, Xianpeng; He, Honghui; Ma, Hui
2016-09-20
In earlier studies, we developed scattering models and the corresponding CPU-based Monte Carlo simulation programs to study the behavior of polarized photons as they propagate through complex biological tissues. The high degrees of freedom in these studies create a demand for massive simulation tasks. In this paper, we report a parallel implementation of the simulation program based on the compute unified device architecture running on a graphics processing unit (GPU). Different schemes for sphere-only simulations and sphere-cylinder mixture simulations were developed. Diverse optimizing methods were employed to achieve the best acceleration. The final-version GPU program is hundreds of times faster than the CPU version. Dependence of the performance on input parameters and precision was also studied. It is shown that using single precision in the GPU simulations results in very limited losses in accuracy. Consumer-level graphics cards, even those in laptop computers, are more cost-effective than scientific graphics cards for single-precision computation. PMID:27661571
Tempest: GPU-CPU computing for high-throughput database spectral matching.
Milloy, Jeffrey A; Faherty, Brendan K; Gerber, Scott A
2012-07-01
Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer. PMID:22640374
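A dot-product score of this kind maps naturally onto one-thread-per-candidate GPU execution. The kernel below is a hedged sketch of that pattern with an assumed flat layout of theoretical fragment bins; it is not Tempest's actual code.

```cuda
/* Score ncand candidate peptides against one pre-binned experimental
 * spectrum: each thread sums the binned intensities at its candidate's
 * theoretical fragment positions (a dot-product-style similarity).
 * fragOffset has ncand+1 entries delimiting each candidate's fragments. */
__global__ void scoreCandidates(const float* binnedSpec, int nbins,
                                const int* fragBins, const int* fragOffset,
                                int ncand, float* score)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= ncand) return;
    float s = 0.0f;
    for (int f = fragOffset[c]; f < fragOffset[c + 1]; ++f) {
        int b = fragBins[f];            /* theoretical fragment bin index */
        if (b >= 0 && b < nbins) s += binnedSpec[b];
    }
    score[c] = s;
}
```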
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
NASA Astrophysics Data System (ADS)
Lyakh, Dmitry I.
2015-04-01
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
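For the 2D (matrix) special case, the standard CUDA technique underlying such transpose routines is the shared-memory tile: staging a tile in shared memory makes both the global read and the global write coalesced, and padding the tile avoids bank conflicts. A minimal sketch of this building block (not code from TAL-SH, which handles general N-dimensional permutations):

```cuda
#define TILE 32

/* Transpose a rows x cols matrix using a padded shared-memory tile. */
__global__ void transpose2d(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];     /* +1 avoids bank conflicts */
    int x = blockIdx.x * TILE + threadIdx.x;   /* column in input  */
    int y = blockIdx.y * TILE + threadIdx.y;   /* row in input     */
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    int tx = blockIdx.y * TILE + threadIdx.x;  /* column in output */
    int ty = blockIdx.x * TILE + threadIdx.y;  /* row in output    */
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}
```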
Roche-Lima, Abiel; Thulasiram, Ruppa K.
2016-01-01
Finite automata in which each transition is augmented with an output label, in addition to the familiar input label, are known as finite-state transducers. Transducers have been used to analyze some fundamental problems in bioinformatics: weighted finite-state transducers have been proposed for pairwise alignment of DNA and protein sequences, as well as for developing kernels for computational biology, and machine learning algorithms for conditional transducers have been implemented and used for DNA sequence analysis. Transducer learning algorithms are based on the computation of conditional probabilities, calculated using techniques such as pair-database creation, normalization (maximum-likelihood normalization), and parameter optimization (expectation-maximization, EM). These techniques are intrinsically costly to compute, and even more so when applied to bioinformatics, because the database sizes are large. In this work, we describe a parallel implementation of an algorithm to learn conditional transducers using these techniques. The algorithm is oriented to bioinformatics applications, such as alignments, phylogenetic trees, and other genome evolution studies. Several experiments were performed with the parallel and sequential algorithms on WestGrid (specifically, on the Breeze cluster). The results show that the parallel algorithm is scalable: its advantage over the sequential version grows considerably as the data size parameter is increased. A second experiment varied the precision parameter; here too, the parallel algorithm yielded smaller execution times. Finally, the number of threads used to execute the parallel algorithm on the Breeze cluster was varied; in this last experiment, speedup increased considerably with more threads, converging for 16 or more threads.
Accelerated GPU based SPECT Monte Carlo simulations
NASA Astrophysics Data System (ADS)
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator, respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE Infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between the GATE and GGEMS platforms derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially GPU based methods. In the classical GPU based imaging algorithm, the GPU is employed to accelerate image processing by massive parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). However, the computing capability of the CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPUs/GPUs is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the CPU parallel imaging part, the advanced vector extension (AVX) method is introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfers broken, but various optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method accelerates SAR imaging by 270 times relative to a single-core CPU and realizes real-time imaging, in that the imaging rate outperforms the raw data generation rate.
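To give a flavor of the AVX stage, the host-side snippet below applies a real-valued window/reference function to a block of samples eight floats at a time, the kind of data-level parallelism AVX adds on top of multi-core threading. The paper does not publish its kernels, so this is purely illustrative.

```cuda
#include <immintrin.h>

/* Multiply n samples by a window function, 8 floats per AVX operation. */
void windowSamples(float* data, const float* win, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 d = _mm256_loadu_ps(data + i);
        __m256 w = _mm256_loadu_ps(win + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(d, w));
    }
    for (; i < n; ++i) data[i] *= win[i];   /* scalar remainder */
}
```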
Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters
NASA Astrophysics Data System (ADS)
Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescapé, A.; Longo, G.; Ventre, G.
2014-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.
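The inherent parallelism referred to here is per-individual: fitness evaluations are independent, so one CUDA thread can evaluate one chromosome. A minimal sketch of that mapping follows, with a placeholder objective standing in for GAME's actual fitness function.

```cuda
/* Placeholder objective (sum of squares); GAME's real fitness differs. */
__device__ float evalFitness(const float* genes, int ngenes)
{
    float s = 0.0f;
    for (int j = 0; j < ngenes; ++j) s += genes[j] * genes[j];
    return s;
}

/* One thread evaluates one chromosome of the flat population array. */
__global__ void evaluatePopulation(const float* pop, int ngenes, int npop,
                                   float* fitness)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npop)
        fitness[i] = evalFitness(pop + (size_t)i * ngenes, ngenes);
}
```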
NASA Astrophysics Data System (ADS)
Abe, S.; Place, D.; Mora, P.
2001-12-01
The particle based lattice solid model has been used successfully as a virtual laboratory to simulate the dynamics of faults, earthquakes and gouge processes. The phenomena investigated with the lattice solid model range from the stick-slip behavior of faults, localization phenomena in gouge and the evolution of stress correlation in multi-fault systems, to the influence of rate- and state-dependent friction laws on the macroscopic behavior of faults. However, the results from those simulations also show that in order to take the next step towards more realistic simulations it will be necessary to use three-dimensional models containing a large number of particles with a range of sizes, thus requiring a significantly increased amount of computing resources. Whereas the computing power provided by a single processor can be expected to double every 18 to 24 months, parallel computers which provide hundreds of times the computing power are available today, and there are several efforts underway to construct dedicated parallel computers and associated simulation software systems for large-scale earth science simulation (e.g. the Australian Computational Earth Systems Simulator [1] and the Japanese Earth Simulator [2]). In order to use the computing power made available by those large parallel computers, a parallel version of the lattice solid model has been implemented. In order to guarantee portability over a wide range of computer architectures, a message passing approach based on MPI has been used in the implementation. Particular care has been taken to eliminate serial bottlenecks in the program, thus ensuring high scalability on systems with a large number of CPUs. Measures taken to achieve this objective include the use of asynchronous communication between the parallel processes and the minimization of communication with, and work done by, a central "master" process. Benchmarks using models with up to 6 million particles on a parallel computer with 128 CPUs show that the implementation scales well.
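The asynchronous communication mentioned above is typically realized as a non-blocking halo exchange overlapped with interior computation. The sketch below illustrates this pattern under assumed names (a 1-D neighbor topology, left/right boundary buffers); it is not the lattice solid model's actual code.

```cuda
#include <mpi.h>

/* Post non-blocking sends/receives of boundary ("halo") particles to the
 * left and right neighbors, then overlap interior work with the messages
 * in flight before synchronizing. */
void exchangeHalo(double* sendL, double* sendR, double* recvL, double* recvR,
                  int count, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];
    MPI_Irecv(recvL, count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recvR, count, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendL, count, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(sendR, count, MPI_DOUBLE, right, 0, comm, &req[3]);
    /* ... compute on interior particles while communication proceeds ... */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```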
NASA Astrophysics Data System (ADS)
Furuichi, M.; Nishiura, D.
2015-12-01
Fully Lagrangian methods such as Smoothed Particle Hydrodynamics (SPH) and the Discrete Element Method (DEM) have been widely used to solve continuum and particle motions in computational geodynamics. These mesh-free methods are well suited to problems with complex geometries and boundaries. In addition, their Lagrangian nature allows non-diffusive advection, which is useful for tracking history-dependent material properties (e.g. rheology). These potential advantages over mesh-based methods enable effective numerical applications to geophysical flow and tectonic processes, for example tsunamis with free surfaces and floating bodies, magma intrusion with rock fracture, and shear-zone pattern generation in granular deformation. Investigating such geodynamical problems with particle-based methods requires millions to billions of particles for realistic simulation, so parallel computing is essential for handling the huge computational cost. An efficient parallel implementation of SPH and DEM is, however, known to be difficult, especially on distributed-memory architectures. Lagrangian methods inherently suffer from workload imbalance when parallelized over domains fixed in space, because particles move around and workloads change during the simulation. Dynamic load balancing is therefore a key technique for large-scale SPH and DEM simulation. In this work, we present parallel implementation techniques for SPH and DEM that utilize dynamic load balancing algorithms, aiming at high-resolution simulations over large domains on massively parallel supercomputer systems. Our method treats the imbalance in execution time across MPI processes as the nonlinear term of the parallel domain decomposition and minimizes it with a Newton-like iteration; the slice-grid algorithm is used to perform flexible domain decomposition in space. Numerical tests show that our load balancing approach maintains good parallel efficiency.
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources.
Townson, Reid W; Jia, Xun; Tian, Zhen; Graves, Yan Jiang; Zavgorodni, Sergei; Jiang, Steve B
2013-06-21
A novel phase-space source implementation has been designed for graphics processing unit (GPU)-based Monte Carlo dose calculation engines. Short of full simulation of the linac head, using a phase-space source is the most accurate method to model a clinical radiation beam in dose calculations. However, in GPU-based Monte Carlo dose calculations where the computation efficiency is very high, the time required to read and process a large phase-space file becomes comparable to the particle transport time. Moreover, due to the parallelized nature of GPU hardware, it is essential to simultaneously transport particles of the same type and similar energies but separated spatially to yield a high efficiency. We present three methods for phase-space implementation that have been integrated into the most recent version of the GPU-based Monte Carlo radiotherapy dose calculation package gDPM v3.0. The first method is to sequentially read particles from a patient-dependent phase-space and sort them on-the-fly based on particle type and energy. The second method supplements this with a simple secondary collimator model and fluence map implementation so that patient-independent phase-space sources can be used. Finally, as the third method (called the phase-space-let, or PSL, method) we introduce a novel source implementation utilizing pre-processed patient-independent phase-spaces that are sorted by particle type, energy and position. Position bins located outside a rectangular region of interest enclosing the treatment field are ignored, substantially decreasing simulation time with little effect on the final dose distribution. The three methods were validated in absolute dose against BEAMnrc/DOSXYZnrc and compared using gamma-index tests (2%/2 mm above the 10% isodose). It was found that the PSL method has the optimal balance between accuracy and efficiency and thus is used as the default method in gDPM v3.0. Using the PSL method, open fields of 4 × 4, 10 × 10 and 30 × 30 cm
A parallel implementation of an off-lattice individual-based model of multicellular populations
NASA Astrophysics Data System (ADS)
Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe
2015-07-01
As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.
Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano
2014-09-01
A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Parallel implementation of the accelerated BEM approach for EMSI of the human brain.
Ataseven, Y; Akalin-Acar, Z; Acar, C E; Gençer, N G
2008-07-01
Boundary element method (BEM) is one of the numerical methods which is commonly used to solve the forward problem (FP) of electro-magnetic source imaging with realistic head geometries. Application of BEM generates large systems of linear equations with dense matrices. Generation and solution of these matrix equations are time and memory consuming. This study presents a relatively cheap and effective solution for parallel implementation of the BEM to reduce the processing times to clinically acceptable values. This is achieved using a parallel cluster of personal computers on a local area network. We used eight workstations and implemented a parallel version of the accelerated BEM approach that distributes the computation and the BEM matrix efficiently to the processors. The performance of the solver is evaluated in terms of the CPU operations and memory usage for different numbers of processors. Once the transfer matrix is computed, for a 12,294 node mesh, a single FP solution takes 676 ms on a single processor and 72 ms on eight processors. It was observed that workstation clusters are cost effective tools for solving the complex BEM models in a clinically acceptable time. PMID:18299914
Implementation of a Parallel Kalman Filter for Stratospheric Chemical Tracer Assimilation
NASA Technical Reports Server (NTRS)
Chang, Lang-Ping; Lyster, Peter M.; Menard, R.; Cohn, S. E.
1998-01-01
A Kalman filter for the assimilation of long-lived atmospheric chemical constituents has been developed for two-dimensional transport models on isentropic surfaces over the globe. An important attribute of the Kalman filter is that it calculates error covariances of the constituent fields using the tracer dynamics. Consequently, the current Kalman-filter assimilation is a five-dimensional problem (coordinates of two points and time), and it can only be handled on computers with large memory and high floating point speed. In this paper, an implementation of the Kalman filter for distributed-memory, message-passing parallel computers is discussed. Two approaches were studied: an operator decomposition and a covariance decomposition. The latter was found to be more scalable than the former, and it possesses the property that the dynamical model does not need to be parallelized, which is of considerable practical advantage. This code is currently used to assimilate constituent data retrieved by limb sounders on the Upper Atmosphere Research Satellite. Tests of the code examined the variance transport and observability properties. Aspects of the parallel implementation, some timing results, and a brief discussion of the physical results will be presented.
NASA Astrophysics Data System (ADS)
Santos, Lucana; Magli, Enrico; Vitulli, Raffaele; Núñez, Antonio; López, José F.; Sarmiento, Roberto
2013-01-01
There is a pressing need for the development of new hardware architectures for the implementation of algorithms for hyperspectral image compression on board satellites. Graphics processing units (GPUs) represent a very attractive opportunity, offering the possibility to dramatically increase the computation speed in applications that are data and task parallel. An algorithm for the lossy compression of hyperspectral images is implemented on a GPU using Nvidia compute unified device architecture (CUDA) parallel computing architecture. The parallelization strategy is explained, with emphasis on the entropy coding and bit packing phases, for which a more sophisticated strategy is necessary due to the existing data dependencies. Experimental results are obtained by comparing the performance of the GPU implementation with a single-threaded CPU implementation, showing high speedups of up to 15.41. A profiling of the algorithm is provided, demonstrating the high performance of the designed parallel entropy coding phase. The accuracy of the GPU implementation is presented, as well as the effect of the configuration parameters on performance. The convenience of using GPUs for on-board processing is demonstrated, and solutions to the potential difficulties encountered when accelerating hyperspectral compression algorithms are proposed, in anticipation of space-qualified GPUs becoming a reality in the near future.
Tanner, David E; Phillips, James C; Schulten, Klaus
2012-07-10
Molecular dynamics methodologies comprise a vital research tool for structural biology. Molecular dynamics has benefited from technological advances in computing, such as multi-core CPUs and graphics processing units (GPUs), but harnessing the full power of hybrid GPU/CPU computers remains difficult. The generalized Born/solvent-accessible surface area implicit solvent model (GB/SA) stands to benefit from hybrid GPU/CPU computers, employing the GPU for the GB calculation and the CPU for the SA calculation. Here, we explore the computational challenges facing GB/SA calculations on hybrid GPU/CPU computers and demonstrate how NAMD, a parallel molecular dynamics program, is able to efficiently utilize GPUs and CPUs simultaneously for fast GB/SA simulations. The hybrid computation principles demonstrated here are generally applicable to parallel applications employing hybrid GPU/CPU calculations.
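The division of labor described here, GB on the GPU and SA on the CPU, is a classic overlap pattern built on CUDA streams. The sketch below illustrates it with placeholder kernel and function names; it is not NAMD's actual API.

```cuda
#include <cuda_runtime.h>

/* Placeholder GB kernel: stands in for the Born-radii and pairwise GB
 * force computation that runs on the GPU. */
__global__ void gbForcesKernel(const float4* coords, float* gbForce, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) gbForce[i] = 0.0f * coords[i].x;   /* placeholder work */
}

/* Launch GB asynchronously on the GPU, compute the SA term on the CPU
 * meanwhile, then synchronize before combining the forces. */
void gbsaStep(const float4* dCoords, float* dGbForce, int n,
              void (*saForcesCpu)(void), cudaStream_t stream)
{
    int block = 256, grid = (n + block - 1) / block;
    gbForcesKernel<<<grid, block, 0, stream>>>(dCoords, dGbForce, n);
    saForcesCpu();                       /* CPU runs while GPU is busy */
    cudaStreamSynchronize(stream);       /* join before the force update */
}
```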
Pandya, Tara M.; Johnson, Seth R.; Evans, Thomas M.; Davidson, Gregory G.; Hamilton, Steven P.; Godfrey, Andrew T.
2015-12-21
This paper discusses the implementation, capabilities, and validation of Shift, a massively parallel Monte Carlo radiation transport package developed and maintained at Oak Ridge National Laboratory. It has been developed to scale well from laptop to small computing clusters to advanced supercomputers. Special features of Shift include hybrid capabilities for variance reduction such as CADIS and FW-CADIS, and advanced parallel decomposition and tally methods optimized for scalability on supercomputing architectures. Shift has been validated and verified against various reactor physics benchmarks and compares well to other state-of-the-art Monte Carlo radiation transport codes such as MCNP5, CE KENO-VI, and OpenMC. Some specific benchmarks used for verification and validation include the CASL VERA criticality test suite and several Westinghouse AP1000® problems. These benchmark and scaling studies show promising results.
Parallel computation for blood cell classification in medical hyperspectral imagery
NASA Astrophysics Data System (ADS)
Li, Wei; Wu, Lucheng; Qiu, Xianbo; Ran, Qiong; Xie, Xiaoming
2016-09-01
With the advantage of fine spectral resolution, hyperspectral imagery provides great potential for cell classification. This paper presents a promising classification system comprising the following three stages: (1) band selection of a subset of spectral bands with distinctive and informative features, (2) spectral-spatial feature extraction, such as local binary patterns (LBP), and (3) an effective classifier. Moreover, these three steps are each implemented on graphics processing units (GPU), which makes the system real-time and more practical. The GPU parallel implementation is compared with the serial implementation on central processing units (CPU). Experimental results based on real medical hyperspectral data demonstrate that the proposed system is able to offer high accuracy and fast speed, which are appealing for cell classification in medical hyperspectral imagery.
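Stage (2) is a natural per-pixel GPU task. The kernel below is a minimal sketch of the basic 8-neighbor LBP code for one interior pixel of a single band; the paper's spectral-spatial features are richer than this.

```cuda
/* Compute the 8-neighbor local binary pattern code of each interior pixel:
 * each neighbor at least as bright as the center contributes one bit. */
__global__ void lbp8(const unsigned char* img, unsigned char* code,
                     int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;
    unsigned char c = img[y * width + x], bits = 0;
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    for (int k = 0; k < 8; ++k)
        if (img[(y + dy[k]) * width + (x + dx[k])] >= c)
            bits |= (unsigned char)(1u << k);
    code[y * width + x] = bits;
}
```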
Scalar and Parallel Optimized Implementation of the Direct Simulation Monte Carlo Method
NASA Astrophysics Data System (ADS)
Dietrich, Stefan; Boyd, Iain D.
1996-07-01
This paper describes a new concept for the implementation of the direct simulation Monte Carlo (DSMC) method. It uses a localized data structure based on a computational cell to achieve high performance, especially on workstation processors, which can also be used in parallel. Since the data structure makes it possible to freely assign any cell to any processor, a domain decomposition can be found with equal calculation load on each processor while maintaining minimal communication among the nodes. Further, the new implementation strictly separates physical modeling, geometrical issues, and organizational tasks to achieve high maintainability and to simplify future enhancements. Three example flow configurations are calculated with the new implementation to demonstrate its generality and performance. They include a flow through a diverging channel using an adapted unstructured triangulated grid, a flow around a planetary probe, and an internal flow in a contactor used in plasma physics. The results are validated either by comparison with results obtained from other simulations or by comparison with experimental data. High performance on an IBM SP2 system is achieved if problem size and number of parallel processors are adapted accordingly. On 400 nodes, DSMC calculations with more than 100 million particles are possible.
Madduri, Kamesh; Ediger, David; Jiang, Karl; Bader, David A.; Chavarría-Miranda, Daniel
2009-05-29
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
NASA Astrophysics Data System (ADS)
Komura, Yukihiro; Okabe, Yutaka
2014-03-01
We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models: the Ising model, the q-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We previously reported the idea of the GPU implementation for 2D models (Komura and Okabe, 2012). Here we explain the details of the sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions.
Catalogue identifier: AERM_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AERM_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 5632
No. of bytes in distributed program, including test data, etc.: 14688
Distribution format: tar.gz
Programming language: C, CUDA.
Computer: System with an NVIDIA CUDA enabled GPU.
Operating system: System with an NVIDIA CUDA enabled GPU.
Classification: 23.
External routines: NVIDIA CUDA Toolkit 3.0 or newer
Nature of problem: Monte Carlo simulation of classical spin systems. The Ising model, the q-state Potts model, and the classical XY model are treated for both two-dimensional and three-dimensional lattices.
Solution method: GPU-based Swendsen-Wang multi-cluster spin flip Monte Carlo method. The CUDA implementation for the cluster-labeling is based on the work by Hawick et al. [1] and that by Kalentev et al. [2].
Restrictions: The system size is limited depending on the memory of a GPU.
Running time: For the parameters used in the sample programs, it takes about a minute for each program. Of course, it depends on the system size, the number of Monte Carlo steps, etc.
References: [1] K
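For orientation, the first stage of a Swendsen-Wang sweep on the GPU can be sketched as below for the Ising case: bonds between aligned neighbors are activated with probability p = 1 - exp(-2J/kT), one thread per site. This is an illustrative sketch, not the published program; the cluster-labeling stage (Hawick/Kalentev style) would follow, and the RNG states are assumed to be initialized elsewhere.

```cuda
/* Hedged sketch of SW bond activation for the 2D Ising model on a
   periodic L x L lattice; names and layout are illustrative. */
#include <curand_kernel.h>

__global__ void activate_bonds(const int *spin, unsigned char *bond_x,
                               unsigned char *bond_y, curandState *rng,
                               int L, float p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= L * L) return;
    int x = i % L, y = i / L;
    int right = y * L + (x + 1) % L;          /* periodic neighbors */
    int up    = ((y + 1) % L) * L + x;
    curandState s = rng[i];
    bond_x[i] = (spin[i] == spin[right]) && (curand_uniform(&s) < p);
    bond_y[i] = (spin[i] == spin[up])    && (curand_uniform(&s) < p);
    rng[i] = s;
}
```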
GPU-based fast gamma index calculation
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Jia, Xun; Jiang, Steve B.
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of the γ-index requires an exhaustive search for the closest Euclidean distance in the high-resolution dose-distance space. This is a computationally intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on a single NVIDIA Tesla C1060 card, achieving a 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on the CPU is proportional to the summation of γⁿ over all voxels, while that on the GPU is affected by the distribution of γⁿ and is approximately proportional to the γⁿ summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on the CPU, but a less-than-quadratic increase on the GPU. The values of the dose difference and distance-to-agreement criteria also have an impact on the γ-index calculation time.
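To make the definition concrete, a brute-force GPU γ-index kernel is sketched below for a 1D simplification (one thread per evaluated voxel, exhaustive search over reference voxels). This is not the paper's optimized geometric/pre-sorted method; dta and dd stand for the distance-to-agreement and dose-difference criteria, and all names are illustrative.

```cuda
/* Illustrative brute-force gamma-index kernel (1D simplification). */
__global__ void gamma_brute(const float *dose_eval, const float *dose_ref,
                            const float *pos,   /* voxel coordinates, mm */
                            float *gamma, int n, float dta, float dd)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = 1e30f;
    for (int j = 0; j < n; ++j) {               /* exhaustive search */
        float dr = (pos[i] - pos[j]) / dta;
        float dD = (dose_eval[i] - dose_ref[j]) / dd;
        float g2 = dr * dr + dD * dD;
        best = fminf(best, g2);
    }
    gamma[i] = sqrtf(best);
}
```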
NASA Astrophysics Data System (ADS)
Wittek, Peter; Calderaro, Luca
2015-12-01
We extended a parallel and distributed implementation of the Trotter-Suzuki algorithm for simulating quantum systems to study a wider range of physical problems and to make the library easier to use. The new release allows periodic boundary conditions, many-body simulations of non-interacting particles, arbitrary stationary potential functions, and imaginary time evolution to approximate the ground state energy. The new release is more resilient to the computational environment: a wider range of compiler chains and more platforms are supported. To ease development, we provide a more extensive command-line interface, an application programming interface, and wrappers from high-level languages.
NASA Astrophysics Data System (ADS)
Shao, Ran; Linton, David; Spence, Ivor; Zheng, Ning
2015-12-01
An effective way to accelerate the finite-difference time-domain (FDTD) method is to use a graphics processing unit (GPU). This paper describes an implementation of the three-dimensional FDTD method with the CPML boundary condition on a Kepler (GK110) architecture GPU. We first optimize the FDTD domain decomposition method for the Kepler GPU; several Kepler-specific optimizations are then studied and applied to the FDTD program. The optimized program achieved up to a 270.9× speedup compared to the sequential CPU version, and the experiments show that 22.2% of the simulation time is saved compared to the GPU version without these optimizations. The solution is also faster than previously published implementations.
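The GPU mapping that such FDTD codes rely on is one thread per grid cell. The sketch below shows a minimal 2D Yee-grid Ez update under that mapping; it is only an illustration of the pattern, with assumed names, whereas the paper's code is 3D with CPML boundaries and further Kepler-specific tuning. The coefficient cb absorbs dt/(eps*dx).

```cuda
/* Minimal 2D Yee-grid Ez update: one thread per cell. */
__global__ void update_ez(float *ez, const float *hx, const float *hy,
                          int nx, int ny, float cb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx || j >= ny) return;
    int id = j * nx + i;
    /* curl of H evaluated from the staggered H-field components */
    ez[id] += cb * ((hy[id] - hy[id - 1]) - (hx[id] - hx[id - nx]));
}
```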
Diamond, Alan; Nowotny, Thomas; Schmuker, Michael
2015-01-01
Neuromorphic computing employs models of neuronal circuits to solve computing problems. Neuromorphic hardware systems are now becoming more widely available and "neuromorphic algorithms" are being developed. As they are maturing toward deployment in general research environments, it becomes important to assess and compare them in the context of the applications they are meant to solve. This should encompass not just task performance, but also ease of implementation, speed of processing, scalability, and power efficiency. Here, we report our practical experience of implementing a bio-inspired, spiking network for multivariate classification on three different platforms: the hybrid digital/analog Spikey system, the digital spike-based SpiNNaker system, and GeNN, a meta-compiler for parallel GPU hardware. We assess performance using a standard hand-written digit classification task. We found that whilst a different implementation approach was required for each platform, classification performances remained in line. This suggests that all three implementations were able to exercise the model's ability to solve the task rather than exposing inherent platform limits, although differences emerged when capacity was approached. With respect to execution speed and power consumption, we found that for each platform a large fraction of the computing time was spent outside of the neuromorphic device, on the host machine. Time was spent in a range of combinations of preparing the model, encoding suitable input spiking data, shifting data, and decoding spike-encoded results. This is also where a large proportion of the total power was consumed, most markedly for the SpiNNaker and Spikey systems. We conclude that the simulation efficiency advantage of the assessed specialized hardware systems is easily lost in excessive host-device communication, or non-neuronal parts of the computation. These results emphasize the need to optimize the host-device communication architecture for
A GPU accelerated Barnes-Hut tree code for FLASH4
NASA Astrophysics Data System (ADS)
Lukat, Gunther; Banerjee, Robi
2016-05-01
We present a GPU-accelerated CUDA-C implementation of the Barnes-Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and is therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. We use a Maclaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores with effective AMR to carry out performance tests, also in comparison with previous gravity solvers. Depending on the setup and the GPU/CPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison to the gravity solvers implemented in the FLASH4 code. For full simulations, we find an overall speedup of at least a factor of 1.6 and up to a factor of 10.
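The monopole evaluation at the heart of any BH tree walk can be sketched in a few lines of C. This is a generic illustration under assumed node layout, not the FLASH4 implementation: if a node passes the opening-angle test, its total mass at its center of mass is used instead of descending further.

```c
/* Generic BH tree walk with monopole acceptance (illustrative layout). */
#include <math.h>

typedef struct Node {
    double com[3], mass, size;   /* center of mass, total mass, node width */
    struct Node *child[8];       /* NULL children indicate a leaf */
} Node;

void accel(const Node *n, const double pos[3], double theta, double g[3])
{
    if (!n || n->mass == 0.0) return;
    double d[3] = { n->com[0] - pos[0], n->com[1] - pos[1], n->com[2] - pos[2] };
    double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + 1e-12;
    double r  = sqrt(r2);
    int leaf = (n->child[0] == NULL);          /* simplification: no partial nodes */
    if (leaf || n->size / r < theta) {         /* accept monopole */
        double f = n->mass / (r2 * r);         /* G = 1 units */
        for (int k = 0; k < 3; ++k) g[k] += f * d[k];
    } else {
        for (int k = 0; k < 8; ++k) accel(n->child[k], pos, theta, g);
    }
}
```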
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2009-01-01
Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not yet clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips that duplicate conventional computing capabilities on a single die. Yet accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelizing text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for their suitability for GPU devices, and it may also be one of the first to exploit and report on the atomic instructions that have recently become available in NVIDIA devices.
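The atomic-instruction pattern such codes exploit can be shown in miniature: threads scan token streams and atomically accumulate per-term match counts. The kernel below is an illustrative sketch with assumed names, not the study's algorithm.

```cuda
/* Per-term match counting with atomics: one thread per document token. */
__global__ void count_terms(const int *doc_tokens, int n_tokens,
                            const int *query_terms, int n_terms,
                            unsigned int *counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_tokens) return;
    for (int t = 0; t < n_terms; ++t)
        if (doc_tokens[i] == query_terms[t])
            atomicAdd(&counts[t], 1u);   /* contended but correct update */
}
```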
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
Jia, X; Folkerts, M; Shi, F; Yan, H; Yan, Y; Jiang, S; Sivagnanam, S; Majumdar, A
2014-06-01
Purpose: Graphics Processing Units (GPUs) have become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, repeated effort is often spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross-comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. They can then select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purposes. After a run completes, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. A submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. Developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is fully functional. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
Design and implementation of a parallel array operator for the arbitrary remapping of data.
Dietz, Steven; Choi, S. E.; Chamberlain, B. L.; Snyder, Lawrence
2003-01-01
The data redistribution or remapping functions, gather and scatter, are long-standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations (run-length encoding, dead array reuse, and direct communication) that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate performance comparable to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.
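As a reference point for the remap operator discussed above, gather and scatter in their simplest sequential C form are dst[i] = src[map[i]] and dst[map[i]] = src[i]; the paper's operator generalizes these to distributed arrays with the optimizations listed.

```c
/* Reference semantics of gather and scatter over an index map. */
void gather (double *dst, const double *src, const int *map, int n)
{ for (int i = 0; i < n; ++i) dst[i] = src[map[i]]; }

void scatter(double *dst, const double *src, const int *map, int n)
{ for (int i = 0; i < n; ++i) dst[map[i]] = src[i]; }
```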
An object-oriented implementation of a parallel Monte Carlo code for radiation transport
NASA Astrophysics Data System (ADS)
Santos, Pedro Duarte; Lani, Andrea
2016-05-01
This paper describes the main features of a state-of-the-art Monte Carlo solver for radiation transport which has been implemented within COOLFluiD, a world-class open-source object-oriented platform for scientific simulations. The Monte Carlo code makes use of efficient ray tracing algorithms (for 2D, axisymmetric and 3D arbitrary unstructured meshes) which are described in detail. The solver accuracy is first verified in test cases for which analytical solutions are available, then validated for a space re-entry flight experiment (i.e. FIRE II) for which comparisons against both experiments and reference numerical solutions are provided. Through the flexible design of the physical models, ray tracing and parallelization strategy (fully reusing the mesh decomposition inherited from the fluid simulator), the implementation was made efficient and reusable.
Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes
NASA Technical Reports Server (NTRS)
Wissink, Andrew; Allen, Edwin (Technical Monitor)
1998-01-01
Viscous calculations about geometrically complex bodies in which there is relative motion between component parts are among the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first part of the presentation focuses on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and the DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multiprocessors. The second part of the presentation covers some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed-memory multiprocessor computer architectures is also reported.
On the design and implementation of a parallel, object-oriented, image processing toolkit
Kamath, C; Baldwin, C; Fodor, I; Tang, N A
2000-06-22
Advances in technology have enabled us to collect data from observations, experiments, and simulations at an ever increasing pace. As these data sets approach the terabyte and petabyte range, scientists are increasingly using semi-automated techniques from data mining and pattern recognition to find useful information in the data. For data mining to be successful, the raw data must first be processed into a form suitable for the detection of patterns. When the data are in the form of images, this can involve a substantial amount of processing on very large data sets. To help make this task more efficient, we are designing and implementing an object-oriented image processing toolkit that specifically targets massively parallel, distributed-memory architectures. We first show that it is possible to use object-oriented technology to effectively address the diverse needs of image applications. Next, we describe how we abstract out the similarities in image processing algorithms to enable re-use in the software. We also discuss the difficulties encountered in parallelizing image algorithms on massively parallel machines, as well as the bottlenecks to high performance. We demonstrate the work using images from an astronomical data set, and illustrate how techniques such as filtering and denoising through the thresholding of wavelet coefficients can be applied when a large image is distributed across several processors.
Parallel implementation of backpropagation neural networks on a heterogeneous array of transputers.
Foo, S K; Saratchandran, P; Sundararajan, N
1997-01-01
This paper analyzes parallel implementation of the backpropagation training algorithm on a heterogeneous transputer network (i.e., transputers of different speed and memory) connected in a pipelined ring topology. Training-set parallelism is employed as the parallelizing paradigm for the backpropagation algorithm. It is shown through analysis that finding the optimal allocation of the training patterns amongst the processors, so as to minimize the time for a training epoch, is a mixed integer programming problem. Using mixed integer programming, optimal pattern allocations for heterogeneous processor networks having a mixture of T805-20 (20 MHz) and T805-25 (25 MHz) transputers are theoretically found for two benchmark problems. The time for an epoch corresponding to the optimal pattern allocations is then obtained experimentally for the benchmark problems on the T805-20/T805-25 heterogeneous networks. A Monte Carlo simulation study is carried out to statistically verify the optimality of the epoch time obtained from the mixed integer programming based allocations. In this study, pattern allocations are randomly generated and the corresponding time for an epoch is experimentally obtained from the heterogeneous network. The mean and standard deviation of the epoch times from the random allocations are then compared with the optimal epoch time. The results show the optimal epoch time to be always lower than the mean epoch times by more than three standard deviations (3σ) for all the sample sizes used in the study, thus validating the theoretical analysis.
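A naive baseline for the allocation problem is to split patterns in proportion to processor speed, as in the C sketch below (assumed names). The paper's mixed integer programming solution improves on this by also accounting for memory limits and pipeline communication costs.

```c
/* Proportional training-pattern allocation across processors of
   different speeds; the rounding remainder goes to one node. */
void allocate_patterns(int n_patterns, const double *speed, int n_proc,
                       int *alloc)
{
    double total = 0.0;
    int assigned = 0;
    for (int p = 0; p < n_proc; ++p) total += speed[p];
    for (int p = 0; p < n_proc; ++p) {
        alloc[p] = (int)(n_patterns * speed[p] / total);
        assigned += alloc[p];
    }
    alloc[0] += n_patterns - assigned;
}
```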
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
NASA Astrophysics Data System (ADS)
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate dendritic growth with motion of solids in a dilute binary alloy melt. The GPU-accelerated PFLB method has been implemented using CUDA C. Equiaxed dendritic growth under shear flow and settling conditions has been simulated with the developed GPU code. It is confirmed that the PFLB simulations were efficiently accelerated by introducing GPU computation. The characteristic dendrite morphologies, which depend on the melt flow and the motion of the dendrite, could also be confirmed by the simulations.
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
NASA Astrophysics Data System (ADS)
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement
Analysis and parallel implementation of a forced N-body problem
NASA Astrophysics Data System (ADS)
Torres, C. E.; Parishani, H.; Ayala, O.; Rossi, L. F.; Wang, L.-P.
2013-07-01
The understanding of particle dynamics in N-body problems is of importance to many applications in astrophysics, molecular dynamics and cloud/plasma physics, where the theoretical representation results in a coupled system of equations for a large number of entities. This paper concerns algorithms for solving a specific N-body problem, namely, a system of disturbance velocities for hydrodynamically interacting particles in a particle-laden turbulent flow. The system is derived from the improved superposition method of [1]. Targeting scalable computations on petascale computers, we have carried out a thorough study of a parallel implementation of GMRes with different features, such as preconditioners, matrix-free operation, and parallel sparse representation of the matrix through 1D and 2D spatial domain decompositions. The Gauss-Seidel method is also studied as a reference iterative algorithm. The range of conditions for efficiency and failure of each method is discussed in detail. Through perturbation analysis, we have conducted a series of experiments to understand the effect of particle sizes, interaction symmetry, inter-particle distances and interaction truncation on the eigenvalues and normality of the linear system. For situations where the system is ill-conditioned, we introduce a restricted Schwarz-type preconditioner. We verified the parallel efficiency of the preconditioner using 1D domain decomposition on a parallel machine. A benchmark problem of particle-laden turbulence at 512³ resolution with 2×10⁶ particles is studied to understand the scalability of the proposed methods on parallel machines. We have developed a stable and highly scalable parallel solver with an affordable computational cost, even for ill-conditioned systems, through preconditioning. On 64 cores, using GMRes in 2D domain decomposition, we achieved a speedup of ~5.6× relative to 1D domain decomposition on the same number of processors. Our complexity analysis showed that for large N
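The matrix-free idea behind such a GMRes solver can be sketched generically: instead of storing the N-by-N interaction matrix, the operator is applied on the fly from pairwise kernels. The C sketch below is illustrative only; interact() is a placeholder for the truncated hydrodynamic interaction of the paper.

```c
/* Conceptual matrix-free application of y = (I + M) x, where M is built
   on the fly from a pairwise interaction kernel. */
void apply_operator(int n, const double *x, double *y,
                    double (*interact)(int i, int j))
{
    for (int i = 0; i < n; ++i) {
        double acc = x[i];               /* identity part of the system */
        for (int j = 0; j < n; ++j)
            if (j != i)
                acc += interact(i, j) * x[j];
        y[i] = acc;
    }
}
```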
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Plaza, Javier
2011-11-01
Hyperspectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It addresses the (possibly) mixed nature of pixels collected by instruments for Earth observation, which are due to several phenomena including limited spatial resolution, presence of mixing effects at different scales, etc. Spectral unmixing involves the separation of a mixed pixel spectrum into its pure component spectra (called endmembers) and the estimation of the proportion (abundance) of each endmember in the pixel. Two models have been widely used in the literature in order to address the mixture problem in hyperspectral data. The linear model assumes that the endmember substances are sitting side-by-side within the field of view of the imaging instrument. On the other hand, the nonlinear mixture model assumes nonlinear interactions between endmember substances. Both techniques can be computationally expensive, in particular for high-dimensional hyperspectral data sets. In this paper, we develop and compare parallel implementations of linear and nonlinear unmixing techniques for remotely sensed hyperspectral data. For the linear model, we adopt a parallel unsupervised processing chain made up of two steps: i) identification of pure spectral materials or endmembers, and ii) estimation of the abundance of each endmember in each pixel of the scene. For the nonlinear model, we adopt a supervised procedure based on the training of a parallel multi-layer perceptron neural network using intelligently selected training samples, also derived in parallel fashion. The compared techniques are experimentally validated using hyperspectral data collected at different altitudes over a so-called Dehesa (semi-arid environment) in Extremadura, Spain, and evaluated in terms of computational performance using high performance computing systems such as commodity Beowulf clusters.
GPU-based relative fuzzy connectedness image segmentation
Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-15
Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU-based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, respectively, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
Quantum.Ligand.Dock: protein-ligand docking with quantum entanglement refinement on a GPU system.
Kantardjiev, Alexander A
2012-07-01
Quantum.Ligand.Dock (protein-ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein-ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and the mutual effects of the proton equilibria of the docking partners. On the other hand, Quantum.Ligand.Dock is the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for developing the method and proposing it to the community hinges upon two arguments: the fundamental importance of the quantum entanglement contribution in molecular interaction, and the realistic possibility of implementing it given the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein-ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand structures should be submitted as atomic coordinate files in PDB format. A customization section is offered for the addition of user-specified charges, extra ionogenic groups with intrinsic pK(a) values, or fixed ions. Final predicted complexes are ranked according to the obtained scores and provided in PDB format, as well as in an interactive visualization in a molecular viewer. The Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html.
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
Neylon, J. Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.
2014-10-15
, respectively. Accuracy was investigated using three distinct phantoms with varied geometries and heterogeneities and on a series of 14 segmented lung CT data sets. Performance gains were calculated using three 256 mm cube homogenous water phantoms, with isotropic voxel dimensions of 1, 2, and 4 mm. Results: The nonvoxel-based GPU algorithm was independent of the data size and provided significant computational gains over the CPU algorithm for large CT data sizes. The parameter search analysis also showed that the ray combination of 8 zenithal and 8 azimuthal angles along with 1 mm radial sampling and 2 mm parallel ray spacing maintained dose accuracy with greater than 99% of voxels passing the γ test. Combining the acceleration obtained from GPU parallelization with the sampling optimization, the authors achieved a total performance improvement factor of >175 000 when compared to our voxel-based ground truth CPU benchmark and a factor of 20 compared with a voxel-based GPU dose convolution method. Conclusions: The nonvoxel-based convolution method yielded substantial performance improvements over a generic GPU implementation, while maintaining accuracy as compared to a CPU computed ground truth dose distribution. Such an algorithm can be a key contribution toward developing tools for adaptive radiation therapy systems.
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Wang, Kaibo; Huai, Yin; Lee, Rubao; Wang, Fusheng; Zhang, Xiaodong; Saltz, Joel H.
2012-01-01
As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets demonstrate the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database approach. PMID:23355955
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Tanikawa, Ataru; Hosono, Natsuki; Nitadori, Keigo; Muranushi, Takayuki; Makino, Junichiro
2016-08-01
We present the basic idea, implementation, measured performance, and performance model of FDPS (Framework for Developing Particle Simulators). FDPS is an application-development framework which helps researchers to develop simulation programs using particle methods for large-scale distributed-memory parallel supercomputers. A particle-based simulation program for distributed-memory parallel computers needs to perform domain decomposition, exchange of particles which are not in the domain of each computing node, and gathering of the particle information in other nodes which are necessary for interaction calculation. Also, even if distributed-memory parallel computers are not used, in order to reduce the amount of computation, algorithms such as the Barnes-Hut tree algorithm or the Fast Multipole Method should be used in the case of long-range interactions. For short-range interactions, some methods to limit the calculation to neighbor particles are required. FDPS provides all of these functions which are necessary for efficient parallel execution of particle-based simulations as "templates," which are independent of the actual data structure of particles and the functional form of the particle-particle interaction. By using FDPS, researchers can write their programs with the amount of work necessary to write a simple, sequential and unoptimized program of O(N²) calculation cost, and yet the program, once compiled with FDPS, will run efficiently on large-scale parallel supercomputers. A simple gravitational N-body program can be written in around 120 lines. We report the actual performance of these programs and the performance model. The weak scaling performance is very good, and almost linear speed-up was obtained for up to the full system of the K computer. The minimum calculation time per timestep is in the range of 30 ms (N = 10⁷) to 300 ms (N = 10⁹). These are currently limited by the time for the calculation of the domain decomposition and communication
FARGO3D: A New GPU-oriented MHD Code
NASA Astrophysics Data System (ADS)
Benítez-Llambay, Pablo; Masset, Frédéric S.
2016-03-01
We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.
Parallel design of JPEG-LS encoder on graphics processing units
NASA Astrophysics Data System (ADS)
Duan, Hao; Fang, Yong; Huang, Bormin
2012-01-01
With recent technical advances in graphics processing units (GPUs), GPUs have outperformed CPUs in terms of compute capability and memory bandwidth. Many successful GPU applications to high performance computing have been reported. JPEG-LS is an ISO/IEC standard for lossless image compression which utilizes adaptive context modeling and run-length coding to improve the compression ratio. However, adaptive context modeling causes data dependency among adjacent pixels, and the run-length coding has to be performed in a sequential way. Hence, using JPEG-LS to compress large-volume hyperspectral image data is quite time-consuming. We implement an efficient parallel JPEG-LS encoder for lossless hyperspectral compression on an NVIDIA GPU using the compute unified device architecture (CUDA) programming technology. We use a block-parallel strategy, as well as such CUDA techniques as coalesced global memory access, parallel prefix sum, and asynchronous data transfer. We also show the relation between GPU speedup and AVIRIS block size, as well as the relation between compression ratio and AVIRIS block size. When AVIRIS images are divided into blocks, each with 64×64 pixels, we gain the best GPU performance, with a 26.3× speedup over the original CPU code.
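The parallel prefix sum such an encoder relies on turns per-block compressed sizes into output offsets. It is shown below with Thrust for brevity; this is an illustrative sketch, and the paper may implement its own CUDA scan.

```cuda
/* Exclusive scan over per-block compressed sizes yields the byte offset
   at which each block writes its output. */
#include <thrust/device_vector.h>
#include <thrust/scan.h>

void block_offsets(const thrust::device_vector<int> &sizes,
                   thrust::device_vector<int> &offsets)
{
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());
}
```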
2014-01-01
Background: The huge quantity of data produced in biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard-to-solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), forming a probability distribution that guides the selection of the term literals. The great versatility that characterizes it makes U-BRAIN applicable in many fields in which there are data to be analyzed. However, the memory and execution time required are of order O(n³) and O(n⁵), respectively, so the algorithm is unaffordable for huge data sets. Results: We found mathematical and programming solutions that lead us towards the implementation of the U-BRAIN algorithm on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use mass memory, and depending on where the data are actually stored, the access times can be quite different. According to the evaluation of algorithmic efficiency based on the Disk Model, in order to
Nguyen, Tuan-Anh; Nakib, Amir; Nguyen, Huy-Nam
2016-06-01
The non-local means denoising filter has been established as a gold standard for the image denoising problem in general, and particularly in medical imaging, due to its efficiency. However, its computation time has limited its use in real-world applications, especially in medical imaging. In this paper, a distributed version on a parallel hybrid architecture is proposed to solve the computation time problem, and a new method to compute the filter coefficients is also proposed, in which we focus on the implementation and the enhancement of the filter parameters by taking the neighborhood of the current voxel more accurately into account. In terms of implementation, our key contribution consists in reducing the number of shared memory accesses. The different tests of the proposed method were performed on the BrainWeb database for different levels of noise. Performance and sensitivity were quantified in terms of speedup, peak signal-to-noise ratio, execution time, and the number of floating point operations. The obtained results demonstrate the efficiency of the proposed method. Moreover, the implementation is compared to that of other techniques recently published in the literature. PMID:27084318
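The shared-memory tiling idea can be illustrated with a simplified 2D non-local means kernel: each block stages a tile plus halo once, then threads compare patches against the staged data instead of re-reading global memory. This is a minimal sketch under assumed parameters (5×5 search window, 3×3 patches, a (TILE+2*HALO)² thread block, hinv = 1/h²); the paper works on 3D volumes and optimizes the access pattern further.

```cuda
#define TILE 16
#define HALO 4   /* covers search radius (2) plus patch radius (1) */

__global__ void nlm_tile(const float *img, float *out, int w, int h, float hinv)
{
    __shared__ float s[TILE + 2 * HALO][TILE + 2 * HALO];
    int lx = threadIdx.x, ly = threadIdx.y;          /* block is 24 x 24 threads */
    int gx = blockIdx.x * TILE + lx - HALO;
    int gy = blockIdx.y * TILE + ly - HALO;
    int cx = min(max(gx, 0), w - 1), cy = min(max(gy, 0), h - 1);
    s[ly][lx] = img[cy * w + cx];                    /* stage tile + halo once */
    __syncthreads();

    /* only interior threads produce output */
    if (lx < HALO || ly < HALO || lx >= TILE + HALO || ly >= TILE + HALO) return;
    if (gx >= w || gy >= h) return;

    float wsum = 0.f, acc = 0.f;
    for (int dy = -2; dy <= 2; ++dy)                 /* 5x5 search window */
        for (int dx = -2; dx <= 2; ++dx) {
            float d = 0.f;
            for (int py = -1; py <= 1; ++py)         /* 3x3 patch distance */
                for (int px = -1; px <= 1; ++px) {
                    float t = s[ly + py][lx + px] - s[ly + dy + py][lx + dx + px];
                    d += t * t;
                }
            float wgt = __expf(-d * hinv);
            wsum += wgt;
            acc  += wgt * s[ly + dy][lx + dx];
        }
    out[gy * w + gx] = acc / wsum;
}
```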
A parallel strategy for implementing real-time expert systems using CLIPS
NASA Technical Reports Server (NTRS)
Ilyes, Laszlo A.; Villaseca, F. Eugenio; Delaat, John
1994-01-01
As evidenced by current literature, there appears to be a continued interest in the study of real-time expert systems. It is generally recognized that speed of execution is only one consideration when designing an effective real-time expert system. Other features one must consider are the expert system's ability to perform temporal reasoning, handle interrupts, prioritize data, contend with data uncertainty, and perform context focusing as dictated by the incoming data to the expert system. This paper presents a strategy for implementing a real-time expert system on the iPSC/860 hypercube parallel computer using CLIPS. The strategy takes into consideration not only the execution time of the software, but also those features which define a true real-time expert system. The methodology is then demonstrated using a practical implementation of an expert system which performs diagnostics on the Space Shuttle Main Engine (SSME). This particular implementation uses an eight-node hypercube to process ten sensor measurements in order to simultaneously diagnose five different failure modes within the SSME. The main program is written in ANSI C and embeds CLIPS to better facilitate and debug the rule-based expert system.
Parallel Implementation of Gamma-Point Pseudopotential Plane-Wave DFT with Exact Exchange
Bylaska, Eric J.; Tsemekhman, Kiril L.; Baden, Scott B.; Weare, John H.; Jonsson, Hannes
2011-01-15
One of the more persistent failures of conventional density functional theory (DFT) methods has been their failure to yield localized charge states such as polarons, excitons and solitons in solid-state and extended systems. It has been suggested that conventional DFT functionals, which are not self-interaction free, tend to favor delocalized electronic states, since self-interaction creates a Coulomb barrier to charge localization. Pragmatic approaches in which the exchange-correlation functionals are augmented with a small amount of exact exchange (hybrid-DFT, e.g. B3LYP and PBE0) have shown promise in localizing charge states and predicting accurate band gaps and reaction barriers. We have developed a parallel algorithm for implementing exact exchange in pseudopotential plane-wave density functional theory, and we have implemented it in the NWChem program package. The technique developed can readily be employed in plane-wave DFT programs. Furthermore, atomic forces and stresses are straightforward to implement, making it applicable to both confined and extended systems, as well as to Car-Parrinello ab initio molecular dynamics simulations. This method has been applied to several systems for which conventional DFT methods do not work well, including calculations of band gaps in oxides and the electronic structure of a charge-trapped state in the Fe(II)-containing mica, annite.
High-level GPU computing with jacket for MATLAB and C/C++
NASA Astrophysics Data System (ADS)
Pryor, Gallagher; Lucey, Brett; Maddipatla, Sandeep; McClanahan, Chris; Melonakos, John; Venugopalakrishnan, Vishwanath; Patel, Krunal; Yalamanchili, Pavan; Malcolm, James
2011-06-01
We describe a software platform for the rapid development of general purpose GPU (GPGPU) computing applications within the MATLAB computing environment, C, and C++: Jacket. Jacket provides thousands of GPU-tuned function syntaxes within MATLAB, C, and C++, including linear algebra, convolutions, reductions, and FFTs as well as signal, image, statistics, and graphics libraries. Additionally, Jacket includes a compiler that translates MATLAB and C++ code to CUDA PTX assembly and OpenGL shaders on demand at runtime. A facility is also included to compile a domain specific version of the MATLAB language to CUDA assembly at build time. Jacket includes the first parallel GPU FOR-loop construction and the first profiler for comparative analysis of CPU and GPU execution times. Jacket provides full GPU compute capability on CUDA hardware and limited, image processing focused compute on OpenGL/ES (2.0 and up) devices for mobile and embedded applications.
CudaChain: an alternative algorithm for finding 2D convex hulls on the GPU.
Mei, Gang
2016-01-01
This paper presents an alternative GPU-accelerated convex hull algorithm and a novel Sorting-based Preprocessing Approach (SPA) for planar point sets. The proposed convex hull algorithm, termed CudaChain, consists of two stages: (1) two rounds of preprocessing performed on the GPU, and (2) the finalization of calculating the expected convex hull on the CPU. Interior points lying inside a quadrilateral formed by four extreme points are first discarded, and the remaining points are then distributed into several (typically four) subregions. Each subset of points is first sorted in parallel; the second round of discarding is then performed using SPA; and finally a simple chain is formed from the points that remain. A simple polygon can be easily generated by directly connecting all the chains in the subregions. The expected convex hull of the input points can finally be obtained by calculating the convex hull of the simple polygon. The library Thrust is utilized to realize the parallel sorting, reduction, and partitioning for better efficiency and simplicity. Experimental results show that: (1) SPA can very effectively detect and discard interior points; and (2) CudaChain achieves 5×-6× speedups over the well-known Qhull implementation for 20M points.
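The Thrust primitives named above can be shown in miniature: sort points by x on the device, then partition out those that fail an interior test. This is an illustrative sketch, not the paper's code; the outside_quad predicate is a placeholder for the quadrilateral test.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/partition.h>

struct pt { float x, y; };

struct by_x {   /* comparator for parallel sort along x */
    __host__ __device__ bool operator()(const pt &a, const pt &b) const
    { return a.x < b.x; }
};

struct outside_quad {   /* placeholder interior test (unit disk here) */
    __host__ __device__ bool operator()(const pt &p) const
    { return p.x * p.x + p.y * p.y > 1.0f; }
};

void preprocess(thrust::device_vector<pt> &pts)
{
    thrust::sort(pts.begin(), pts.end(), by_x());
    /* candidate hull points move to the front; interior points behind */
    thrust::stable_partition(pts.begin(), pts.end(), outside_quad());
}
```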
GPU-based ultra-fast dose calculation using a finite size pencil beam model
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B.
2009-10-01
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using a NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
An implementation of a tree code on a SIMD, parallel computer
NASA Technical Reports Server (NTRS)
Olson, Kevin M.; Dorband, John E.
1994-01-01
We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,536 particles. We also simulate the formation of structure in an expanding model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
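The sort-based construction lends itself to a compact sketch. The following hypothetical CUDA/Thrust code (a modern analogue, not the MasPar SIMD implementation) sorts each segment along alternating coordinates and splits at the median, which is what keeps the tree completely balanced:

    // Flavor of sort-based balanced-tree construction with assumed names/data.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdio>
    #include <cstdlib>

    struct P3 { float c[3]; };

    struct ByAxis {
        int axis;
        __host__ __device__ bool operator()(const P3& a, const P3& b) const {
            return a.c[axis] < b.c[axis];
        }
    };

    // Recursively sort each segment along alternating axes and split at the
    // median; every split halves the particle count, so the tree is completely
    // balanced and sorted segments keep good spatial locality.
    void buildTree(thrust::device_vector<P3>& p, int lo, int hi, int axis) {
        if (hi - lo <= 1) return;                      // single particle: leaf
        thrust::sort(p.begin() + lo, p.begin() + hi, ByAxis{axis});
        int mid = (lo + hi) / 2;                       // median split
        buildTree(p, lo, mid, (axis + 1) % 3);         // x -> y -> z -> x ...
        buildTree(p, mid, hi, (axis + 1) % 3);
    }

    int main() {
        thrust::host_vector<P3> h(1 << 10);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = {{(float)rand()/RAND_MAX, (float)rand()/RAND_MAX,
                     (float)rand()/RAND_MAX}};
        thrust::device_vector<P3> p = h;
        buildTree(p, 0, (int)p.size(), 0);
        printf("built balanced tree over %lu particles\n", (unsigned long)p.size());
        return 0;
    }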
Implementation of a Parallel High-Performance Visualization Technique in GRASS GIS
Sorokine, Alexandre
2007-01-01
This paper describes an extension for GRASS GIS that enables users to perform geographic visualization tasks on tiled high-resolution displays powered by clusters of commodity personal computers. Parallel visualization systems are becoming more common in scientific computing due to decreasing hardware costs and the availability of open-source software to support such architectures. High-resolution displays allow scientists to visualize very large datasets with minimal loss of detail. Such systems hold great promise, especially in the field of geographic information systems, because users can naturally combine several geographic scales on a single display. The paper discusses the architecture, implementation and operation of pd-GRASS - a GRASS GIS extension for high-performance parallel visualization on tiled displays. pd-GRASS is particularly well suited for very large geographic datasets such as LIDAR data or high-resolution nation-wide geographic databases. The paper also briefly touches on computational efficiency, performance and potential applications for such systems.
NASA Astrophysics Data System (ADS)
Seny, Bruno; Lambrechts, Jonathan; Toulorge, Thomas; Legat, Vincent; Remacle, Jean-François
2014-01-01
Although explicit time integration schemes require small computational efforts per time step, their efficiency is severely restricted by their stability limits. Indeed, the multi-scale nature of some physical processes combined with highly unstructured meshes can lead some elements to impose a severely small stable time step for the global problem. Multirate methods offer a way to increase the global efficiency by gathering grid cells in appropriate groups under local stability conditions. These methods are well suited to the discontinuous Galerkin framework. The parallelization of the multirate strategy is challenging because grid cells have different workloads. The computational cost differs for each sub-time step depending on the elements involved, and a classical partitioning strategy is no longer adequate. In this paper, we propose a solution that makes use of multi-constraint mesh partitioning. It tends to minimize inter-processor communication while ensuring that the workload is almost equally shared by every computer core at every stage of the algorithm. Particular attention is given to the simplicity of the parallel multirate algorithm while minimizing computational and communication overheads. Our implementation makes use of the MeTiS library for mesh partitioning and the Message Passing Interface for inter-processor communication. Performance analyses for two- and three-dimensional practical applications confirm that multirate methods preserve the important computational advantages of explicit methods up to a significant number of processors.
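The multi-constraint partitioning step can be illustrated with the MeTiS C API. In the hedged sketch below, each cell carries two weights standing for the work it contributes in two multirate stages; the graph and weights are made up for the example:

    /* Hypothetical sketch: multi-constraint partitioning with MeTiS. Each cell
     * carries ncon = 2 weights (e.g., work in the "fast" and "slow" multirate
     * stages); METIS balances every constraint while minimizing the edge cut,
     * which models inter-processor communication. Graph below: a 1D chain. */
    #include <metis.h>
    #include <stdio.h>

    int main(void) {
        idx_t nvtxs = 6, ncon = 2, nparts = 2;
        /* CSR adjacency of the chain 0-1-2-3-4-5 */
        idx_t xadj[]   = {0, 1, 3, 5, 7, 9, 10};
        idx_t adjncy[] = {1, 0, 2, 1, 3, 2, 4, 3, 5, 4};
        /* Two weights per vertex: [fast-stage work, slow-stage work] */
        idx_t vwgt[] = {4,1, 4,1, 1,4, 1,4, 4,1, 1,4};
        idx_t objval, part[6];
        int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy, vwgt,
                                         NULL, NULL, &nparts, NULL, NULL, NULL,
                                         &objval, part);
        if (status != METIS_OK) return 1;
        printf("edge cut = %d, parts:", (int)objval);
        for (int i = 0; i < 6; ++i) printf(" %d", (int)part[i]);
        printf("\n");
        return 0;
    }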
Implementation and parallelization of fast matrix multiplication for a fast Legendre transform
Chen, Wentao
1993-09-01
An algorithm was presented by Alpert and Rokhlin for the rapid evaluation of Legendre transforms. The fast algorithm can be expressed as a matrix-vector product followed by a fast cosine transform. Using the Chebyshev expansion to approximate the entries of the matrix and exchanging the order of summations reduces the time complexity of the computation from O(n^2) to O(n log n), where n is the size of the input vector. Our work has focused on the implementation and parallelization of the fast matrix-vector product algorithm. Results have shown the expected performance of the algorithm. Precision problems which arise as n becomes large can be resolved by doubling the precision of the calculation.
Angle- and distance-constrained matcher with parallel implementations for model-based vision
NASA Astrophysics Data System (ADS)
Anhalt, David J.; Raney, Steven; Severson, William E.
1992-02-01
The matching component of a model-based vision system hypothesizes one-to-one correspondences between 2D image features and locations on the 3D model. As part of Wright Laboratory's ARAGTAP program [a synthetic aperture radar (SAR) object recognition program], we developed a matcher that searches for feature matches based on the hypothesized object type and aspect angle. Search is constrained by the presumed accuracy of the hypothesized aspect angle and scale. These constraints reduce the search space for matches, thus improving match performance and quality. The algorithm is presented and compared with a matcher based on geometric hashing. Parallel implementations on commercially available shared memory MIMD machines, distributed memory MIMD machines, and SIMD machines are presented and contrasted.
SU-D-207-04: GPU-Based 4D Cone-Beam CT Reconstruction Using Adaptive Meshing Method
Zhong, Z; Gu, X; Iyengar, P; Mao, W; Wang, J; Guo, X
2015-06-15
Purpose: Due to the limited number of projections at each phase, the image quality of a four-dimensional cone-beam CT (4D-CBCT) is often degraded, which decreases the accuracy of subsequent motion modeling. One of the promising methods is the simultaneous motion estimation and image reconstruction (SMEIR) approach. The objective of this work is to enhance the computational speed of the SMEIR algorithm using adaptive feature-based tetrahedral meshing and GPU-based parallelization. Methods: The first step is to generate a tetrahedral mesh based on the features of a reference phase of the 4D-CBCT, so that the deformation can be well captured and accurately diffused from the mesh vertices to the voxels of the image volume. After mesh generation, the updated motion model and the other phases of the 4D-CBCT can be obtained by matching the 4D-CBCT projection images at each phase with the corresponding forward projections of the deformed reference phase. The entire 4D-CBCT reconstruction process is implemented on the GPU, significantly increasing computational efficiency thanks to its tremendous parallel computing capability. Results: A 4D XCAT digital phantom was used to test the proposed mesh-based image reconstruction algorithm. The resulting images show that both bone structures and the inside of the lung are well preserved and that the tumor position is well captured. Compared to the previous voxel-based CPU implementation of SMEIR, the proposed method is about 157 times faster for reconstructing a 10-phase 4D-CBCT with dimensions 256×256×150. Conclusion: The GPU-based parallel 4D-CBCT reconstruction method uses a feature-based mesh for estimating the motion model and demonstrates image quality equivalent to the previous voxel-based SMEIR approach, with significantly improved computational speed.
NASA Astrophysics Data System (ADS)
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA's graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with Fermi architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
Known-plaintext attack on the double phase encoding and its implementation with parallel hardware
NASA Astrophysics Data System (ADS)
Wei, Hengzheng; Peng, Xiang; Liu, Haitao; Feng, Songlin; Gao, Bruce Z.
2008-03-01
A known-plaintext attack on the double phase encryption scheme, implemented with parallel hardware, is presented. Double random phase encoding (DRPE) is one of the most representative optical cryptosystems; it was developed in the mid-1990s and has spawned quite a few variants since then. Although the DRPE encryption system is strongly resistant to brute-force attack, the inherent architecture of DRPE leaves a hidden weakness due to its linear nature. Recently the real security strength of this opto-cryptosystem has been doubted and analyzed from the cryptanalysis point of view. In this presentation, we demonstrate that optical cryptosystems based on the DRPE architecture are vulnerable to known-plaintext attack. With this attack the two encryption keys in the DRPE can be accessed with the help of the phase retrieval technique. In our approach, we adopt the hybrid input-output (HIO) algorithm to recover the random phase key in the object domain and then infer the key in the frequency domain. A single plaintext-ciphertext pair is sufficient to create the vulnerability; moreover, the attack does not require a particular choice of plaintext. The phase retrieval technique based on HIO is an iterative process of repeated Fourier transforms, so it fits very well into a hardware implementation on a digital signal processor (DSP). We make use of a high-performance DSP to accomplish the known-plaintext attack. Compared with the software implementation, the hardware implementation is much faster. The performance of this DSP-based cryptanalysis system is also evaluated.
Fast phase processing in off-axis holography by CUDA including parallel phase unwrapping.
Backoach, Ohad; Kariv, Saar; Girshovitz, Pinhas; Shaked, Natan T
2016-02-22
We present a parallel processing implementation for rapid extraction of quantitative phase maps from off-axis holograms on the graphics processing unit (GPU) of the computer using compute unified device architecture (CUDA) programming. To obtain an efficient implementation, we parallelized both the wrapped phase map extraction algorithm and the two-dimensional phase unwrapping algorithm. In contrast to previous implementations, we utilized an unweighted least squares phase unwrapping algorithm that better suits parallelism. We compared the run times of the proposed algorithm on the CPU and the GPU of the computer for various sizes of off-axis holograms. Using the GPU implementation, we extracted the unwrapped phase maps from the recorded off-axis holograms at 35 frames per second (fps) for 4-megapixel holograms and at 129 fps for 1-megapixel holograms, which represents the fastest processing frame rates obtained so far, to the best of our knowledge. We then used common-path off-axis interferometric imaging to quantitatively capture the phase maps of a micro-organism with rapid flagellum movements. PMID:26906982
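For reference, the unweighted least-squares unwrapping family admits a classic closed-form solution via discrete cosine transforms (in the style of Ghiglia and Romero, 1994); assuming this is the variant the authors build on, the unwrapped phase minimizes a quadratic in the wrapped gradient differences and is obtained by a pointwise divide in the DCT domain, sketched in LaTeX below:

    % Unweighted least-squares phase unwrapping (Ghiglia-Romero style):
    % the unwrapped phase \phi minimizes
    \min_{\phi}\;\sum_{i,j}\Big[(\phi_{i+1,j}-\phi_{i,j}-\Delta^{x}_{i,j})^{2}
                               +(\phi_{i,j+1}-\phi_{i,j}-\Delta^{y}_{i,j})^{2}\Big],
    % where \Delta^{x},\Delta^{y} are wrapped differences of the measured phase.
    % The minimizer satisfies a discrete Poisson equation with driving term
    \rho_{i,j}=(\Delta^{x}_{i,j}-\Delta^{x}_{i-1,j})+(\Delta^{y}_{i,j}-\Delta^{y}_{i,j-1}),
    % solved pointwise in the 2D DCT domain:
    \hat{\phi}_{m,n}=\frac{\hat{\rho}_{m,n}}{2\cos(\pi m/M)+2\cos(\pi n/N)-4},
    \qquad (m,n)\neq(0,0).

Every step here is either a transform (DCTs map onto FFT primitives) or an elementwise operation, which is presumably why the unweighted variant "better suits parallelism" than iterative weighted unwrappers.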
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with numerical human models using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using the Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce the run time in comparison with a conventional CPU, even for a naive GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
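A minimal sketch of the kind of kernel involved, assuming a cubic Yee grid and simple vacuum updates (illustrative, not the authors' code): each thread owns one cell and updates Hx from the curl of E; the remaining field components follow the same pattern.

    #include <cuda_runtime.h>
    #include <cstdio>

    #define N 64
    #define IDX(i,j,k) ((i) + N*((j) + N*(k)))

    __global__ void updateHx(float* hx, const float* ey, const float* ez,
                             float dtmu, float dy, float dz) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int k = blockIdx.z*blockDim.z + threadIdx.z;
        if (i >= N || j >= N-1 || k >= N-1) return;   // stay inside the grid
        float curl = (ez[IDX(i,j+1,k)] - ez[IDX(i,j,k)]) / dy
                   - (ey[IDX(i,j,k+1)] - ey[IDX(i,j,k)]) / dz;
        hx[IDX(i,j,k)] -= dtmu * curl;   // H half-step: H -= (dt/mu) * curl E
    }

    int main() {
        size_t bytes = (size_t)N*N*N*sizeof(float);
        float *hx, *ey, *ez;
        cudaMalloc(&hx, bytes); cudaMalloc(&ey, bytes); cudaMalloc(&ez, bytes);
        cudaMemset(hx, 0, bytes); cudaMemset(ey, 0, bytes); cudaMemset(ez, 0, bytes);
        dim3 block(8,8,8), grid(N/8, N/8, N/8);
        for (int step = 0; step < 100; ++step)
            updateHx<<<grid, block>>>(hx, ey, ez, 0.5f, 1.0f, 1.0f);
        cudaDeviceSynchronize();
        printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(hx); cudaFree(ey); cudaFree(ez);
        return 0;
    }

The thread-block shape (here 8×8×8) is exactly the tuning knob the abstract refers to: it changes the memory access pattern and hence the GPU/CPU speed ratio.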
GPU-accelerated adaptive particle splitting and merging in SPH
NASA Astrophysics Data System (ADS)
Xiong, Qingang; Li, Bo; Xu, Ji
2013-07-01
A graphics processing unit (GPU) implementation of adaptive particle splitting and merging (APS) in the framework of smoothed particle hydrodynamics (SPH) is presented. The particle splitting and merging processes are carried out based on a prescribed criterion. A multiple time stepping technique is used to further reduce the computational cost. Detailed implementations on both single and multiple GPUs are discussed. A benchmark test, flow past fixed periodic circles, is simulated to investigate the accuracy and speed of the algorithm. Precision comparable to a uniformly fine simulation is achieved by APS, whereas the computational demand is reduced considerably. Satisfactory speedup and acceptable scalability are obtained, demonstrating that GPU-accelerated APS is a promising tool to speed up large-scale particle-based simulations.
Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures.
Smelyanskiy, Mikhail; Holmes, David; Chhugani, Jatin; Larson, Alan; Carmean, Douglas M; Hanson, Dennis; Dubey, Pradeep; Augustine, Kurt; Kim, Daehyun; Kyker, Alan; Lee, Victor W; Nguyen, Anthony D; Seiler, Larry; Robb, Richard
2009-01-01
Medical volumetric imaging requires high fidelity, high performance rendering algorithms. We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major categories of volume rendering algorithms and confirm through an imaging scientist-guided evaluation that ray-casting is the most acceptable. We describe a thread- and data-parallel implementation of ray-casting that makes it amenable to key architectural trends of three modern commodity parallel architectures: multi-core, GPU, and an upcoming many-core Intel architecture code-named Larrabee. We achieve more than an order of magnitude performance improvement on a number of large 3D medical datasets. We further describe a data compression scheme that significantly reduces data-transfer overhead. This allows our approach to scale well to large numbers of Larrabee cores.
A fast and high-quality cone beam reconstruction pipeline using the GPU
NASA Astrophysics Data System (ADS)
Schiwietz, Thomas; Bose, Supratik; Maltz, Jonathan; Westermann, Rüdiger
2007-03-01
Cone beam scanners have evolved rapidly in the past years. The increasing sampling resolution of the projection images and the desire to reconstruct high-resolution output volumes increase both the memory consumption and the processing time considerably. In order to keep the processing time down, new strategies for memory management are required, as well as new algorithmic implementations of the reconstruction pipeline. In this paper, we present a fast and high-quality cone beam reconstruction pipeline using the Graphics Processing Unit (GPU). This pipeline includes the backprojection process as well as pre-filtering and post-filtering stages. In particular, we focus on a subset of five stages, but more stages can be integrated easily. In the pre-filtering stage, we first reduce the amount of noise in the acquired projection images by a non-linear curvature-based smoothing algorithm. Then, we apply a high-pass filter as required by the inverse Radon transform. Next, the backprojection pass reconstructs a raw 3D volume. In post-processing, we first filter the volume with a ring-artifact removal step. Then, we remove cupping artifacts with our novel uniformity correction algorithm. We present the algorithm in detail. In order to execute the pipeline as quickly as possible, we take advantage of GPUs, which have proven to be very fast parallel processors for numerical problems. Unfortunately, both the projection images and the reconstruction volume are too large to fit into 512 MB of GPU memory. Therefore, we present an efficient memory management strategy that minimizes the bus transfer between main memory and GPU memory. Our results show a 4× performance gain over a highly optimized CPU implementation using SSE2/3 commands. At the same time, the image quality is comparable to the CPU results, with an average per-pixel difference of 10^-5.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more pronounced at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our simulation capability to sizes of L = 32, 64 on a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
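The exchange and temperature-insertion logic can be sketched on the host side. In the illustrative C code below (not the paper's implementation), adjacent replicas swap with the standard Metropolis probability p = min(1, exp((b1-b2)(E1-E2))) for inverse temperatures b1, b2, and a mid-point temperature is inserted into the gap with the worst measured exchange rate:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Swap attempt between adjacent replicas at inverse temperatures b1, b2. */
    static int try_swap(double b1, double b2, double E1, double E2) {
        double p = exp((b1 - b2) * (E1 - E2));
        return (double)rand() / RAND_MAX < p;     /* p > 1 always accepts */
    }

    /* Insert a mid-point temperature into the gap with the worst exchange rate;
     * rate[g] is the measured swap acceptance between T[g] and T[g+1]. */
    static int insert_midpoint(double* T, int n, const double* rate) {
        int worst = 0;
        for (int g = 1; g < n - 1; ++g)
            if (rate[g] < rate[worst]) worst = g;
        double mid = 0.5 * (T[worst] + T[worst + 1]);
        for (int i = n; i > worst + 1; --i) T[i] = T[i - 1];
        T[worst + 1] = mid;
        return n + 1;                             /* new temperature count */
    }

    int main(void) {
        double T[8] = {1.0, 1.2, 1.5, 2.0};
        double rate[3] = {0.4, 0.05, 0.3};        /* gap 1 is the bottleneck */
        int n = insert_midpoint(T, 4, rate);      /* inserts T = 1.35 */
        for (int i = 0; i < n; ++i) printf("%g ", T[i]);
        printf("\nswap accepted: %d\n", try_swap(1.0/1.0, 1.0/1.2, -300.0, -280.0));
        return 0;
    }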
Implementation of a parallel algorithm for thermo-chemical nonequilibrium flow simulations
Wong, C.C.; Blottner, F.G.; Payne, J.L.; Soetrisno, M.
1995-01-01
Massively parallel (MP) computing is considered to be the future direction of high performance computing. When engineers apply this new MP computing technology to solve large-scale problems, one major interest is the maximum problem size that an MP computer can handle. To determine the maximum size, it is important to address the code scalability issue. Scalability implies whether the code can provide an increase in performance proportional to an increase in problem size. If the problem size increases while more computer nodes are utilized, the ideal elapsed time to simulate the problem should not increase much. Hence one important task in the development of MP computing technology is to ensure scalability. A scalable code is an efficient code. In order to obtain good scaled performance, it is necessary to first optimize the code for single-node performance before proceeding to a large-scale simulation with a large number of computer nodes. This paper discusses the implementation of a massively parallel computing strategy and the process of optimization to improve the scaled performance. Specifically, we look at domain decomposition, resource management in the code, communication overhead, and problem mapping. By incorporating these improvements and adopting an efficient MP computing strategy, efficiencies of about 85% and 96% have been achieved using 64 nodes on MP computers for perfect gas and chemically reactive gas problems, respectively. A comparison of the performance between MP computers and a vectorized computer, such as the Cray Y-MP, is also presented.
The parallel implementation of the one-dimensional Fourier transformed Vlasov Poisson system
NASA Astrophysics Data System (ADS)
Eliasson, Bengt
2005-08-01
A parallel implementation of an algorithm for solving the one-dimensional, Fourier transformed Vlasov-Poisson system of equations is documented, together with the code structure, file formats and settings to run the code. The properties of the Fourier transformed Vlasov-Poisson system are discussed in connection with the numerical solution of the system. The Fourier method in velocity space is used to treat numerical problems arising due to the filamentation of the solution in velocity space. Outflow boundary conditions in the Fourier transformed velocity space remove the highest oscillations in velocity space. A fourth-order compact Padé scheme is used to calculate derivatives in the Fourier transformed velocity space, and spatial derivatives are calculated with a pseudo-spectral method. The parallel algorithms used are described in more detail, in particular the parallel solver for the tri-diagonal systems occurring in the Padé scheme. Program summary: Title of program: vlasov Catalogue identifier: ADVQ Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADVQ Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Operating systems under which the program has been tested: Sun Solaris; HP-UX; Red Hat Linux Programming language used: FORTRAN 90 with Message Passing Interface (MPI) Computers: Sun Ultra Sparc; HP 9000/785; HP IPF (Itanium Processor Family) ia64 Cluster; PC cluster Number of lines in distributed program, including test data, etc.: 3737 Number of bytes in distributed program, including test data, etc.: 18 772 Distribution format: tar.gz Nature of physical problem: Kinetic simulations of collisionless electron-ion plasmas. Method of solution: A Fourier method in velocity space, a pseudo-spectral method in space and a fourth-order Runge-Kutta scheme in time. Memory required to execute with typical data: Uses typically of the order of 10^5-10^6 double precision numbers. Restriction on the complexity of the problem: The program uses
Ultrafast convolution/superposition using tabulated and exponential kernels on GPU
Chen Quan; Chen Mingli; Lu Weiguo
2011-03-15
Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphics processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write conflicts and has linear computational complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs), as reported in the literature. The authors have demonstrated that the use of exponential kernels can reduce the computational complexity by one dimension and achieve excellent accuracy. Special attention is paid to the unique architecture of the GPU, especially the memory access pattern, which improves performance by more than tenfold. Results: As a result, the tabulated kernel implementation on the GPU is two to three times faster than other GPU implementations reported in the literature. The implementation of CCCS showed significant speedup on the GPU over a single-core CPU. With tabulated CCKs, speedups as high as 70 are observed; with exponential CCKs, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCKs is 1000-3000 times faster than a highly optimized single-threaded CPU implementation using tabulated CCKs, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.
The Design and Implementation of hypre, a Library of Parallel High Performance Preconditioners
Falgout, R D; Jones, J E; Yang, U M
2004-07-17
The increasing demands of computationally challenging applications and the advance of larger, more powerful computers with more complicated architectures have necessitated the development of new solvers and preconditioners. Since the implementation of these methods is quite complex, the use of high performance libraries with the newest efficient solvers and preconditioners becomes more important for promulgating their use into applications with relative ease. The hypre library [14, 17] has been designed with the primary goal of providing users with advanced scalable parallel preconditioners. Issues of robustness, ease of use, flexibility and interoperability have also been important. It can be used both as a solver package and as a framework for algorithm development. Its object model is more general and flexible than most current generation solver libraries [9]. hypre also provides several of the most commonly used solvers, such as conjugate gradient for symmetric systems or GMRES for nonsymmetric systems, to be used in conjunction with the preconditioners. Design innovations have been made to enable access to the library in the way that application users naturally think about their problems. For example, application developers that use structured grids typically think of their problems in terms of stencils and grids. hypre's users do not have to learn complicated sparse matrix structures; instead hypre does the work of building these data structures through various conceptual interfaces. The conceptual interfaces currently implemented include stencil-based structured and semi-structured interfaces, a finite-element based unstructured interface, and a traditional linear-algebra based interface. The primary focus of this paper is on the design and implementation of the conceptual interfaces in hypre. The paper is organized as follows. The first two sections are of general interest. We begin in Section 2 with an introductory discussion of conceptual interfaces and
Parallel computing-based sclera recognition for human identification
NASA Astrophysics Data System (ADS)
Lin, Yong; Du, Eliza Y.; Zhou, Zhi
2012-06-01
Compared to iris recognition, sclera recognition, which uses a line descriptor, can achieve comparable recognition accuracy in visible wavelengths. However, this method is too time-consuming to be implemented in a real-time system. In this paper, we propose a GPU-based parallel computing approach to reduce the sclera recognition time. We define a new descriptor to which the information of the KD tree structure and the sclera edge is added. The registration and matching task is divided into subtasks of various sizes according to their computational complexities. Affine transform parameters are generated by searching the KD tree. Texture memory, constant memory, and shared memory are used to store templates and transform matrices. The experimental results show that the proposed method executed on a GPU can dramatically improve the sclera matching speed by hundreds of times without decreasing accuracy.
NASA Technical Reports Server (NTRS)
Juang, Hann-Ming Henry; Tao, Wei-Kuo; Zeng, Xi-Ping; Shie, Chung-Lin; Simpson, Joanne; Lang, Steve
2004-01-01
The capability for massively parallel programming (MPP) using a message passing interface (MPI) has been implemented into a three-dimensional version of the Goddard Cumulus Ensemble (GCE) model. The design for MPP with MPI uses the concept of maintaining a similar code structure between the whole domain and the portions after decomposition. Hence the model follows the same integration for single and multiple tasks (CPUs). It also requires minimal changes to the original code, so it is easily modified and/or managed by the model developers and users who have little knowledge of MPP. The entire model domain can be sliced into a one- or two-dimensional decomposition with a halo regime, which is overlaid on the partial domains. The halo regime requires that no data be fetched across tasks during the computational stage, but it must be updated before the next computational stage through data exchange via MPI. For reproducibility, transposing data among tasks is required for the spectral transform (Fast Fourier Transform, FFT), which is used in the anelastic version of the model for solving the pressure equation. The performance of the MPI-implemented codes (i.e., the compressible and anelastic versions) was tested on three different computing platforms. The major results are: 1) both versions achieve parallel efficiencies of about 99% up to 256 tasks, but not for 512 tasks; 2) the anelastic version has better speedup and efficiency because it requires more computation than the compressible version; 3) equal or approximately equal numbers of slices in the x- and y-directions provide the fastest integration due to fewer data exchanges; and 4) one-dimensional slices in the x-direction result in the slowest integration due to the need for more memory relocation for computation.
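The halo update described above reduces to paired sends and receives. A minimal one-dimensional sketch with assumed array names follows (not the GCE code, which also supports two-dimensional decomposition): each task exchanges one ghost cell with its east and west neighbours before the next computational stage.

    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 128               /* interior width owned by this task */

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double u[NLOC + 2];        /* u[0] and u[NLOC+1] are the halo cells */
        for (int i = 1; i <= NLOC; ++i) u[i] = rank;

        int west = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int east = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send first owned cell west, receive east halo, and vice versa. */
        MPI_Sendrecv(&u[1],      1, MPI_DOUBLE, west, 0,
                     &u[NLOC+1], 1, MPI_DOUBLE, east, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC],   1, MPI_DOUBLE, east, 1,
                     &u[0],      1, MPI_DOUBLE, west, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d halos: west=%g east=%g\n", rank, u[0], u[NLOC+1]);
        MPI_Finalize();
        return 0;
    }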
NASA Astrophysics Data System (ADS)
Bernabe, Sergio; Igual, Francisco D.; Botella, Guillermo; Prieto-Matias, Manuel; Plaza, Antonio
2015-10-01
In the last decade, the issue of endmember variability has received considerable attention, particularly when each pixel is modeled as a linear combination of endmembers or pure materials. As a result, several models and algorithms have been developed to consider the effect of endmember variability in spectral unmixing and possibly include multiple endmembers in the spectral unmixing stage. One of the most popular approaches for this purpose is the multiple endmember spectral mixture analysis (MESMA) algorithm. The procedure executed by MESMA can be summarized as follows: (i) first, a standard linear spectral unmixing (LSU) or fully constrained linear spectral unmixing (FCLSU) algorithm is run in an iterative fashion; (ii) then, different endmember combinations, randomly selected from a spectral library, are used to decompose each mixed pixel; (iii) finally, the model with the best fit, i.e., with the lowest root mean square error (RMSE) in the reconstruction of the original pixel, is adopted. However, this procedure can be computationally very expensive because several endmember combinations need to be tested and several abundance estimation steps need to be conducted, a fact that compromises the use of MESMA in applications under real-time constraints. In this paper we develop (for the first time in the literature) an efficient implementation of MESMA on different platforms using OpenCL, an open standard for parallel programming on heterogeneous systems. Our experiments have been conducted using a simulated data set and the clMAGMA mathematical library. Implementations in the same descriptive language on different architectures are very important in order to assess the feasibility of using heterogeneous platforms for efficient hyperspectral image processing in real remote sensing missions.
GPU-based four-dimensional general-relativistic ray tracing
NASA Astrophysics Data System (ADS)
Kuchelmeister, Daniel; Müller, Thomas; Ament, Marco; Wunner, Günter; Weiskopf, Daniel
2012-10-01
This paper presents a new general-relativistic ray tracer that enables image synthesis on an interactive basis by exploiting the performance of graphics processing units (GPUs). The application is capable of visualizing the distortion of the stellar background as well as trajectories of moving astronomical objects orbiting a compact mass. Its source code includes metric definitions for the Schwarzschild and Kerr spacetimes that can be easily extended to other metric definitions, relying on its object-oriented design. The basic functionality features a scene description interface based on the scripting language Lua, real-time image output, and the ability to edit almost every parameter at runtime. The ray tracing code itself is implemented for parallel execution on the GPU using NVidia's Compute Unified Device Architecture (CUDA), which leads to performance improvement of an order of magnitude compared to a single CPU and makes the application competitive with small CPU cluster architectures. Program summary Program title: GpuRay4D Catalog identifier: AEMV_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMV_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 73649 No. of bytes in distributed program, including test data, etc.: 1334251 Distribution format: tar.gz Programming language: C++, CUDA. Computer: Linux platforms with a NVidia CUDA enabled GPU (Compute Capability 1.3 or higher), C++ compiler, NVCC (The CUDA Compiler Driver). Operating system: Linux. RAM: 2 GB Classification: 1.5. External routines: OpenGL Utility Toolkit development files, NVidia CUDA Toolkit 3.2, Lua5.2 Nature of problem: Ray tracing in four-dimensional Lorentzian spacetimes. Solution method: Numerical integration of light rays, GPU-based parallel programming using CUDA, 3D
Multi-GPU three dimensional Stokes solver for simulating glacier flow
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2016-04-01
Here we present how we have recently developed a three-dimensional Stokes solver on GPUs and apply it to glacier flow. We numerically solve the Stokes momentum balance equations together with the incompressibility equation, while also taking into account strong nonlinearities in the ice rheology. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme with preconditioning of residuals. The differential equations are discretized on a regular staggered grid. We have ported it to C-CUDA to run it on GPUs in parallel, using MPI. We demonstrate the accuracy and efficiency of the developed model by the manufactured analytical solution test for three-dimensional Stokes ice sheet models (Leng et al., 2013) and by comparison with other well-established ice sheet models on the diagnostic ISMIP-HOM benchmark experiments (Pattyn et al., 2008). The results show that the developed model is capable of accurately and efficiently solving the Stokes system of equations in a variety of test scenarios, while preserving good parallel efficiency on up to 80 GPUs. For example, in 3D test scenarios with 250,000 grid points our solver converges in around 3 minutes for single precision computations and around 10 minutes for double precision computations. We have also optimized the developed code to run efficiently on our newly acquired state-of-the-art GPU cluster octopus. This allows us to solve our problem on more than 20 million grid points by just increasing the number of GPUs used, while keeping the computation time the same. In future work we will apply our solver to real-world applications and implement free surface evolution capabilities. References: Leng, W., Ju, L., Gunzburger, M. & Price, S., 2013. Manufactured solutions and the verification of three-dimensional Stokes ice-sheet models. Cryosphere 7, 19-29. Pattyn, F., Perichon, L., Aschwanden, A., Breuer, B., de Smedt, B., Gagliardini, O., Gudmundsson, G.H., Hindmarsh, R
Hermenegildo, M.V.
1986-01-01
The term Logic Programming refers to a variety of computer languages and execution models based on the traditional concept of Symbolic Logic. The expressive power of these languages offers promise to be of great assistance in facing the programming challenges of present and future symbolic processing applications in artificial intelligence, knowledge-based systems, and many other areas of computing. This dissertation presents an efficient parallel execution model for logic programs. The model is described from the source language level down to an Abstract Machine level, suitable for direct implementation on existing parallel systems or for the design of special purpose parallel architectures. Few assumptions are made at the source language level and, therefore, the techniques developed and the general Abstract Machine design are applicable to a variety of logic (and also functional) languages. These techniques offer efficient solutions to several areas of parallel Logic Programming implementation previously considered problematic or a source of considerable overhead, such as the detection and handling of variable binding conflicts in AND-parallelism, the specification of control and management of the execution tree, the treatment of distributed backtracking, and goal scheduling and memory management issues, etc. A parallel Abstract Machine design is offered, specifying data areas, operation, and a suitable instruction set.
GPU applications for data processing
Vladymyrov, Mykhailo; Aleksandrov, Andrey; Tioukov, Valeri
2015-12-31
Modern experiments that use nuclear photoemulsion require fast and efficient data acquisition from the emulsion. The new approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use graphics processing unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us raise the scanning speed by a factor of nine.
Finite Difference Elastic Wave Field Simulation On GPU
NASA Astrophysics Data System (ADS)
Hu, Y.; Zhang, W.
2011-12-01
Numerical modeling of seismic wave propagation is considered a basic and important aspect of the investigation of the Earth's structure and earthquake phenomena. Among the various numerical methods, the finite-difference method is considered one of the most efficient tools for wave field simulation. However, with increasing computing scale, computing power has become a bottleneck. With the development of hardware in recent years, the GPU has shown powerful computational ability and bright application prospects in scientific computing, and many works demonstrate that GPUs are powerful; however, GPUs have not yet been widely used for wave field simulation. In this work, we present forward finite difference simulation of acoustic and elastic seismic wave propagation in heterogeneous media on NVIDIA graphics cards with the CUDA programming language. We also implement perfectly matched layers on the graphics cards to efficiently absorb outgoing waves at the fictitious edges of the grid. Comparisons with results on a CPU platform show reliable accuracy and remarkable efficiency. This work proves that the GPU can be an effective platform for wave field simulation, and that it can also be used as a practical tool for real-time strong ground motion simulation.
Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; Kowalski, Karol; Agrawal, Gagan
2013-03-01
Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
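Index permutation, one of the challenges named above, is itself a simple data-parallel kernel. The sketch below (made-up extents and permutation, not the generator's output) reorders A(i,j,k,l) into B(k,i,l,j) with one thread per element:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define DI 8
    #define DJ 10
    #define DK 12
    #define DL 14

    __global__ void permute_ijkl_to_kilj(const double* A, double* B) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int n = DI * DJ * DK * DL;
        if (t >= n) return;
        // unravel t as (i,j,k,l) in row-major A
        int l = t % DL, k = (t / DL) % DK;
        int j = (t / (DL * DK)) % DJ, i = t / (DL * DK * DJ);
        // row-major offset of (k,i,l,j) in B, whose extents are (DK,DI,DL,DJ)
        int o = ((k * DI + i) * DL + l) * DJ + j;
        B[o] = A[t];
    }

    int main() {
        int n = DI * DJ * DK * DL;
        double *A, *B;
        cudaMallocManaged(&A, n * sizeof(double));
        cudaMallocManaged(&B, n * sizeof(double));
        for (int t = 0; t < n; ++t) A[t] = t;
        permute_ijkl_to_kilj<<<(n + 255) / 256, 256>>>(A, B);
        cudaDeviceSynchronize();
        printf("B[0]=%g (was A[0]=%g)\n", B[0], A[0]);
        cudaFree(A); cudaFree(B);
        return 0;
    }

Kernels of this shape are what a code generator can emit automatically for each permutation required by a contraction, which is why automation pays off for the hundreds of distinct terms in CCSD(T).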
Parallel and Low-Order Scaling Implementation of Hartree-Fock Exchange Using Local Density Fitting.
Köppl, Christoph; Werner, Hans-Joachim
2016-07-12
Calculations using modern linear-scaling electron-correlation methods are often much faster than the necessary reference Hartree-Fock (HF) calculations. We report a newly implemented HF program that speeds up the most time-consuming step, namely, the evaluation of the exchange contributions to the Fock matrix. Using localized orbitals and their sparsity, local density fitting (LDF), and atomic orbital domains, we demonstrate that the calculation of the exchange matrix scales asymptotically linearly with molecular size. The remaining parts of the HF calculation scale cubically but become dominant only for very large molecular sizes or with many processing cores. The method is well parallelized, and the speedup scales well with up to about 100 CPU cores on multiple compute nodes. The effect of the local approximations on the accuracy of computed HF and local second-order Møller-Plesset perturbation theory energies is systematically investigated, and default values are established for the parameters that determine the domain sizes. Using these values, calculations for molecules with hundreds of atoms in combination with triple-ζ basis sets can be carried out in less than 1 h, with just a few compute nodes. The method can also be used to speed up density functional theory calculations with hybrid functionals that contain HF exchange. PMID:27267488
Three-dimensional morphological analysis method for geologic bodies and its parallel implementation
NASA Astrophysics Data System (ADS)
Mao, Xiancheng; Zhang, Bin; Deng, Hao; Zou, Yanhong; Chen, Jin
2016-11-01
It has been found that the spatial locations and distributions of orebodies, especially for certain hydrothermal mineral deposits, are closely related to the shape of intrusive geologic bodies. For complex and large-scale geologic bodies, however, it is challenging to achieve rigorous and quantitative morphological analysis by standard geological surface reconstruction and trend-surface analysis methods. This paper presents a novel, quantitative morphological analysis method for general geologic bodies represented by closed 2-manifold surfaces, based on mathematical morphology. Through the processes of morphological filtering, set operations and the three-dimensional Euclidean distance transform (3D-EDT), the global trend shape, local convex and concave zones, and the degree of surface undulation of a geologic body are extracted respectively. All three analysis phases are sped up via parallel algorithms implemented using the message passing interface (MPI) standard. The proposed method is tested with a case study of the complexly shaped Xinwuli intrusion in the Fenghuangshan deposit of the Tongling district, China. The results demonstrate that the method is an effective and efficient way to achieve quantitative morphological analysis, thereby decreasing the time necessary to find the association between morphological parameters of geologic bodies and mineralization.
A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors
Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun
2011-01-01
Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116
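The innermost beamsteering operation is a small complex-weighted sum per sample. A hedged sketch for a four-element array follows; the weights, data layout and sizes are assumptions, not the authors' design:

    // One thread per sample n forms y[n] = sum_k conj(w[k]) * x[k][n]; the
    // choice of complex weights w steers beams and nulls of the array pattern.
    #include <cuComplex.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    #define K 4          /* antenna elements */
    #define NS 4096      /* samples per processing block */

    __global__ void beamform(const cuFloatComplex* x,  /* K x NS, element-major */
                             const cuFloatComplex* w,  /* K steering weights */
                             cuFloatComplex* y) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= NS) return;
        cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
        for (int k = 0; k < K; ++k)
            acc = cuCaddf(acc, cuCmulf(cuConjf(w[k]), x[k * NS + n]));
        y[n] = acc;
    }

    int main() {
        cuFloatComplex *x, *w, *y;
        cudaMallocManaged(&x, K * NS * sizeof(cuFloatComplex));
        cudaMallocManaged(&w, K * sizeof(cuFloatComplex));
        cudaMallocManaged(&y, NS * sizeof(cuFloatComplex));
        for (int i = 0; i < K * NS; ++i) x[i] = make_cuFloatComplex(1.f, 0.f);
        for (int k = 0; k < K; ++k) w[k] = make_cuFloatComplex(0.25f, 0.f); /* boresight */
        beamform<<<(NS + 255) / 256, 256>>>(x, w, y);
        cudaDeviceSynchronize();
        printf("y[0] = %g + %gi\n", cuCrealf(y[0]), cuCimagf(y[0]));
        cudaFree(x); cudaFree(w); cudaFree(y);
        return 0;
    }

In an adaptive receiver the weights w would be updated continuously from the interference statistics; only the fixed weighted sum is shown here.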
A GPU-computing Approach to Solar Stokes Profile Inversion
NASA Astrophysics Data System (ADS)
Harker, Brian J.; Mighell, Kenneth J.
2012-09-01
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS, employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disk maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel GA with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disk vector magnetograms derived by this method are shown using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT.
Lemanski, W.J.
1989-03-01
The success of the Strategic Defense Initiative depends directly on significant advances in both computer hardware and software development technologies. Parallel architectures and the Ada programming language have advantages that make them candidates for use in SDI command and control computer systems. This thesis examines those advantages in the context of an SDI-type application: implementation of a Kalman-filter tracking system. This research consists of three parts. The first is a set of software engineering guidelines developed for use in creating parallel designs suitable for implementation in Ada. These guidelines cover the design process from initial problem analysis to final detailed design. Methods of problem decomposition are discussed, as are language partitioning strategies. Justification is provided for using the Ada task construct for process boundaries, and Ada multitasking design issues are reviewed. A parallel software design methodology is also described.
GPU Accelerated Smith-Waterman
Liu, Y; Huang, W; Johnson, J; Vaidya, S
2006-02-08
We present a novel hardware implementation of the double affine Smith-Waterman (DASW) algorithm, which uses dynamic programming to compare and align genomic sequences such as DNA and proteins. We implement DASW on a commodity graphics card, taking advantage of the general purpose programmability of the graphics processing unit to leverage its cheap parallel processing power. The results demonstrate that our system's performance is competitive with current optimized software packages.
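The dependency structure that makes Smith-Waterman GPU-friendly is visible in an anti-diagonal sweep: all cells on one anti-diagonal are independent. The sketch below implements a simplified linear-gap variant (the paper's DASW uses a double affine gap model) with one kernel launch per diagonal:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstring>

    #define MATCH 2
    #define MISMATCH (-1)
    #define GAP 1

    __global__ void swDiag(const char* a, const char* b, int m, int n, int d, int* H) {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // row index on diagonal d
        int j = d - i;
        if (i > m || j < 1 || j > n) return;
        int w = n + 1;                                      // row stride of H
        int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
        int best = H[(i-1)*w + (j-1)] + s;                  // diagonal move
        if (H[(i-1)*w + j] - GAP > best) best = H[(i-1)*w + j] - GAP;
        if (H[i*w + (j-1)] - GAP > best) best = H[i*w + (j-1)] - GAP;
        H[i*w + j] = best > 0 ? best : 0;                   // local-alignment floor
    }

    int main() {
        const char *ha = "GATTACA", *hb = "GCATGCU";
        int m = (int)strlen(ha), n = (int)strlen(hb);
        char *a, *b; int *H;
        cudaMallocManaged(&a, m); cudaMallocManaged(&b, n);
        cudaMallocManaged(&H, (m+1)*(n+1)*sizeof(int));
        memcpy(a, ha, m); memcpy(b, hb, n);
        memset(H, 0, (m+1)*(n+1)*sizeof(int));
        for (int d = 2; d <= m + n; ++d)                    // sweep anti-diagonals
            swDiag<<<1, 256>>>(a, b, m, n, d, H);
        cudaDeviceSynchronize();
        int best = 0;
        for (int c = 0; c < (m+1)*(n+1); ++c) if (H[c] > best) best = H[c];
        printf("best local alignment score: %d\n", best);
        return 0;
    }

A double affine formulation adds two gap-extension matrices per cell but keeps exactly the same anti-diagonal independence, which is what both this paper and GPU ports of it exploit.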
ERIC Educational Resources Information Center
Smith, Gayle
The Parallel Alternate Curriculum (PAC), a model providing regular content courses in regular classes to secondary students with learning problems, combines basic skill instruction with alternative teaching strategies. PAC, a mainstreaming implementation program, is designed to provide inservice training emphasizing the process of how students…
Real-time optical flow estimation on a GPU for a skied-steered mobile robot
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2016-04-01
Accurate egomotion estimation is required for mobile robot navigation. The egomotion is often estimated using optical flow algorithms. For an accurate estimation of optical flow, most modern algorithms require large memory resources and high processor speed. However, the simple single-board computers that control the motion of a robot usually do not provide such resources. On the other hand, most modern single-board computers are equipped with an embedded GPU that could be used in parallel with a CPU to improve the performance of the optical flow estimation algorithm. This paper presents a new Z-flow algorithm for efficient computation of optical flow using an embedded GPU. The algorithm is based on phase correlation optical flow estimation and provides real-time performance on a low-cost embedded GPU. A layered optical flow model is used. Layer segmentation is performed using a graph-cut algorithm with a time-derivative-based energy function. This approach makes the algorithm both fast and robust in low-light and low-texture conditions. The algorithm's implementation for a Raspberry Pi Model B computer is discussed. For evaluation, the computer was mounted on a Hercules mobile skied-steered robot equipped with a monocular camera. The evaluation was performed using hardware-in-the-loop simulation and experiments with the Hercules mobile robot. The algorithm was also evaluated on the KITTI Optical Flow 2015 dataset. The resulting endpoint error of the optical flow calculated with the developed algorithm was low enough for navigation of the robot along the desired trajectory.
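The phase-correlation core of such an estimator can be sketched with cuFFT: form the normalized cross-power spectrum of two frames, invert it, and read the dominant translation off the correlation peak. Sizes, the test pattern and variable names below are assumptions, not the Z-flow code (compile with -lcufft):

    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>

    #define WIDTH 64
    #define HEIGHT 64

    __global__ void crossPower(const cufftComplex* F1, const cufftComplex* F2,
                               cufftComplex* R, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        // F1 * conj(F2), normalized to unit magnitude
        float re = F1[t].x * F2[t].x + F1[t].y * F2[t].y;
        float im = F1[t].y * F2[t].x - F1[t].x * F2[t].y;
        float mag = sqrtf(re * re + im * im) + 1e-12f;
        R[t].x = re / mag; R[t].y = im / mag;
    }

    int main() {
        int n = WIDTH * HEIGHT;
        cufftComplex *f1, *f2, *R;
        cudaMallocManaged(&f1, n * sizeof(cufftComplex));
        cudaMallocManaged(&f2, n * sizeof(cufftComplex));
        cudaMallocManaged(&R, n * sizeof(cufftComplex));
        for (int t = 0; t < n; ++t) { f1[t].x = (float)rand()/RAND_MAX; f1[t].y = 0.f; }
        for (int y = 0; y < HEIGHT; ++y)        // frame 2 = frame 1 shifted by (3,5)
            for (int x = 0; x < WIDTH; ++x)
                f2[((y + 5) % HEIGHT) * WIDTH + (x + 3) % WIDTH] = f1[y * WIDTH + x];
        cufftHandle plan;
        cufftPlan2d(&plan, HEIGHT, WIDTH, CUFFT_C2C);
        cufftExecC2C(plan, f1, f1, CUFFT_FORWARD);   // in-place spectra
        cufftExecC2C(plan, f2, f2, CUFFT_FORWARD);
        crossPower<<<(n + 255) / 256, 256>>>(f1, f2, R, n);
        cufftExecC2C(plan, R, R, CUFFT_INVERSE);
        cudaDeviceSynchronize();
        int peak = 0;
        for (int t = 1; t < n; ++t) if (R[t].x > R[peak].x) peak = t;
        int px = peak % WIDTH, py = peak / WIDTH;    // peak sits at (-dx,-dy) mod N
        printf("estimated shift: dx=%d dy=%d\n",
               (WIDTH - px) % WIDTH, (HEIGHT - py) % HEIGHT);
        cufftDestroy(plan);
        return 0;
    }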
NASA Astrophysics Data System (ADS)
Fuchs, A.; Androsov, A.; Harig, S.; Hiller, W.; Rakowsky, N.
2012-04-01
Given the danger of devastating tsunamis and the unpredictability of such events, tsunami modelling as part of warning systems remains a contemporary topic. The tsunami group of the Alfred Wegener Institute developed the simulation tool TsunAWI as a contribution to the Early Warning System in Indonesia. Although the scenarios precomputed for this purpose are satisfactory deliverables, the study of further improvements continues. While TsunAWI is governed by the Shallow Water Equations, an extension of the model is based on a nonhydrostatic approach. At the arrival of a tsunami wave in coastal regions with rough bathymetry, the term containing the nonhydrostatic part of pressure, which is neglected in the original hydrostatic model, gains in importance. Taking this term into account, a better approximation of the wave is expected. Differences between hydrostatic and nonhydrostatic model results are contrasted in the standard benchmark problem of a solitary wave runup on a plane beach, with the observation data provided by Titov and Synolakis (1995) serving as reference. The nonhydrostatic approach implies a set of equations similar to the Shallow Water Equations, so the variation of the code can be implemented on top. However, these additional routines introduce several issues that must be addressed. So far, the computations of the model were purely explicit; in the nonhydrostatic version, the determination of an additional unknown and the solution of a large sparse system of linear equations are necessary. The latter constitutes the lion's share of computing time and memory requirements. Since the corresponding matrix is symmetric only in structure and not in values, an iterative Krylov subspace method is used, in particular the restarted Generalized Minimal Residual algorithm GMRES(m). With regard to optimization, we present a comparison of several combinations of sequential and parallel preconditioning techniques with respect to the number of iterations and setup.
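Most of the per-iteration cost of a Krylov method such as GMRES(m) is the sparse matrix-vector product. The abstract does not describe TsunAWI's data structures, so the following is only a generic CUDA sketch of that dominant kernel in CSR storage, one matrix row per thread; all names are illustrative.

```cuda
// spmv_csr.cu -- hedged sketch: the dominant per-iteration cost of GMRES(m)
// is the sparse matrix-vector product y = A*x, shown here for CSR storage
// with one matrix row per thread. Illustrative only.
#include <cuda_runtime.h>

__global__ void spmvCsr(int nRows, const int* rowPtr, const int* colIdx,
                        const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];  // irregular gather: memory-bound
    y[row] = sum;
}
```

Preconditioning changes how often this kernel runs (the iteration count) versus how much setup it requires, which is exactly the trade-off the comparison above targets.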
Treleaven, P.
1989-01-01
This book presents an introduction to object-oriented, functional, and logic parallel computing on which the fifth generation of computer systems will be based. Coverage includes concepts for parallel computing languages, a parallel object-oriented system (DOOM) and its language (POOL), an object-oriented multilevel VLSI simulator using POOL, and implementation of lazy functional languages on parallel architectures.
Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P
2014-10-30
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner, including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work-stealing runtime, and report performance results for long-range HF energy calculations of a large molecule with a high-quality basis set, running on up to 1024 cores of a high-performance cluster.
GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda
2014-01-01
Background: Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties: they are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods: Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results: Speeds over 5.36 Giga Cell Updates Per Second (GCUPS) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which was evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original miRanda.
GPU Based Software Correlators - Perspectives for VLBI2010
NASA Technical Reports Server (NTRS)
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Driven by their historical separation from the CPU and by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems that have entered the area of non-graphics-related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU-based software correlator will be reviewed with respect to energy consumption and cost per GFlop/s.
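A software FX correlator spends most of its time in a simple per-channel operation: multiplying one station's spectrum by the complex conjugate of another's and accumulating over the integration period. The paper does not specify its kernels at this level, so the following is only a generic CUDA sketch of that step; names are illustrative.

```cuda
// xcorr.cu -- hedged sketch of the inner step of an FX software correlator:
// per frequency channel, multiply one station's spectrum by the conjugate of
// another's and accumulate over the integration period. Names illustrative.
#include <cuComplex.h>
#include <cuda_runtime.h>

__global__ void crossMultiplyAccumulate(const cuFloatComplex* specA,
                                        const cuFloatComplex* specB,
                                        cuFloatComplex* acc, int nChan) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nChan) return;
    cuFloatComplex v = cuCmulf(specA[c], cuConjf(specB[c]));
    acc[c] = cuCaddf(acc[c], v);       // integrate this channel over time
}
```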
Singular value decomposition utilizing parallel algorithms on graphical processors
Kotas, Charlotte W; Barhen, Jacob
2011-01-01
One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_{k=1..K} X(k)X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining U, Σ, and V such that A = UΣV^H, where U and V are orthonormal and Σ is a non-negative, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^HA, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step approach which first bidiagonalizes the matrix using Householder transformations.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state of the art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions, while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively more difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPUs), which were originally developed for 3D graphics rendering, have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical
The GENGA code: gravitational encounters in N-body simulations with GPU acceleration
Grimm, Simon L.; Stadel, Joachim G.
2014-11-20
We describe an open source GPU implementation of a hybrid symplectic N-body integrator, GENGA (Gravitational ENcounters with Gpu Acceleration), designed to integrate planet and planetesimal dynamics in the late stage of planet formation and stability analyses of planetary systems. GENGA uses a hybrid symplectic integrator to handle close encounters with very good energy conservation, which is essential in long-term planetary system integration. We extended the second-order hybrid integration scheme to higher orders. The GENGA code supports three simulation modes: integration of up to 2048 massive bodies, integration with up to a million test particles, or parallel integration of a large number of individual planetary systems. We compare the results of GENGA to Mercury and pkdgrav2 in terms of energy conservation and performance, and find that the energy conservation of GENGA is comparable to Mercury and around two orders of magnitude better than pkdgrav2. GENGA runs up to 30 times faster than Mercury and up to 8 times faster than pkdgrav2. GENGA is written in CUDA C and runs on all NVIDIA GPUs with compute capability of at least 2.0.
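The expensive core of any direct N-body scheme is the pairwise gravitational acceleration sum. GENGA's kernels are not described at this level of detail, so the following is only a minimal CUDA sketch of a softened direct sum, one body per thread, in G = 1 units; the names and the softening choice are illustrative.

```cuda
// nbody_acc.cu -- hedged sketch: softened direct-sum gravitational
// accelerations, one body per thread, the kind of kernel at the core of a
// GPU N-body integrator. GENGA's actual kernels (hybrid symplectic
// splitting, close-encounter handling) are more elaborate. G = 1 units.
#include <cuda_runtime.h>

__global__ void accelerations(int n, const double4* posMass, double3* acc,
                              double soft2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double4 pi = posMass[i];
    double3 a = {0.0, 0.0, 0.0};
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        double4 pj = posMass[j];                          // .w holds the mass
        double dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        double r2 = dx * dx + dy * dy + dz * dz + soft2;  // softened |r|^2
        double inv = rsqrt(r2);
        double w = pj.w * inv * inv * inv;                // m_j / |r|^3
        a.x += w * dx; a.y += w * dy; a.z += w * dz;
    }
    acc[i] = a;
}
```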
NASA Astrophysics Data System (ADS)
Mena, Andres; Ferrero, Jose M.; Rodriguez Matas, Jose F.
2015-11-01
Solving the electric activity of the heart poses a big challenge, not only because of the structural complexities inherent to the heart tissue, but also because of the complex electric behaviour of the cardiac cells. The multi-scale nature of the electrophysiology problem makes its numerical solution difficult, requiring temporal and spatial resolutions of 0.1 ms and 0.2 mm, respectively, for accurate simulations, leading to models with millions of degrees of freedom that need to be solved for thousands of time steps. Solution of this problem requires the use of algorithms with a higher level of parallelism on multi-core platforms. In this regard, newer programmable graphics processing units (GPUs) have become a valid alternative due to their tremendous computational horsepower. This paper presents results obtained with a novel electrophysiology simulation software entirely developed in the Compute Unified Device Architecture (CUDA). The software implements fully explicit and semi-implicit solvers for the monodomain model, using operator splitting. Performance is compared with classical multi-core MPI-based solvers operating on dedicated high-performance computer clusters. Results obtained with the GPU-based solver show enormous potential for this technology, with speedups of over 50× for three-dimensional problems.
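Operator splitting separates each monodomain time step into a local reaction update (the ionic model, independent per node) and a diffusion update (a stencil over neighbours), which is what makes the problem map well to one-thread-per-node GPU kernels. The sketch below is a hedged illustration on a uniform 3D grid with a toy cubic reaction term standing in for a real cardiac cell model; it is not the paper's solver, and all parameters are illustrative.

```cuda
// monodomain_split.cu -- hedged sketch of one operator-split explicit step of
// the monodomain model on a uniform 3D grid: a local reaction update followed
// by an explicit diffusion update, one node per thread. A toy cubic reaction
// stands in for a real ionic cell model; not the paper's solver.
#include <cuda_runtime.h>

__global__ void reactionStep(float* v, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float u = v[i];
    float iIon = u * (u - 0.15f) * (u - 1.0f);   // toy cubic ionic current
    v[i] = u - dt * iIon;                        // local ODE step, per node
}

__global__ void diffusionStep(const float* vIn, float* vOut, int nx, int ny,
                              int nz, float dt, float dOverH2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;
    int i = (z * ny + y) * nx + x;
    float lap = vIn[i - 1] + vIn[i + 1] + vIn[i - nx] + vIn[i + nx]
              + vIn[i - nx * ny] + vIn[i + nx * ny] - 6.0f * vIn[i];
    vOut[i] = vIn[i] + dt * dOverH2 * lap;       // explicit 7-point stencil
}
```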
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines.
Teodoro, George; Pan, Tony; Kurc, Tahsin; Kong, Jun; Cooper, Lee; Saltz, Joel
2013-04-01
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50× and 85× with respect to single core CPU executions for morphological reconstruction and euclidean distance transform, respectively.
Modelling Nonlinear Dynamic Textures using Hybrid DWT-DCT and Kernel PCA with GPU
NASA Astrophysics Data System (ADS)
Ghadekar, Premanand Pralhad; Chopade, Nilkanth Bhikaji
2016-06-01
Most real-world dynamic textures are nonlinear, non-stationary, and irregular. Nonlinear motion also has some repetition of motion, but it exhibits high variation, stochasticity, and randomness. Hybrid DWT-DCT and Kernel Principal Component Analysis (KPCA) with YCbCr/YIQ colour coding using the Dynamic Texture Unit (DTU) approach is proposed to model a nonlinear dynamic texture, which provides better results than state-of-the-art methods in terms of PSNR, compression ratio, model coefficients, and model size. Dynamic texture is decomposed into DTUs as they help to extract temporal self-similarity. Hybrid DWT-DCT is used to extract spatial redundancy. YCbCr/YIQ colour encoding is performed to capture chromatic correlation. KPCA is applied to capture nonlinear motion. Further, the proposed algorithm is implemented on a Graphics Processing Unit (GPU), which comprises hundreds of small processors, to decrease time complexity and achieve parallelism.
Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I
2015-01-01
This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048^2 and 4096^2 at interactive and semi-interactive frame rates using an NVIDIA GeForce GTX 970 device. PMID:26737231
Li, Chuan; Petukh, Marharyta; Li, Lin; Alexov, Emil
2013-08-15
Due to the enormous importance of electrostatics in molecular biology, calculating the electrostatic potential and corresponding energies has become a standard computational approach for the study of biomolecules and nano-objects immersed in water and salt phase or other media. However, the electrostatics of large macromolecules and macromolecular complexes, including nano-objects, may not be obtainable via explicit methods, and even the standard continuum electrostatics methods may not be applicable due to high computational time and memory requirements. Here, we report further development of the parallelization scheme reported in our previous work (Li, et al., J. Comput. Chem. 2012, 33, 1960) to include parallelization of the molecular surface and energy calculation components of the algorithm. The parallelization scheme utilizes different approaches such as space domain parallelization, algorithmic parallelization, multithreading, and task scheduling, depending on the quantity being calculated. This allows for efficient use of the computing resources of the corresponding computer cluster. The parallelization scheme is implemented in the popular software DelPhi and results in a severalfold speedup. As a demonstration of the efficiency and capability of this methodology, the electrostatic potential and electric field distributions are calculated for the bovine mitochondrial supercomplex, illustrating their complex topology, which cannot be obtained by modeling the supercomplex components alone.
Beta operations: efficient implementation of a primitive parallel operation. Technical report
Cohn, E.R.; Haddad, R.W.
1986-08-01
The ever-decreasing cost of computer processors has created great interest in multi-processor computers. However, along with the increased power that this parallelism brings comes increased complexity in programming. One approach to lessening this complexity is to provide the programmer with general-purpose parallel primitives that shield the programmer from the structure of the underlying machine. In The Connection Machine, Hillis suggests the beta operation as a parallel primitive for his hypercube-based machine. This paper explores efficient ways to perform this operation on several well-known architectures, including the hypercube, and presents some lower bounds associated with the problem.
NASA Astrophysics Data System (ADS)
van der Werff, H. M. A.; Bakker, W. H.
2014-02-01
A graphics processing unit (GPU) can perform massively parallel computations at relatively low cost. Software interfaces like NVIDIA CUDA allow for General Purpose computing on a GPU (GPGPU). Wrappers of the CUDA libraries for higher-level programming languages such as MATLAB and IDL allow its use in image processing. In this paper, we implement GPGPU in IDL with two distance measures frequently used in image classification, Euclidean distance and spectral angle, and apply these to hyperspectral imagery. First we vary the data volume of a synthetic dataset by changing the number of image pixels, spectral bands and classification endmembers to determine speed-up and to find the smallest data volume that would still benefit from using graphics hardware. Then we process real datasets that are too large to fit in the GPU memory, and study the effect of resulting extra data transfers on computing performance. We show that our GPU algorithms outperform the same algorithms for a central processor unit (CPU), that a significant speed-up can already be obtained on relatively small datasets, and that data transfers in large datasets do not significantly influence performance. Given that no specific knowledge on parallel computing is required for this implementation, remote sensing scientists should now be able to implement and use GPGPU for their data analysis.
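Of the two distance measures, the spectral angle is the less trivial: for each pixel spectrum x and endmember e it computes arccos(x·e / (||x|| ||e||)) over all bands. A minimal CUDA sketch, one pixel per thread, follows; the paper's IDL/CUDA wrapper code is not given in the abstract, so every name here is illustrative.

```cuda
// spectral_angle.cu -- hedged sketch: per-pixel spectral angle between an
// image spectrum and one endmember, acos(x.e / (|x||e|)), one pixel per
// thread. Names and layout are illustrative.
#include <cuda_runtime.h>
#include <math.h>

__global__ void spectralAngle(const float* img,  // nPix x nBands, row-major
                              const float* end,  // endmember, nBands entries
                              float* angle, int nPix, int nBands) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPix) return;
    const float* x = img + (size_t)p * nBands;
    float dot = 0.f, nx = 0.f, ne = 0.f;
    for (int b = 0; b < nBands; ++b) {
        dot += x[b] * end[b];
        nx  += x[b] * x[b];
        ne  += end[b] * end[b];
    }
    float c = dot * rsqrtf(nx * ne + 1e-20f);     // cosine of the angle
    angle[p] = acosf(fminf(fmaxf(c, -1.f), 1.f)); // clamp for safety; radians
}
```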
Miller, R.L.
1998-11-01
A numerically stable, accurate, and robust form of the exponential characteristic (EC) method, used to solve the time-independent linearized Boltzmann Transport Equation, is derived using direct affine coordinate transformations on unstructured meshes of tetrahedra. This quadrature, as well as the linear characteristic (LC) spatial quadrature, is implemented in the transport code, called TETRAN. This code solves multi-group neutral particle transport problems with anisotropic scattering and was parallelized using High Performance Fortran and angular domain decomposition. A new, parallel algorithm for updating the scattering source is introduced. The EC source and inflow flux coefficients are efficiently evaluated using Broyden's root solver, started with special approximations developed here. TETRAN showed robustness, stability and accuracy on a variety of challenging test problems. Parallel speed-up was observed as the number of processors was increased using an IBM SP computer system.
RGBA packing for fast cone beam reconstruction on the GPU
NASA Astrophysics Data System (ADS)
Ino, Fumihiko; Yoshida, Seiji; Hagihara, Kenichi
2009-02-01
This paper presents a fast cone beam reconstruction method accelerated on the graphics processing unit (GPU). We implement the Feldkamp, Davis, and Kress (FDK) algorithm on the OpenGL graphics pipeline, which allows us to exploit the full resources and capabilities available on the GPU. The proposed method differs from previous GPU-based methods in having an RGBA packing scheme capable of directly dealing with projections without rebinning. It also reduces the amount of computation by using a data reuse scheme, which is useful to save the memory bandwidth for this memory-intensive problem. Both schemes contribute to reduce the number of rendering passes, namely the number of kernel invocations on the GPU, realizing fast cone beam reconstruction. We show some experimental results obtained on a desktop PC with an nVIDIA GeForce 8800 GTX card. As a result, the proposed method takes 8.1 seconds to reconstruct a 512^3-voxel volume from 360 512^2-pixel projection images. This execution time is equivalent to a 15.6-fold speedup over a CPU implementation, showing 10% higher performance as compared with a previous OpenGL-based method that requires the single-slice rebinning of projections for acceleration. With respect to non-rebinned data, our method provides approximately three times higher performance than the previous method.
PARALLEL IMPLEMENTATION OF THE TOPAZ OPACITY CODE: ISSUES IN LOAD-BALANCING
Sonnad, V; Iglesias, C A
2008-05-12
The TOPAZ opacity code explicitly includes configuration term structure in the calculation of bound-bound radiative transitions. This approach involves myriad spectral lines and requires the large computational capabilities of parallel processing computers. It is important, however, to make use of these resources efficiently. For example, an increase in the number of processors should yield a comparable reduction in computational time. This proportional 'speedup' indicates that very large problems can be addressed with massively parallel computers. Opacity codes can readily take advantage of parallel architecture since many intermediate calculations are independent. On the other hand, since the different tasks entail significantly disparate computational effort, load-balancing issues emerge so that parallel efficiency does not occur naturally. Several schemes to distribute the labor among processors are discussed.
Implementation of parallel matrix decomposition for NIKE3D on the KSR1 system
Su, Philip S.; Fulton, R.E.; Zacharia, T.
1995-06-01
New massively parallel computer architectures have revolutionized the design of computer algorithms and promise to have significant influence on algorithms for engineering computations. Realistic engineering problems using finite element analysis typically imply excessively large computational requirements. Parallel supercomputers that have the potential for significantly increasing calculation speeds can meet these computational requirements. This report explores the potential for the parallel Cholesky (U^T DU) matrix decomposition algorithm on NIKE3D through actual computations. The examples of two- and three-dimensional nonlinear dynamic finite element problems are presented on the Kendall Square Research (KSR1) multiprocessor system, with 64 processors, at Oak Ridge National Laboratory. The numerical results indicate that the parallel Cholesky (U^T DU) matrix decomposition algorithm is attractive for NIKE3D under multi-processor system environments.
Graphics processing unit (GPU)-based computation of heat conduction in thermally anisotropic solids
NASA Astrophysics Data System (ADS)
Nahas, C. A.; Balasubramaniam, Krishnan; Rajagopal, Prabhu
2013-01-01
Numerical modeling of anisotropic media is a computationally intensive task, since the physical properties differ in different directions, adding complexity to the field problem. Largely used in the aerospace industry because of their lightweight nature, composite materials are a very good example of thermally anisotropic media. With advancements in video gaming technology, parallel processors are much cheaper today, and accessibility to higher-end graphical processing devices has increased dramatically over the past couple of years. Since these massively parallel GPUs are very good at handling floating point arithmetic, they provide a new platform for engineers and scientists to accelerate their numerical models using commodity hardware. In this paper we implement a parallel finite difference model of thermal diffusion through anisotropic media using NVIDIA CUDA (Compute Unified Device Architecture). We use the NVIDIA GeForce GTX 560 Ti as our primary computing device, which consists of 384 CUDA cores clocked at 1645 MHz, with a standard desktop PC as the host platform. We compare the results against a standard CPU implementation for accuracy and speed and draw implications for simulation using the GPU paradigm.
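For constant conductivities, the anisotropic heat equation adds one ingredient to the usual explicit stencil: a mixed-derivative term weighted by the off-diagonal tensor entry kxy. The following CUDA kernel is a hedged 2D illustration of such an update, one grid point per thread; it is not necessarily the paper's exact scheme, and the names and tensor layout are assumptions.

```cuda
// aniso_heat.cu -- hedged sketch: explicit finite-difference step for 2D heat
// conduction with a constant anisotropic conductivity tensor (kxx, kxy, kyy),
// one grid point per thread. The mixed-derivative term is what anisotropy
// adds over the isotropic stencil. Names and layout are assumptions.
#include <cuda_runtime.h>

__global__ void heatStep(const float* T, float* Tn, int nx, int ny,
                         float kxx, float kxy, float kyy, float dt, float h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    int i = y * nx + x;
    float txx = T[i - 1] - 2.f * T[i] + T[i + 1];        // h^2 * d2T/dx2
    float tyy = T[i - nx] - 2.f * T[i] + T[i + nx];      // h^2 * d2T/dy2
    float txy = (T[i + nx + 1] - T[i + nx - 1]           // h^2 * d2T/dxdy
               - T[i - nx + 1] + T[i - nx - 1]) * 0.25f;
    Tn[i] = T[i] + dt / (h * h) * (kxx * txx + 2.f * kxy * txy + kyy * tyy);
}
```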
Coarse-grained and fine-grained parallel optimization for real-time en-face OCT imaging
NASA Astrophysics Data System (ADS)
Kapinchev, Konstantin; Bradu, Adrian; Barnes, Frederick; Podoleanu, Adrian
2016-03-01
This paper presents parallel optimizations of the en-face (C-scan) optical coherence tomography (OCT) display. Compared with cross-sectional (B-scan) imagery, the production of en-face images is more computationally demanding due to the increased size of the data handled by the digital signal processing (DSP) algorithms. A sequential implementation of the DSP limits the number of en-face images that can be generated in real time. There are OCT applications where simultaneous production of a large number of en-face images from multiple depths is required, such as real-time diagnostics and monitoring of surgery and ablation. In sequential computing, this requirement leads to a significant increase in the time needed to process the data and generate the images; the processing time then exceeds the acquisition time and image generation is no longer real-time. In these cases, not producing en-face images in real time makes the OCT system ineffective. Parallel optimization of the DSP algorithms provides a solution to this problem. Coarse-grained central processing unit (CPU) based and fine-grained graphics processing unit (GPU) based parallel implementations of the conventional Fourier domain (CFD) OCT method and the Master-Slave Interferometry (MSI) OCT method are studied. In the coarse-grained CPU implementation, each parallel thread processes a whole OCT frame and generates a single en-face image. The corresponding fine-grained GPU implementation launches one parallel thread for every data point of the OCT frame and thus achieves maximum parallelism. The performance and scalability of the CPU-based and GPU-based parallel approaches are analyzed and compared. The quality and resolution of the images generated by the CFD method and the MSI method are also discussed and compared.
Implementation of a parallel unstructured Euler solver on the CM-5
NASA Technical Reports Server (NTRS)
Morano, Eric; Mavriplis, D. J.
1995-01-01
An efficient unstructured 3D Euler solver is parallelized on a Thinking Machine Corporation Connection Machine 5, distributed memory computer with vectoring capability. In this paper, the single instruction multiple data (SIMD) strategy is employed through the use of the CM Fortran language and the CMSSL scientific library. The performance of the CMSSL mesh partitioner is evaluated and the overall efficiency of the parallel flow solver is discussed.
Li, Chuan; Li, Lin; Zhang, Jie; Alexov, Emil
2012-01-01
The Gauss-Seidel method is a standard iterative numerical method widely used to solve systems of equations and, in general, is more efficient than other iterative methods, such as the Jacobi method. However, the standard implementation of the Gauss-Seidel method restricts its utilization in parallel computing due to its requirement of using updated neighboring values (i.e., in the current iteration) as soon as they are available. Here we report an efficient and exact (not requiring assumptions) method to parallelize iterations and to reduce the computational time as a linear/nearly linear function of the number of CPUs. In contrast to other existing solutions, our method does not require any assumptions and is equally applicable for solving linear and nonlinear equations. This approach is implemented in the DelPhi program, which is a finite-difference Poisson-Boltzmann equation solver to model electrostatics in molecular biology. This development makes the iterative procedure for obtaining the electrostatic potential distribution in the parallelized DelPhi severalfold faster than in the serial code. Further, we demonstrate the advantages of the new parallelized DelPhi by computing the electrostatic potential and the corresponding energies of large supramolecular structures. PMID:22674480
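For contrast with the exact scheme reported above: the textbook way to expose parallelism in Gauss-Seidel is red-black (checkerboard) ordering, which relaxes all points of one colour simultaneously because each depends only on points of the other colour. This is a different technique from the DelPhi parallelization, which requires no reordering; the CUDA sketch below shows the red-black variant for a 2D Poisson problem, with illustrative names.

```cuda
// rb_gauss_seidel.cu -- hedged sketch, for contrast with the exact scheme
// described above: red-black (checkerboard) Gauss-Seidel for a 2D Poisson
// problem. Splitting points into two colours lets all points of one colour
// update in parallel; this is the textbook technique, not DelPhi's method.
#include <cuda_runtime.h>

__global__ void gsColourSweep(float* phi, const float* rhs, int nx, int ny,
                              float h2, int colour) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;
    if (((x + y) & 1) != colour) return;   // touch only one colour per sweep
    int i = y * nx + x;
    phi[i] = 0.25f * (phi[i - 1] + phi[i + 1] + phi[i - nx] + phi[i + nx]
                      - h2 * rhs[i]);
}
// One iteration = sweep colour 0 (red), then colour 1 (black).
```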
NASA Astrophysics Data System (ADS)
Kulshreshtha, Kshitij; Nataraj, Neela
2005-08-01
The paper deals with a parallel implementation of a mixed finite element method for the approximation of eigenvalues and eigenvectors of fourth-order eigenvalue problems with variable/constant coefficients. The implementation has been done on a Silicon Graphics Origin 3800, a four-processor Intel Xeon symmetric multiprocessor, and a Beowulf cluster of four Intel Pentium III PCs. The generalised eigenvalue problem obtained after discretization using the mixed finite element method is solved using the package LANSO. The numerical results obtained are compared with existing results (where available). Time and speedup comparisons in different environments are also given for some examples of practical and research interest.
The implementation of an aeronautical CFD flow code onto distributed memory parallel systems
NASA Astrophysics Data System (ADS)
Ierotheou, C. S.; Forsey, C. R.; Leatham, M.
2000-04-01
The parallelization of an industrially important in-house computational fluid dynamics (CFD) code for calculating the airflow over complex aircraft configurations using the Euler or Navier-Stokes equations is presented. The code discussed is the flow solver module of the SAUNA CFD suite. This suite uses a novel grid system that may include block-structured hexahedral or pyramidal grids, unstructured tetrahedral grids or a hybrid combination of both. To assist in the rapid convergence to a solution, a number of convergence acceleration techniques are employed including implicit residual smoothing and a multigrid full approximation storage scheme (FAS). Key features of the parallelization approach are the use of domain decomposition and encapsulated message passing to enable the execution in parallel using a single programme multiple data (SPMD) paradigm. In the case where a hybrid grid is used, a unified grid partitioning scheme is employed to define the decomposition of the mesh. The parallel code has been tested using both structured and hybrid grids on a number of different distributed memory parallel systems and is now routinely used to perform industrial scale aeronautical simulations.
NASA Astrophysics Data System (ADS)
Plaza, Antonio; Chang, Chein-I.; Plaza, Javier; Valencia, David
2006-05-01
The incorporation of hyperspectral sensors aboard airborne/satellite platforms is currently producing a nearly continual stream of multidimensional image data, and this high data volume has introduced new processing challenges. The price paid for the wealth of spatial and spectral information available from hyperspectral sensors is the enormous amount of data that they generate. Several applications exist, however, where having the desired information calculated quickly enough for practical use is highly desirable. High computing performance of algorithm analysis is particularly important in homeland defense and security applications, in which swift decisions often involve detection of (sub-pixel) military targets (including hostile weaponry, camouflage, concealment, and decoys) or chemical/biological agents. In order to speed up the computational performance of hyperspectral imaging algorithms, this paper develops several fast parallel data processing techniques in four classes of algorithms: (1) unsupervised classification, (2) spectral unmixing, (3) automatic target recognition, and (4) onboard data compression. A massively parallel Beowulf cluster (Thunderhead) at NASA's Goddard Space Flight Center in Maryland is used to measure the parallel performance of the proposed algorithms. In order to explore the viability of developing onboard, real-time hyperspectral data compression algorithms, a Xilinx Virtex-II field programmable gate array (FPGA) is also used in the experiments. Our quantitative and comparative assessment of parallel techniques and strategies may help image analysts in the selection of parallel hyperspectral algorithms for specific applications.
New GPU optimizations for intensity-based registration
NASA Astrophysics Data System (ADS)
Yousfi, Razik; Bousquet, Guillaume; Chefd'hotel, Christophe
2009-02-01
The task of registering 3D medical images is very computationally expensive. With CPU-based implementations of registration algorithms it is typical to use various approximations, such as subsampling, to maintain reasonable computation times. This may however result in suboptimal alignments. With the constant increase of capabilities and performances of GPUs (Graphics Processing Unit), these highly vectorized processors have become a viable alternative to CPUs for image related computation tasks. This paper describes new strategies to implement on GPU the computation of image similarity metrics for intensity-based registration, using in particular the latest features of NVIDIA's GeForce 8 architecture and the Cg language. Our experimental results show that the computations are many times faster. In this paper, several GPU implementations of two image similarity criteria for both intramodal and multi-modal registration have been compared. In particular, we propose a new efficient and flexible solution based on the geometry shader.
Parallel Molecular Dynamics simulation: implementation of PVM for a lipid membrane
NASA Astrophysics Data System (ADS)
Fang, Zhiwu; Haymet, A. D. J.; Shinoda, Wataru; Okazaki, Susumu
1999-02-01
This paper describes a parallel algorithm for Molecular Dynamics simulation of a lipid membrane using the isothermal-isobaric ensemble. A message-passing paradigm is adopted for interprocessor communications using PVM3 (Parallel Virtual Machine). A data decomposition technique is employed for the parallelization of the calculation of intermolecular forces. The algorithm has been tested both on a distributed memory architecture (DEC Alpha 500 workstation clusters) and a shared memory architecture (SGI Power Challenge with 20 R10000 processors) for a dipalmitoylphosphatidylcholine (DPPC) lipid bilayer consisting of 32 DPPC molecules and 928 water molecules. For each architecture, we measure the execution time under average workload and the optimal number of processors for the current simulation. Some dynamical quantities are presented for a 2 ns simulation obtained with 5 processors on DEC Alpha 500 workstations. Our results show that the code is extremely efficient on 5-8 processors, and a useful addition to other major computational resources.
A GPU accelerated PDF transparency engine
NASA Astrophysics Data System (ADS)
Recker, John; Lin, I.-Jong; Tastl, Ingeborg
2011-01-01
As commercial printing presses become faster, cheaper and more efficient, so too must the Raster Image Processors (RIP) that prepare data for them to print. Digital press RIPs, however, have been challenged to on the one hand meet the ever increasing print performance of the latest digital presses, and on the other hand process increasingly complex documents with transparent layers and embedded ICC profiles. This paper explores the challenges encountered when implementing a GPU accelerated driver for the open source Ghostscript Adobe PostScript and PDF language interpreter targeted at accelerating PDF transparency for high speed commercial presses. It further describes our solution, including an image memory manager for tiling input and output images and documents, a PDF compatible multiple image layer blending engine, and a GPU accelerated ICC v4 compatible color transformation engine. The result, we believe, is the foundation for a scalable, efficient, distributed RIP system that can meet current and future RIP requirements for a wide range of commercial digital presses.
Synthetic aperture elastography: a GPU based approach
NASA Astrophysics Data System (ADS)
Verma, Prashant; Doyley, Marvin M.
2014-03-01
Synthetic aperture (SA) ultrasound imaging produces highly accurate axial and lateral displacement estimates; however, low frame rates and large data volumes can hamper its clinical use. This paper describes a real-time SA imaging based ultrasound elastography system that we have recently developed to overcome this limitation. In this system, we implemented both beamforming and 2D cross-correlation echo tracking on an Nvidia GTX 480 graphics processing unit (GPU). We used one thread per pixel for beamforming, whereas one block per pixel was used for echo tracking. We compared the quality of elastograms computed with our real-time system relative to those computed using our standard single-threaded elastographic imaging methodology. In all studies, we used conventional measures of image quality such as the elastographic signal-to-noise ratio (SNRe). Specifically, the SNRe of axial and lateral strain elastograms computed with the real-time system were 36 dB and 23 dB, respectively, numerically equal to those computed with our standard approach. We achieved a frame rate of 6 frames per second with our GPU-based approach for 16 transmits and a kernel size of 60 × 60 pixels, which is 400 times faster than that achieved using our standard protocol.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
Implementation of a 3D mixing layer code on parallel computers
NASA Technical Reports Server (NTRS)
Roe, K.; Thakur, R.; Dang, T.; Bogucz, E.
1995-01-01
This paper summarizes our progress and experience in the development of a computational fluid dynamics code on parallel computers to simulate three-dimensional spatially developing mixing layers. In this initial study, the three-dimensional time-dependent Euler equations are solved using a finite-volume explicit time-marching algorithm. The code was first programmed in Fortran 77 for sequential computers and then converted for use on parallel computers using the conventional message-passing technique, although we have not yet been able to compile the code with the present version of HPF compilers.
Method, systems, and computer program products for implementing function-parallel network firewall
Fulp, Errin W.; Farley, Ryan J.
2011-10-11
Methods, systems, and computer program products for providing function-parallel firewalls are disclosed. According to one aspect, a function-parallel firewall includes a first firewall node for filtering received packets using a first portion of a rule set including a plurality of rules. The first portion includes less than all of the rules in the rule set. At least one second firewall node filters packets using a second portion of the rule set. The second portion includes at least one rule in the rule set that is not present in the first portion. The first and second portions together include all of the rules in the rule set.
High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms.
Teodoro, George; Pan, Tony; Kurc, Tahsin M; Kong, Jun; Cooper, Lee A D; Podhorszki, Norbert; Klasky, Scott; Saltz, Joel H
2013-05-01
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.
Davis, Marion Kei; Zhang, Xuechen; Jiang, Song
2009-01-01
Collective I/O is a widely used technique to improve I/O performance in parallel computing. It can be implemented as a client-based or server-based scheme. The client-based implementation is more widely adopted in MPI-IO software such as ROMIO because of its independence from the storage system configuration and its greater portability. However, existing implementations of client-side collective I/O do not take into account the actual pattern of file striping over multiple I/O nodes in the storage system. This can cause a significant number of requests for non-sequential data at I/O nodes, substantially degrading I/O performance. Investigating the surprisingly high I/O throughput achieved when there is an accidental match between a particular request pattern and the data striping pattern on the I/O nodes, we reveal the resonance phenomenon as the cause. Exploiting readily available information on data striping from the metadata server in popular file systems such as PVFS2 and Lustre, we design a new collective I/O implementation technique, resonant I/O, that makes resonance a common case. Resonant I/O rearranges requests from multiple MPI processes to transform non-sequential data accesses on I/O nodes into sequential accesses, significantly improving I/O performance without compromising the independence of a client-based implementation. We have implemented our design in ROMIO. Our experimental results show that the scheme can increase I/O throughput for some commonly used parallel I/O benchmarks such as mpi-io-test and ior-mpi-io over the existing implementation of ROMIO by up to 157%, with no scenario demonstrating significantly decreased performance.
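The resonance idea reduces to simple stripe arithmetic: under round-robin striping, the I/O node serving a byte is a pure function of its offset, so a client can regroup and split requests on stripe boundaries to keep each node's accesses sequential. A hedged host-side sketch of that mapping follows; PVFS2/Lustre expose the real stripe parameters through their own interfaces, and the function names here are illustrative.

```cuda
// stripe_map.cu -- hedged host-side sketch of the arithmetic behind resonant
// I/O: under round-robin striping, the I/O node serving a byte is a pure
// function of its offset, so requests can be regrouped per node and split on
// stripe boundaries to keep every node's accesses sequential. Illustrative.
#include <stddef.h>

// I/O node that stores the byte at 'offset' under round-robin striping.
static inline int ioNodeOf(size_t offset, size_t stripeSize, int nNodes) {
    return (int)((offset / stripeSize) % (size_t)nNodes);
}

// First stripe boundary strictly after 'offset'; splitting requests at these
// boundaries ensures each fragment touches exactly one I/O node.
static inline size_t nextStripeBoundary(size_t offset, size_t stripeSize) {
    return (offset / stripeSize + 1) * stripeSize;
}
```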
An SDR-Based Real-Time Testbed for GNSS Adaptive Array Anti-Jamming Algorithms Accelerated by GPU.
Xu, Hailong; Cui, Xiaowei; Lu, Mingquan
2016-01-01
Nowadays, software-defined radio (SDR) has become a common approach to evaluating new algorithms. However, in the field of Global Navigation Satellite System (GNSS) adaptive array anti-jamming, previous work has been limited by the high computational power demanded by adaptive algorithms, and often lacks flexibility and configurability. In this paper, the design and implementation of an SDR-based real-time testbed for GNSS adaptive array anti-jamming accelerated by a Graphics Processing Unit (GPU) are documented. This testbed highlights itself as a feature-rich and extendible platform with great flexibility and configurability, as well as high computational performance. Both Space-Time Adaptive Processing (STAP) and Space-Frequency Adaptive Processing (SFAP) are implemented with a wide range of parameters. Raw data from as many as eight antenna elements can be processed in real-time in either an adaptive nulling or beamforming mode. To fully exploit the parallelism provided by the GPU, a batched programming method is proposed. Tests and experiments are conducted to evaluate both the computational and anti-jamming performance. This platform can be used for research and prototyping, as well as a real product in certain applications. PMID:26978363
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
NASA Technical Reports Server (NTRS)
Eidson, T. M.; Erlebacher, G.
1994-01-01
While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem, a periodic tridiagonal solver, are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamics simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation, and more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy has the algorithm pass the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm is shown that directly transforms a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
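The multiple-RHS parallelism described above maps naturally onto one GPU thread per right-hand side. The CUDA sketch below shows the idea for the simpler non-periodic case (the periodic solver adds a correction step); the interleaved layout d[i*nrhs + j] is an assumption chosen so that neighboring threads make coalesced memory accesses, not necessarily the paper's layout:

    // One thread per right-hand side, each running the Thomas algorithm
    // on a shared tridiagonal system (a, b, c).
    __global__ void thomas_multi_rhs(const double *a,  // sub-diagonal, size n
                                     const double *b,  // diagonal, size n
                                     const double *c,  // super-diagonal, size n
                                     double *d,        // RHS/solution, n x nrhs
                                     double *cp,       // scratch, n x nrhs
                                     int n, int nrhs)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= nrhs) return;

        // Forward elimination sweep.
        double beta = b[0];
        cp[j] = c[0] / beta;
        d[j]  = d[j] / beta;
        for (int i = 1; i < n; ++i) {
            beta = b[i] - a[i] * cp[(i - 1) * nrhs + j];
            cp[i * nrhs + j] = c[i] / beta;
            d[i * nrhs + j]  = (d[i * nrhs + j]
                                - a[i] * d[(i - 1) * nrhs + j]) / beta;
        }
        // Back substitution.
        for (int i = n - 2; i >= 0; --i)
            d[i * nrhs + j] -= cp[i * nrhs + j] * d[(i + 1) * nrhs + j];
    }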
GPU-accelerated Chemical Similarity Assessment for Large Scale Databases
Maggioni, Marco; Santambrogio, Marco Domenico; Liang, Jie
2016-01-01
The assessment of chemical similarity between molecules is a basic operation in chemoinformatics, a computational area concerned with the manipulation of chemical structural information. Comparing molecules is the basis for a wide range of applications such as searching chemical databases, training prediction models for virtual screening, or aggregating clusters of similar compounds. However, currently available multimillion-entry databases represent a challenge for conventional chemoinformatics algorithms, raising the need for faster similarity methods. In this paper, we extensively analyze the advantages of using many-core architectures for calculating commonly used chemical similarity coefficients such as Tanimoto, Dice, or Cosine. Our aim is to provide a wide-breadth proof of concept regarding the usefulness of GPU architectures to chemoinformatics, a class of computing problems still largely unexplored on GPUs. In our work, we present a general GPU algorithm for all-to-all chemical comparisons considering both binary fingerprints and floating-point descriptors as molecule representations. Subsequently, we adopt optimization techniques to minimize global memory accesses and to further improve efficiency. We test the proposed algorithm on different experimental setups, a laptop with a low-end GPU and a desktop with a more powerful GPU. In the former case, we obtain a 4-to-6-fold speed-up over a single-core implementation for fingerprints and a 4-to-7-fold speed-up for descriptors. In the latter case, we respectively obtain a 195-to-206-fold speed-up and a 100-to-328-fold speed-up. PMID:27774113
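For binary fingerprints, the similarity kernel reduces to bitwise AND plus population counts. The CUDA sketch below (names and the one-thread-per-pair mapping are illustrative, not the paper's exact code) computes the Tanimoto coefficient T(A,B) = |A AND B| / (|A| + |B| - |A AND B|) using the hardware __popc instruction:

    // All-to-all Tanimoto comparison for fingerprints packed into 32-bit
    // words; popcnt[] holds precomputed per-fingerprint bit counts.
    __global__ void tanimoto_all_pairs(const unsigned *fps, // n x words
                                       const int *popcnt,   // |A| per molecule
                                       float *sim,          // n x n output
                                       int n, int words)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n || j >= n) return;

        int both = 0;
        for (int w = 0; w < words; ++w)
            both += __popc(fps[i * words + w] & fps[j * words + w]);

        int denom = popcnt[i] + popcnt[j] - both;
        sim[i * n + j] = denom ? (float)both / denom : 1.0f;
    }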
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as obtaining the sharp ridges associated with Lagrangian Coherent Structures requires an extensive resampling of the flow field. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution on many-core architectures and relies on fast and accurate evaluations of moment-conserving functions for the mesh-to-particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL, and we report over one order of magnitude improvement over multi-threaded executions in FTLE computations of bluff body flows.
GPU-Accelerated Molecular Modeling Coming Of Age
Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.
2010-01-01
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
Xie, Dexuan; Dash, Ranjan K.; Beard, Daniel A.
2009-01-01
Fast algorithms for simulating mathematical models of coupled blood-tissue transport and metabolism are critical for the analysis of data on transport and reaction in tissues. Here, by combining the method of characteristics with the standard grid discretization technique, a novel algorithm is introduced for solving a general blood-tissue transport and metabolism model governed by a large system of one-dimensional semilinear first order partial differential equations. The key part of the algorithm is to approximate the model as a group of independent ordinary differential equation (ODE) systems such that each ODE system has the same size as the model and can be integrated independently. Thus the method can be easily implemented in parallel on a large scale multiprocessor computer. The accuracy of the algorithm is demonstrated for solving a simple blood-tissue exchange model introduced by Sangren and Sheppard (Bull. Math. Biophys. 15:387–394, 1953), which has an analytical solution. Numerical experiments made on a distributed-memory parallel computer (an HP Linux cluster) and a shared-memory parallel computer (a SGI Origin 2000) demonstrate the parallel efficiency of the algorithm. PMID:20161089
NASA Technical Reports Server (NTRS)
Keppenne, Christian L.; Rienecker, Michele M.; Koblinsky, Chester (Technical Monitor)
2001-01-01
A multivariate ensemble Kalman filter (MvEnKF) implemented on a massively parallel computer architecture has been developed for the Poseidon ocean circulation model and tested with a Pacific Basin model configuration. There are about two million prognostic state-vector variables. Parallelism for the data assimilation step is achieved by regionalization of the background-error covariances that are calculated from the phase-space distribution of the ensemble. Each processing element (PE) collects elements of a matrix measurement functional from nearby PEs. To avoid the introduction of spurious long-range covariances associated with finite ensemble sizes, the background-error covariances are given compact support by means of a Hadamard (element-by-element) product with a three-dimensional canonical correlation function. The methodology and the MvEnKF configuration are discussed. It is shown that the regionalization of the background covariances has a negligible impact on the quality of the analyses. The parallel algorithm is very efficient for large numbers of observations but does not scale well beyond 100 PEs at the current model resolution. On a platform with distributed memory, memory rather than speed is the limiting factor.
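The Hadamard localization step is a purely element-wise operation and therefore trivially parallel. A minimal CUDA sketch, assuming a precomputed matrix of inter-point distances and an illustrative compact-support taper (not necessarily the canonical correlation function used in the paper):

    // Simple compact-support taper: 1 at r = 0, smoothly 0 beyond distance L.
    __device__ float taper(float r, float L) {
        float x = r / L;
        return (x < 1.0f) ? (1.0f - x) * (1.0f - x) * (1.0f + 2.0f * x) : 0.0f;
    }

    // Element-wise (Hadamard) product of the ensemble covariance with the
    // taper: kills spurious long-range covariances from small ensembles.
    __global__ void localize_cov(float *cov, const float *dist,
                                 int n, float L)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n * n)
            cov[idx] *= taper(dist[idx], L);
    }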
Massively parallel neural encoding and decoding of visual stimuli.
Lazar, Aurel A; Zhou, Yiyin
2012-08-01
The massively parallel nature of video Time Encoding Machines (TEMs) calls for scalable, massively parallel decoders that are implemented with neural components. The current generation of decoding algorithms is based on computing the pseudo-inverse of a matrix and does not satisfy these requirements. Here we consider video TEMs with an architecture built using Gabor receptive fields and a population of Integrate-and-Fire neurons. We show how to build a scalable architecture for video Time Decoding Machines using recurrent neural networks. Furthermore, we extend our architecture to handle the reconstruction of visual stimuli encoded with massively parallel video TEMs having neurons with random thresholds. Finally, we discuss in detail our algorithms and demonstrate their scalability and performance on a large scale GPU cluster. PMID:22397951
Harnessing your GPU for interactive immersive oceanographic modeling
NASA Astrophysics Data System (ADS)
Hermann, A. J.; Moore, C. W.
2011-12-01
We report on recent success using GPU for interactive Lagrangian (fish) and Eulerian (tsunami) modeling of marine systems. Lagrangian analyses based on numerical float tracks are a fundamental tool in hydrodynamic and marine biological modeling. In particular, spatially-explicit individual-based models (IBMs) can be used to explore how changes in coastal circulation affect fish recruitment, and 3D viewing of the results leads to new insights regarding the effects of behavior on spatial paths. One limit to the usefulness of this modeling approach is the latency between submission of a run and examination of the results, especially when a large (i.e. statistically meaningful) number of individuals are being tracked through finely resolved current and scalar fields. Since float tracking is an inherently parallel problem, the hundreds of cores available on modern graphics cards (GPUs) can readily increase the performance of suitably adapted code by two orders of magnitude at low cost. This offers a way forward to achieve interactive submission/examination of IBMs (and float tracks in general), even on a laptop computer. Latency is an even larger issue in tsunami forecasting, where there is a need to run simple deep-ocean shallow water wave models in real time, particularly when tsunamigenic earthquakes occur outside known fault zones. This problem, too, lends itself to dramatic speedup via GPU, given a suitable parallel algorithm for the shallow water solver. Here we demonstrate successful model speedup using GPU-adapted code for: 1) a spatially explicit IBM prototype, based on pre-stored circulation model output for the Bering Sea; 2) real-time runs of tsunami propagation. In both cases, results will be presented using the stereo-immersive capabilities of the graphics card, for 3D animation.
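Float tracking parallelizes over particles with no inter-thread communication, which is why the speedups are so large. A minimal CUDA sketch, assuming forward-Euler stepping and a placeholder velocity sampler (a real IBM would interpolate pre-stored circulation-model output here):

    // Placeholder velocity field (solid-body rotation), standing in for
    // interpolation from stored model currents.
    __device__ float2 sample_uv(float x, float y) {
        return make_float2(-y, x);
    }

    // One thread per particle; each advects its particle independently.
    __global__ void advect(float *px, float *py, int n, float dt, int nsteps)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = px[i], y = py[i];
        for (int s = 0; s < nsteps; ++s) {
            float2 uv = sample_uv(x, y);
            x += dt * uv.x;
            y += dt * uv.y;
        }
        px[i] = x; py[i] = y;
    }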
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms of data in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
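The first stage of that chain, the spatial filter, is a single dense matrix-matrix multiply and maps directly onto a GPU BLAS call. A hedged sketch using cuBLAS; the matrix names and the filters x channels x samples shapes are assumptions, with column-major storage as cuBLAS expects:

    #include <cublas_v2.h>

    // Spatial filtering as one GEMM: Y = W * X, where W is
    // (filters x channels) and X is (channels x samples), both on device.
    void spatial_filter(cublasHandle_t h,
                        const float *dW, const float *dX, float *dY,
                        int filters, int channels, int samples)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    filters, samples, channels,
                    &alpha, dW, filters, dX, channels,
                    &beta, dY, filters);
    }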
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Complex fluid flow modeling with SPH on GPU
NASA Astrophysics Data System (ADS)
Bilotta, Giuseppe; Hérault, Alexis; Del Negro, Ciro; Russo, Giovanni; Vicari, Annamaria
2010-05-01
We model complex fluid flows using the SPH meshless method. In comparison to other particle methods, SPH provides additional benefits such as the automatic preservation of mass. The direct computation of most physical quantities (e.g. pressure), without resorting to large, sparse implicit systems, makes SPH particularly favorable for implementation on highly parallel computational hardware such as modern video cards. The graphics processing units (GPUs) on modern video cards often surpass the computational power of the CPUs that drive them. The CUDA architecture, introduced by NVIDIA in the spring of 2007, allows generic GPU programming with an extension of the C language, making it easy to write highly parallelized code. Our lava simulation model uses the SPH method with a pure GPU implementation in CUDA to achieve high computational performance, modeling both the dynamic and thermal aspects of a lava flow. The dynamic parts of the SPH algorithms are based on those of the SPHysics simulator, enhanced to include the treatment of non-Newtonian fluids, the integration of thermal effects including temperature-dependent rheological parameters, and optimal handling of large-scale natural topographies. For the non-Newtonian rheologies, priority is given to the power law recently brought to light by physical modeling of lava flows. For the thermal part of the model, the SPH model has been compared with classical finite elements in simulating the solidification of a lava lake, a problem for which an analytical solution is known. The comparison shows the significantly higher accuracy of the SPH method in the proximity of the contact area of two or more solidification fronts.
NASA Astrophysics Data System (ADS)
Baba, Toshitaka; Takahashi, Narumi; Kaneda, Yoshiyuki; Ando, Kazuto; Matsuoka, Daisuke; Kato, Toshihiro
2015-12-01
Because of improvements in offshore tsunami observation technology, dispersion phenomena during tsunami propagation have often been observed in recent tsunamis, for example the 2004 Indian Ocean and 2011 Tohoku tsunamis. The dispersive propagation of tsunamis can be simulated with the Boussinesq model, but the model demands substantial computational resources. However, rapid progress has been made in parallel computing technology. In this study, we investigated a parallelized approach for dispersive tsunami wave modeling. Our new parallel software solves the nonlinear Boussinesq dispersive equations in spherical coordinates. A variable nested algorithm was used to increase spatial resolution in the target region. The software can also be used to predict tsunami inundation on land. We used the dispersive tsunami model to simulate the tsunami of the 2011 Tohoku earthquake on the K supercomputer. Good agreement was apparent between the dispersive wave model results and the tsunami waveforms observed offshore. The finest bathymetric grid interval was 2/9 arcsec (approx. 5 m) along longitude and latitude lines. Use of this grid simulated tsunami soliton fission near the Sendai coast. Incorporating the three-dimensional shape of buildings and structures led to improved modeling of tsunami inundation.
Lu, Xiaofeng; Song, Li; Shen, Sumin; He, Kang; Yu, Songyu; Ling, Nam
2013-07-17
The Hough Transform has been widely used for straight-line detection in low-definition and still images, but it suffers from long execution times and heavy resource requirements. Field Programmable Gate Arrays (FPGAs) provide a competitive alternative for hardware acceleration with tremendous computing performance. In this paper, we propose a novel parallel Hough Transform (PHT) and an associated FPGA architecture framework for real-time straight-line detection in high-definition videos. A resource-optimized Canny edge detection method with enhanced non-maximum suppression conditions is presented to suppress most possible false edges and obtain more accurate candidate edge pixels for the subsequent accelerated computation. Then, a novel PHT algorithm exploiting spatial angle-level parallelism is proposed to improve computational accuracy by refining the minimum computational step. Moreover, the FPGA-based multi-level pipelined PHT architecture, optimized by spatial parallelism, ensures real-time computation for 1,024 × 768 resolution videos without any off-chip memory consumption. This framework is evaluated on the ALTERA DE2-115 FPGA evaluation platform at a maximum frequency of 200 MHz, and it can calculate straight-line parameters in 15.59 ms on average for one frame. Qualitative and quantitative evaluation results have validated the system performance regarding data throughput, memory bandwidth, resources, speed, and robustness.
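Although the paper targets an FPGA, the angle-level parallelism it describes translates directly to other parallel hardware. A hedged CUDA illustration of the same voting scheme, with one block per edge pixel and one thread per discretized angle; all sizes and names here are assumptions, not the paper's design:

    // Each thread votes one (edge pixel, angle) pair into the (theta, rho)
    // accumulator; the angle dimension is the unit of parallelism.
    __global__ void hough_vote(const int *ex, const int *ey, int n_edges,
                               int *acc, int n_theta, int n_rho, float rho_max)
    {
        int e = blockIdx.x;      // one block per candidate edge pixel
        int t = threadIdx.x;     // one thread per discretized angle
        if (e >= n_edges || t >= n_theta) return;

        float theta = t * 3.14159265f / n_theta;
        float rho = ex[e] * cosf(theta) + ey[e] * sinf(theta);
        int r = (int)((rho + rho_max) * (n_rho - 1) / (2.0f * rho_max));
        if (r >= 0 && r < n_rho)
            atomicAdd(&acc[t * n_rho + r], 1);   // vote in parameter space
    }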
A GPU-based framework for simulation of medical ultrasound
NASA Astrophysics Data System (ADS)
Kutter, Oliver; Karamalis, Athanasios; Wein, Wolfgang; Navab, Nassir
2009-02-01
Simulation of ultrasound (US) images from volumetric medical image data has been shown to be an important tool in medical image analysis. However, there is a trade-off between the accuracy of the simulation and its real-time performance. In this paper, we present a framework for acceleration of ultrasound simulation on the graphics processing unit (GPU) of commodity computer hardware. Our framework can accommodate ultrasound modeling with varying degrees of complexity. To demonstrate the flexibility of our proposed method, we have implemented several models of acoustic propagation through 3D volumes. We conducted multiple experiments to evaluate the performance of our method for its application in multi-modal image registration and training. The results demonstrate the high performance of the GPU accelerated simulation, outperforming CPU implementations by up to two orders of magnitude, and encourage the investigation of even more realistic acoustic models.
Implementation of BT, SP, LU, and FT of NAS Parallel Benchmarks in Java
NASA Technical Reports Server (NTRS)
Schultz, Matthew; Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2000-01-01
A number of Java features make it an attractive but debatable choice for High Performance Computing. We have implemented the benchmarks BT, SP, LU, and FT, which operate on a single structured grid, in Java. The performance and scalability of the Java code show that significant improvements in Java compiler technology and in Java thread implementations are necessary for Java to compete with Fortran in HPC applications.
Ultra-fast hybrid CPU-GPU multiple scatter simulation for 3-D PET.
Kim, Kyung Sang; Son, Young Don; Cho, Zang Hee; Ra, Jong Beom; Ye, Jong Chul
2014-01-01
Scatter correction is very important in 3-D PET reconstruction due to a large scatter contribution in the measurements. Currently, one of the most popular methods is the so-called single scatter simulation (SSS), which considers single Compton scattering contributions from many randomly distributed scatter points. The SSS enables a fast calculation of scattering with relatively high accuracy; however, the accuracy of SSS depends on the accuracy of tail fitting to find a correct scaling factor, which is often difficult in low photon count measurements. To overcome this drawback, as well as to improve the accuracy of scatter estimation by incorporating multiple scattering contributions, we propose a multiple scatter simulation (MSS) based on a simplified Monte Carlo (MC) simulation that considers photon migration and interactions due to photoelectric absorption and Compton scattering. Unlike the SSS, the MSS calculates a scaling factor by comparing simulated prompt data with the measured data in the whole volume, which enables a more robust estimation of the scaling factor. Even though the proposed MSS is based on MC, a significant acceleration of the computation is possible by using a virtual detector array with a larger pitch, exploiting the fact that the scatter distribution varies slowly in the spatial domain. Furthermore, our MSS implementation is well suited to a parallel implementation on a graphics processing unit (GPU). In particular, we exploit a hybrid CPU-GPU technique using open multiprocessing (OpenMP) and the compute unified device architecture (CUDA), which runs 128.3 times faster than a single CPU. Overall, the computation time of MSS is 9.4 s for the high-resolution research tomograph (HRRT) system. The performance of the proposed MSS is validated through actual experiments using an HRRT.
Phillips, Carolyn L.; Anderson, Joshua A.; Glotzer, Sharon C.
2011-08-10
Highlights: (1) Molecular dynamics codes implemented on GPUs have achieved two-orders-of-magnitude computational accelerations. (2) Brownian Dynamics and Dissipative Particle Dynamics simulations require a large number of random numbers per time step. (3) We introduce a method for generating small batches of pseudorandom numbers distributed over many threads of calculation. (4) With this method, Dissipative Particle Dynamics is implemented on a GPU device without requiring thread-to-thread communication. Abstract: Brownian Dynamics (BD), also known as Langevin Dynamics, and Dissipative Particle Dynamics (DPD) are implicit solvent methods commonly used in models of soft matter and biomolecular systems. The interaction of the numerous solvent particles with larger particles is coarse-grained as a Langevin thermostat applied to individual particles or to particle pairs. The Langevin thermostat requires a pseudo-random number generator (PRNG) to generate the stochastic force applied to each particle or pair of neighboring particles during each time step in the integration of Newton's equations of motion. In a Single-Instruction-Multiple-Thread (SIMT) GPU parallel computing environment, small batches of random numbers must be generated over thousands of threads and millions of kernel calls. In this communication we introduce a one-PRNG-per-kernel-call-per-thread scheme, in which a micro-stream of pseudorandom numbers is generated in each thread and kernel call. These high-quality, statistically robust micro-streams require no global memory for state storage, are more computationally efficient than other PRNG schemes in memory-bound kernels, and uniquely enable the DPD simulation method without requiring communication between threads.
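The one-PRNG-per-kernel-call-per-thread scheme can be sketched compactly: rather than carrying generator state in global memory, each thread hashes (seed, timestep, thread id) into a fresh micro-stream on every call. The mixing function below is an illustrative integer hash, not the statistically validated generator of the paper, and thermostat_step is a hypothetical name:

    // Stateless mixing of (seed, timestep, thread id) into a pseudorandom word.
    __device__ unsigned mix(unsigned a, unsigned b, unsigned c) {
        a ^= b + 0x9e3779b9u + (a << 6) + (a >> 2);
        a ^= c + 0x9e3779b9u + (a << 6) + (a >> 2);
        a *= 0x85ebca6bu; a ^= a >> 13; a *= 0xc2b2ae35u; a ^= a >> 16;
        return a;
    }

    __global__ void thermostat_step(float *force, int n,
                                    unsigned seed, unsigned timestep)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Fresh micro-stream per (thread, timestep): no global-memory state,
        // no thread-to-thread communication.
        unsigned r = mix(seed, timestep, (unsigned)i);
        float u = (r & 0xffffffu) / 16777216.0f;   // uniform in [0, 1)
        force[i] += (u - 0.5f);                    // toy stochastic kick
    }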
3D Laplace-domain full waveform inversion using a single GPU card
NASA Astrophysics Data System (ADS)
Shin, Jungkyun; Ha, Wansoo; Jun, Hyunggu; Min, Dong-Joo; Shin, Changsoo
2014-06-01
The Laplace-domain full waveform inversion is an efficient long-wavelength velocity estimation method for seismic datasets lacking low-frequency components. However, to invert a 3D velocity model, a large cluster of CPU cores has commonly been required to overcome the extremely long computing time caused by a large impedance matrix and a large number of source positions. In this study, a workstation with a single GPU card (NVIDIA GTX 580) is successfully used for 3D Laplace-domain full waveform inversion rather than a large cluster of CPU cores. To exploit a GPU for our inversion algorithm, the routine for the iterative matrix solver is ported to the CUDA programming language for the forward and backward modeling parts, with minimal modification of the remaining parts, which were originally written in Fortran 90. Using a uniformly structured grid set, nonzero values in the sparse impedance matrix can be arranged according to certain rules, which efficiently parallelizes the preconditioned conjugate gradient method over the many threads of the GPU card. We perform a numerical experiment to verify the accuracy of the floating-point operations performed by the GPU in calculating the Laplace-domain wavefield. We also measure the efficiency of the original CPU and modified GPU programs using a cluster of CPU cores and a workstation with a GPU card, respectively. Through this analysis, the parallelized inversion code for a GPU achieves a speedup of 14.7-24.6x compared to a CPU-based serial code, depending on the degrees of freedom of the impedance matrix. Finally, the practicality of the proposed algorithm is examined by inverting a 3D long-wavelength velocity model using wide-azimuth real datasets in 3.7 days.
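On a uniformly structured grid, "arranging nonzeros according to certain rules" amounts to something like a diagonal (DIA) storage scheme, in which the sparse matrix-vector product inside the conjugate gradient iteration becomes a stencil-style loop over a handful of constant offsets. A hedged CUDA sketch of that product (the layout and names are assumptions, not the paper's code):

    // DIA-format SpMV: one thread per matrix row, ndiag constant offsets
    // (e.g. 7 for a 7-point 3-D stencil), diagonal-major value storage.
    __global__ void dia_spmv(const float *vals,    // ndiag x n values
                             const int *offsets,   // ndiag diagonal offsets
                             const float *x, float *y,
                             int n, int ndiag)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;

        float sum = 0.0f;
        for (int d = 0; d < ndiag; ++d) {
            int col = row + offsets[d];
            if (col >= 0 && col < n)
                sum += vals[d * n + row] * x[col];
        }
        y[row] = sum;
    }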
NASA Astrophysics Data System (ADS)
Chun, Kyungwon; Kim, Huioon; Hong, Hyunpyo; Chung, Youngjoo
GMES, which stands for GIST Maxwell's Equations Solver, is a Python package for Finite-Difference Time-Domain (FDTD) simulation. The FDTD method, widely used for electromagnetic simulations, is an algorithm for solving Maxwell's equations. GMES follows the Object-Oriented Programming (OOP) paradigm for good maintainability and usability. With several optimization techniques along with a parallel computing environment, we achieved a fast and interactive implementation. Execution speed has been tested on a single host and on a Beowulf-class cluster. GMES is open source and available on the web (http://www.sf.net/projects/gmes).
MEGA Node: an implementation of a coarse-grain totally reconfigurable parallel machine
NASA Astrophysics Data System (ADS)
Blum, M. B.; Burrer, Caroline
1990-09-01
The MEGA Node is a loosely coupled, highly parallel computer based on transputers. One of its main characteristics is its ability to change the topology of the network using an electronic switch. It covers a range from 128 to 1024 "worker processors", delivering from 550 to 4400 Mflops peak performance. To achieve these performances, a hierarchical structure has been adopted. This highly parallel machine emerged from the Esprit I P1085 "Supernode" project. The software has to support a wide spectrum of users, from those who wish to obtain maximum performance from the machine to those who wish to use it as a general-purpose multi-user parallel machine. This paper describes the different ways to use the MEGA Node and the software environments provided to satisfy all kinds of users. The Helios environment is a good example of how an operating system can control this machine, particularly the networking management and the fundamental problem of mapping. The MEGA Node has already been used for a wide range of applications such as signal/image processing (high and low level), image synthesis, scientific and engineering number-crunching, neural network simulation, and logic simulation. Only a few of them are discussed in this paper: medical image analysis and vision, and ray tracing. The work done under the Esprit I P1085 project [1], partially funded by the CEC, was related to the development and applications of a low-cost, high-performance multiprocessor machine. The project involved the Royal Signals and Radar Establishment (RSRE, prime contractor), Inmos, Thorn-EMI CRL, the University of Southampton, the University of Liverpool, Aptor, the University of Grenoble (IMAG), and Telmat Informatique. The objectives of the project were to develop a highly parallel architecture based on transputers, along with associated system software and applications. To achieve this, Inmos developed the T800 transputer starting from the T414, and the consortium defined the whole machine architecture.
Highly-Parallel, Highly-Compact Computing Structures Implemented in Nanotechnology
NASA Technical Reports Server (NTRS)
Crawley, D. G.; Duff, M. J. B.; Fountain, T. J.; Moffat, C. D.; Tomlinson, C. D.
1995-01-01
In this paper, we describe work in which we are evaluating how the evolving properties of nano-electronic devices could best be utilized in highly parallel computing structures. Because of their combination of high performance, low power, and extreme compactness, such structures would have obvious applications in spaceborne environments, both for general mission control and for on-board data analysis. However, the anticipated properties of nano-devices mean that the optimum architecture for such systems is by no means certain. Candidates include single instruction multiple datastream (SIMD) arrays, neural networks, and multiple instruction multiple datastream (MIMD) assemblies.
NASA Technical Reports Server (NTRS)
Wigton, Larry
1996-01-01
Improvements to the numerical linear algebra routines for use in new Navier-Stokes codes, specifically Tim Barth's unstructured grid code, with spin-offs to TRANAIR, are reported. A fast distance calculation routine for Navier-Stokes codes using the new one-equation turbulence models was written. The primary focus of this work was improving matrix-iterative methods. New algorithms have been developed which activate the full potential of classical Cray-class computers as well as distributed-memory parallel computers.
Sharma, Diksha; Badal, Andreu; Badano, Aldo
2012-04-21
The computational modeling of medical imaging systems often requires obtaining a large number of simulated images with low statistical uncertainty, which translates into prohibitive computing times. We describe a novel hybrid approach for Monte Carlo simulations that maximizes utilization of CPUs and GPUs in modern workstations. We apply the method to the modeling of indirect x-ray detectors using a new and improved version of the code MANTIS, an open source software tool used for the Monte Carlo simulations of indirect x-ray imagers. We first describe a GPU implementation of the physics and geometry models in fastDETECT2 (the optical transport model) and a serial CPU version of the same code. We discuss its new features like on-the-fly column geometry and columnar crosstalk in relation to the MANTIS code, and point out areas where our model provides more flexibility for the modeling of realistic columnar structures in large area detectors. Second, we modify PENELOPE (the open source software package that handles the x-ray and electron transport in MANTIS) to allow direct output of location and energy deposited during x-ray and electron interactions occurring within the scintillator. This information is then handled by optical transport routines in fastDETECT2. A load balancer dynamically allocates optical transport showers to the GPU and CPU computing cores. Our hybridMANTIS approach achieves a significant speed-up factor of 627 when compared to MANTIS and of 35 when compared to the same code running only on a CPU instead of a GPU. Using hybridMANTIS, we successfully hide hours of optical transport time by running it in parallel with the x-ray and electron transport, thus shifting the computational bottleneck from optical to x-ray transport. The new code requires much less memory than MANTIS and, as a result, allows us to efficiently simulate large area detectors.
NASA Astrophysics Data System (ADS)
Caplan, Ronald Meyer
We numerically study the dynamics and interactions of vortex rings in the nonlinear Schrodinger equation (NLSE). Single ring dynamics for both bright and dark vortex rings are explored including their traverse velocity, stability, and perturbations resulting in quadrupole oscillations. Multi-ring dynamics of dark vortex rings are investigated, including scattering and merging of two colliding rings, leapfrogging interactions of co-traveling rings, as well as co-moving steady-state multi-ring ensembles. Simulations of choreographed multi-ring setups are also performed, leading to intriguing interaction dynamics. Due to the inherent lack of a closed-form solution for vortex rings and the dimensionality in which they live, efficient numerical methods to integrate the NLSE have to be developed in order to perform the extensive number of required simulations. To facilitate this, compact high-order numerical schemes for the spatial derivatives are developed which include a new semi-compact modulus-squared Dirichlet boundary condition. The schemes are combined with a fourth-order Runge-Kutta time-stepping scheme in order to keep the overall method fully explicit. To ensure efficient use of the schemes, a stability analysis is performed to find bounds on the largest usable time step-size as a function of the spatial step-size. The numerical methods are implemented into codes which are run on NVIDIA graphics processing unit (GPU) parallel architectures. The codes running on the GPU are shown to be many times faster than their serial counterparts. The codes are developed with future usability in mind, and therefore are written to interface with MATLAB utilizing custom GPU-enabled C codes with a MEX-compiler interface. Reproducibility of results is achieved by combining the codes into a code package called NLSEmagic which is freely distributed on a dedicated website.
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
Gosink, Luke; Wu, Kesheng; Bethel, E. Wes; Owens, John D.; Joy, Kenneth I.
2009-06-02
The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess
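The record-per-thread evaluation in DP-BIS is simple to sketch. In the hedged CUDA fragment below, interior bins answer a range query outright while boundary bins are flagged for a second pass over the bin-based data clusters; the three-state hit encoding and all names are assumptions, not the paper's code:

    // One thread per record: compare the record's bin number against the
    // bins spanned by a range query (lo, hi).
    __global__ void bin_query(const unsigned char *bin, // bin number per record
                              unsigned char *hit,       // 0/1/2 per record
                              int n, unsigned char lo, unsigned char hi)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned char b = bin[i];
        if (b > lo && b < hi)        hit[i] = 1;  // fully inside: certain hit
        else if (b == lo || b == hi) hit[i] = 2;  // boundary bin: recheck value
        else                         hit[i] = 0;  // certain miss
    }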
Parallel multithread computing for spectroscopic analysis in optical coherence tomography
NASA Astrophysics Data System (ADS)
Trojanowski, Michal; Kraszewski, Maciej; Strakowski, Marcin; Pluciński, Jerzy
2014-05-01
Spectroscopic Optical Coherence Tomography (SOCT) is an extension of Optical Coherence Tomography (OCT). It allows gathering spectroscopic information from individual scattering points inside the sample. It is based on time-frequency analysis of interferometric signals. Such analysis requires calculating hundreds of Fourier transforms while performing a single A-scan. Additionally, further processing of the acquired spectroscopic information is needed, which significantly increases the required computation time. In recent years, the application of graphics processing units (GPUs) has been proposed to reduce computation time in OCT by using parallel computing algorithms. GPU technology can also be used to speed up signal processing in SOCT. However, the parallel algorithms used in classical OCT need to be revised because of the different character of the analyzed data. Classical OCT requires processing of long, independent interferometric signals to obtain subsequent A-scans. The difference with SOCT is that it requires processing of multiple, shorter signals, which differ only in a small subset of samples. We have developed new algorithms for parallel signal processing in SOCT, implemented with NVIDIA CUDA (Compute Unified Device Architecture). We present details of the algorithms and performance tests for analyzing data from an in-house SD-OCT system. We also give a brief discussion of the usefulness of the developed algorithms. The presented algorithms might be useful for researchers working on OCT, as they reduce computation time and are a step toward real-time signal processing of SOCT data.
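The workload SOCT adds over classical OCT, many short overlapping transforms rather than a few long ones, fits cuFFT's batched interface. A minimal sketch, assuming the windowed segments have already been extracted contiguously into device memory (batched_stft is a hypothetical name):

    #include <cufft.h>

    // Batched short transforms: one plan, n_windows transforms of length
    // win_len, executed in a single call.
    void batched_stft(cufftComplex *d_windows, int win_len, int n_windows)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, win_len, CUFFT_C2C, n_windows);
        cufftExecC2C(plan, d_windows, d_windows, CUFFT_FORWARD);
        cufftDestroy(plan);
    }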
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows powerful capability in achieving high-resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability comes at very high computational cost, and the 3-D computations have had to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit cooperates with the graphics processing unit (GPU) to achieve hybrid parallel computing. The computationally expensive procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented on the GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained parallel computing. Experimental results verify that the developed system can perform 50 fps (frames per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained using an NVIDIA GT560Ti graphics card rather than a sequential C implementation on a 3.4 GHz Intel Core i7 3770.
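The phase-computation stage is typical of the fine-grained kernels in such a pipeline. A hedged CUDA sketch assuming standard four-step phase shifting (a common choice; the paper does not fix the step count here), with one thread per pixel:

    // Four captured intensity images with shifts 0, pi/2, pi, 3pi/2:
    // I0 = A + B cos(phi), I1 = A - B sin(phi),
    // I2 = A - B cos(phi), I3 = A + B sin(phi),
    // so the wrapped phase is phi = atan2(I3 - I1, I0 - I2).
    __global__ void phase4(const float *I0, const float *I1,
                           const float *I2, const float *I3,
                           float *phase, int npix)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npix) return;
        phase[i] = atan2f(I3[i] - I1[i], I0[i] - I2[i]);
    }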
SDA 7: A modular and parallel implementation of the simulation of diffusional association software.
Martinez, Michael; Bruce, Neil J; Romanowska, Julia; Kokh, Daria B; Ozboyaci, Musa; Yu, Xiaofeng; Öztürk, Mehmet Ali; Richter, Stefan; Wade, Rebecca C
2015-08-01
The simulation of diffusional association (SDA) Brownian dynamics software package has been widely used in the study of biomacromolecular association. Initially developed to calculate bimolecular protein-protein association rate constants, it has since been extended to study electron transfer rates, to predict the structures of biomacromolecular complexes, to investigate the adsorption of proteins to inorganic surfaces, and to simulate the dynamics of large systems containing many biomacromolecular solutes, allowing the study of concentration-dependent effects. These extensions have led to a number of divergent versions of the software. In this article, we report the development of the latest version of the software (SDA 7). This release was developed to consolidate the existing codes into a single framework, while improving the parallelization of the code to better exploit modern multicore shared memory computer architectures. It is built using a modular object-oriented programming scheme, to allow for easy maintenance and extension of the software, and includes new features, such as adding flexible solute representations. We discuss a number of application examples, which describe some of the methods available in the release, and provide benchmarking data to demonstrate the parallel performance.
Portable implementation of implicit methods for the UEDGE and BOUT codes on parallel computers
Rognlien, T D; Xu, X Q
1999-02-17
A description is given of the parallelization algorithms and results for two codes used extensively to model edge plasmas in magnetic fusion energy devices. The codes are UEDGE, which calculates two-dimensional plasma and neutral gas profiles, and BOUT, which calculates three-dimensional plasma turbulence using experimental or UEDGE profiles. Both codes describe the plasma behavior using fluid equations. A domain decomposition model is used for parallelization by dividing the global spatial simulation region into a set of domains. This approach allows the use of two recently developed LLNL Newton-Krylov numerical solvers, PVODE and KINSOL. Results show an order of magnitude speedup in execution time for the plasma equations with UEDGE. A problem identified for UEDGE is the solution of the fluid gas equations on a highly anisotropic mesh. The speedup of BOUT is closer to two orders of magnitude, especially if one includes the initial improvement from switching to the fully implicit Newton-Krylov solver. The turbulent transport coefficients obtained from BOUT guide the use of anomalous transport models within UEDGE, with the eventual goal of a self-consistent coupling.
NASA Technical Reports Server (NTRS)
Patten, William Neff
1989-01-01
There is an evident need to discover a means of establishing reliable, implementable controls for systems that are plagued by nonlinear and/or uncertain model dynamics. The development of a generic controller design tool for tough-to-control systems is reported. The method utilizes a moving grid, time infinite element based solution of the necessary conditions that describe an optimal controller for a system. The technique produces a discrete feedback controller. Real time laboratory experiments are now being conducted to demonstrate the viability of the method. The algorithm that results is being implemented in a microprocessor environment. Critical computational tasks are accomplished using a low cost, on-board, multiprocessor (INMOS T800 Transputers) and parallel processing. Progress to date validates the methodology presented. Applications of the technique to the control of highly flexible robotic appendages are suggested.
Zhang, Site; Asoubar, Daniel; Hellmann, Christian; Wyrowski, Frank
2016-01-20
The propagation of electromagnetic fields between non-parallel planes based on a spectrum-of-plane-wave analysis is discussed and formulations for an efficient numerical implementation are presented in detail. It is shown that with the help of interpolation techniques, the numerical implementation can be done with the standard uniform fast Fourier transform (FFT) of easy access. Different interpolation techniques are numerically examined, and it turns out that the use of cubic interpolation, together with the uniform FFT, brings both significantly increased computational efficiency and high simulation accuracy. Apart from the aspect of computational efficiency, all formulations in this work are generalized in a fully vectorial manner in comparison to previous works. PMID:26835928
NASA Astrophysics Data System (ADS)
Hara, Tatsuhiko
2004-08-01
We implement the Direct Solution Method (DSM) on a vector-parallel supercomputer and show that it is possible to significantly improve its computational efficiency through parallel computing. We apply the parallel DSM calculation to waveform inversion of long-period (250-500 s) surface wave data for three-dimensional (3-D) S-wave velocity structure in the upper and uppermost lower mantle. We use a spherical harmonic expansion to represent lateral variation up to a maximum angular degree of 16. We find significant low velocities under South Pacific hot spots in the transition zone. This is consistent with other seismological studies conducted in the Superplume project, which suggests deep roots of these hot spots. We also perform simultaneous waveform inversion for 3-D S-wave velocity and Q structure. Since the resolution for Q is not good, we develop a new technique in which power spectra are used as data for the inversion. We find good correlation between the long-wavelength patterns of Vs and Q in the transition zone, such as high Vs and high Q under the western Pacific.
POM.gpu-v1.0: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.
2015-09-01
Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor
NASA Technical Reports Server (NTRS)
Moore, J. Strother
1992-01-01
Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease et al. work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.
Grid-based Parallel Data Streaming Implemented for the Gyrokinetic Toroidal Code
S. Klasky; S. Ethier; Z. Lin; K. Martins; D. McCune; R. Samtaney
2003-09-15
We have developed a threaded parallel data streaming approach using Globus to transfer multi-terabyte simulation data from a remote supercomputer to the scientist's home analysis/visualization cluster, as the simulation executes, with negligible overhead. Data transfer experiments show that this concurrent data transfer approach is more favorable compared with writing to local disk and then transferring this data to be post-processed. The present approach is conducive to using the grid to pipeline the simulation with post-processing and visualization. We have applied this method to the Gyrokinetic Toroidal Code (GTC), a 3-dimensional particle-in-cell code used to study microturbulence in magnetic confinement fusion from first principles plasma theory.
Parallel Implementation of an Adaptive Scheme for 3D Unstructured Grids on the SP2
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak; Strawn, Roger C.
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.0X speedup on 64 processors when 10% of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh.