Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-11-21
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the compute unified device architecture (CUDA) platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to computation time and dose discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar within statistical errors. The GPU-based SMC was 12.30-16.00 times faster than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We therefore focus on general-purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using the Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with a conventional CPU, even for a naive GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
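To make the kernel structure concrete, here is a minimal sketch of one Yee-lattice E-field update (the Ez component only), the kind of stencil that dominates each 3D FDTD time step. The grid layout, the coefficient cb, and all names are illustrative assumptions, not code from the paper:

```cuda
// Minimal sketch: update Ez from the curl of H on a uniform Yee grid.
// cb stands in for dt/(eps*dx) in normalized units (assumed, not from the paper).
#include <cuda_runtime.h>

#define IDX(i, j, k, nx, ny) (((size_t)(k) * (ny) + (j)) * (nx) + (i))

__global__ void update_ez(float* ez, const float* hx, const float* hy,
                          int nx, int ny, int nz, float cb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || i >= nx || j >= ny || k >= nz) return;

    size_t c = IDX(i, j, k, nx, ny);
    // Backward differences approximate (curl H)_z on the staggered lattice.
    float curl = (hy[c] - hy[IDX(i - 1, j, k, nx, ny)])
               - (hx[c] - hx[IDX(i, j - 1, k, nx, ny)]);
    ez[c] += cb * curl;
}
```

The thread-block shape determines how the neighboring loads coalesce, which is consistent with the abstract's observation that the GPU/CPU speed ratio varies with thread block size.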
Xiao, Kai; Chen, Danny Z; Hu, X Sharon; Zhou, Bo
2012-12-01
The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution/superposition (C/S) dose calculation approaches. However, porting existing C/S dose calculation methods onto graphics processing units (GPUs) poses challenges to retaining the efficiency of this algorithm. In particular, a straightforward implementation of the original 3D-DDA algorithm introduces substantial branch divergence, which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves the performance of C/S dose calculation programs running on GPU. The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) that are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution/superposition (CCCS) dose calculation program. The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42-2.67 times faster than the one based on the original 3D-DDA implementation, without losing any accuracy. The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
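As a rough illustration of the conversion described above, the sketch below (hypothetical names, not the authors' code) replaces the axis-selection branches of one 3D-DDA traversal step with comparison masks and predicated arithmetic, so all threads in a warp execute the same instruction stream:

```cuda
// One 3D-DDA traversal step without if/else: the axis with the smallest
// tMax is selected via 0/1 comparison masks instead of divergent branches.
#include <cuda_runtime.h>

__device__ void dda_step(float3& tMax, const float3 tDelta,
                         int3& voxel, const int3 step)
{
    int x = (tMax.x <= tMax.y) & (tMax.x <= tMax.z);  // x-axis is next
    int y = (!x) & (tMax.y <= tMax.z);                // else y-axis
    int z = (!x) & (!y);                              // else z-axis

    voxel.x += x * step.x;
    voxel.y += y * step.y;
    voxel.z += z * step.z;

    tMax.x += x * tDelta.x;
    tMax.y += y * tDelta.y;
    tMax.z += z * tDelta.z;
}
```

Exactly one of the three masks equals 1 per step, so the update is equivalent to the conventional nested conditionals.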
GPU-accelerated Monte Carlo convolution/superposition implementation for dose calculation.
Zhou, Bo; Yu, Cedric X; Chen, Danny Z; Hu, X Sharon
2010-11-01
Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies exert ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. To fully utilize the GPU computing power, the MCCS algorithm was modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an NVIDIA GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The statistical fluctuation of less than 2% also indicates that the accuracy of the GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. This work shows that the GPU is a feasible and cost-efficient solution, compared to alternatives such as cluster machines or field-programmable gate arrays, for satisfying the increasing demands on computation speed and accuracy of dose calculation. There are, however, inherent limitations to using GPUs to accelerate MC-type applications, which are also analyzed in detail in this article.
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy
NASA Astrophysics Data System (ADS)
Zhu, Changsheng; Liu, Jieqiong; Zhu, Mingfang; Feng, Li
2018-03-01
In the simulation of dendritic growth, computational efficiency and problem scale strongly determine how useful a three-dimensional phase-field model is in practice. Thus, a high-performance calculation method that improves computational efficiency and expands the problem scale is of great significance to research on the microstructure of materials. A high-performance calculation method based on an MPI+CUDA hybrid programming model is introduced. Multiple GPUs are used to implement quantitative numerical simulations of a three-dimensional phase-field model in a binary alloy under the condition of coupled multi-physical processes. The acceleration effect of different numbers of GPU nodes on different calculation scales is explored. On the foundation of this multi-GPU calculation model, two optimization schemes are proposed: non-blocking communication optimization and overlap of MPI communication with GPU computing. The results of the two optimization schemes and the basic multi-GPU model are compared. The calculation results show that the multi-GPU calculation model clearly improves the computational efficiency of the three-dimensional phase-field model, achieving a 13-fold speedup over a single GPU, and the problem scale has been expanded to 819³. Both optimization schemes are shown to be feasible, and the overlap of MPI communication with GPU computing performs better, running 1.7 times faster than the basic multi-GPU model when 21 GPUs are used.
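The second optimization, overlapping MPI communication with GPU computing, typically takes the following shape: interior cells are updated while boundary planes travel between ranks. All names, the single-plane halo, and the stream layout below are simplifying assumptions for illustration, not the authors' code:

```cuda
// Sketch: overlap non-blocking MPI halo exchange with interior GPU work.
// Only one boundary plane is exchanged for brevity; h_send/h_recv are
// assumed to be pinned host buffers so the async copies can overlap.
// update_interior/update_boundary are assumed to be defined elsewhere.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void update_interior(float* phi, int nx, int ny, int nz);
__global__ void update_boundary(float* phi, const float* halo, int nx, int ny, int nz);

void timestep(float* d_phi, float* d_halo, float* h_send, float* h_recv,
              int nx, int ny, int nz, int up, int down,
              cudaStream_t compute, cudaStream_t copy)
{
    size_t plane = (size_t)nx * ny * sizeof(float);
    MPI_Request reqs[2];

    // Stage our boundary plane to the host, then start the exchange.
    cudaMemcpyAsync(h_send, d_phi, plane, cudaMemcpyDeviceToHost, copy);
    cudaStreamSynchronize(copy);
    MPI_Isend(h_send, nx * ny, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv, nx * ny, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &reqs[1]);

    // The interior update proceeds while the messages are in flight.
    dim3 blk(8, 8, 8), grd(nx / 8, ny / 8, nz / 8);
    update_interior<<<grd, blk, 0, compute>>>(d_phi, nx, ny, nz);

    // Finish the exchange, move the halo to the device, update the boundary.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_halo, h_recv, plane, cudaMemcpyHostToDevice, copy);
    cudaStreamSynchronize(copy);
    update_boundary<<<dim3(nx / 8, ny / 8), dim3(8, 8), 0, compute>>>(d_phi, d_halo, nx, ny, nz);
}
```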
Accelerating EPI distortion correction by utilizing a modern GPU-based parallel computation.
Yang, Yao-Hao; Huang, Teng-Yi; Wang, Fu-Nien; Chuang, Tzu-Chao; Chen, Nan-Kuei
2013-04-01
The combination of phase demodulation and field mapping is a practical method to correct echo planar imaging (EPI) geometric distortion. However, since phase dispersion accumulates in each phase-encoding step, the calculation complexity of phase demodulation is Ny-fold higher than that of conventional image reconstruction. Thus, correcting EPI images via phase demodulation is generally a time-consuming task. Parallel computing by employing general-purpose calculations on graphics processing units (GPU) can accelerate scientific computing if the algorithm is parallelized. This study proposes a method that incorporates the GPU-based technique into phase demodulation calculations to reduce computation time. The proposed parallel algorithm was applied to a PROPELLER-EPI diffusion tensor data set. The GPU-based phase demodulation method correctly reduced the EPI distortion and accelerated the computation. The total reconstruction time of the 16-slice PROPELLER-EPI diffusion tensor images with a matrix size of 128 × 128 was reduced from 1,754 seconds to 101 seconds by utilizing the parallelized 4-GPU program. GPU computing is a promising method to accelerate EPI geometric correction. The resulting reduction in the computation time of phase demodulation should accelerate postprocessing for studies performed with EPI, and should make the PROPELLER-EPI technique practical for clinical use.
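The Ny-fold overhead can be seen directly in a sketch of a demodulating reconstruction: the y-axis inverse FFT is replaced by an explicit DFT whose basis carries the field-map phase, so every pixel costs Ny operations. Sign conventions, units, and names below are assumptions for illustration, not the paper's implementation:

```cuda
// Illustrative kernel (not the paper's code): reconstruct along y with an
// explicit DFT demodulated by the field map dB (Hz); esp is the effective
// echo spacing (s). Each pixel costs Ny operations -- the Ny-fold overhead.
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void demod_recon(const float2* sig,  // data after x-FFT: [ny][nx]
                            const float* dB,    // field map in Hz: [ny][nx]
                            float2* img, int nx, int ny, float esp)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float2 acc = make_float2(0.f, 0.f);
    for (int ky = 0; ky < ny; ++ky) {
        // Fourier phase minus field-map phase accrued over ky echo spacings.
        float ph = 2.f * CUDART_PI_F *
                   (ky * (float)y / ny - dB[y * nx + x] * esp * ky);
        float c, s;
        sincosf(ph, &c, &s);
        float2 v = sig[ky * nx + x];
        acc.x += v.x * c - v.y * s;   // real part of v * exp(i*ph)
        acc.y += v.x * s + v.y * c;   // imaginary part
    }
    img[y * nx + x] = acc;
}
```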
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Y (UT Southwestern Medical Center, Dallas, TX); Tian, Z
2015-06-15
Purpose: Intensity-modulated proton therapy (IMPT) is increasingly used in proton therapy. For IMPT optimization, Monte Carlo (MC) is desired for spot dose calculations because of its high accuracy, especially in cases with a high level of heterogeneity. It is also preferred in biological optimization problems due to its capability of computing quantities related to biological effects. However, MC simulation is typically too slow to be used for this purpose. Although GPU-based MC engines have become available, the achieved efficiency is still not ideal. The purpose of this work is to develop a new optimization scheme to include GPU-based MC into IMPT. Methods: A conventional approach using MC in IMPT simply calls the MC dose engine repeatedly for each spot's dose calculation. However, this is not the optimal approach, because of the unnecessary computations on spots that turn out to have very small weights after solving the optimization problem. GPU-memory writing conflicts occurring at small beam sizes also reduce computational efficiency. To solve these problems, we developed a new framework that iteratively performs MC dose calculations and plan optimizations. At each dose calculation step, the particles were sampled from all spots together with a Metropolis algorithm, such that the particle number is proportional to the latest optimized spot intensity. Simultaneously transporting particles from multiple spots also mitigated the memory writing conflict problem. Results: We have validated the proposed MC-based optimization scheme in one prostate case. The total computation time of our method was ~5-6 min on one NVIDIA GPU card, including both spot dose calculation and plan optimization, whereas a conventional method naively using the same GPU-based MC engine was ~3 times slower. Conclusion: A fast GPU-based MC dose calculation method along with a novel optimization workflow was developed. The high efficiency makes it attractive for clinical use.
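The abstract samples particles with a Metropolis algorithm; as a simpler stand-in that achieves the same proportionality, the sketch below (all names assumed, not the authors' code) draws each particle's spot index by inverting a cumulative distribution built from the latest optimized spot intensities:

```cuda
// Sketch: assign particles to spots in proportion to spot weight by
// inverting a normalized CDF (cdf[nSpots-1] == 1) with binary search.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void sample_spots(const float* cdf, int nSpots,
                             int* spotOfParticle, int nParticles,
                             unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nParticles) return;

    curandState rng;
    curand_init(seed, tid, 0, &rng);
    float u = curand_uniform(&rng);

    // Binary search: smallest index with cdf[i] >= u.
    int lo = 0, hi = nSpots - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    spotOfParticle[tid] = lo;  // particle counts end up proportional to weights
}
```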
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xiao, K; Chen, D Z; Hu, X S
Purpose: It is well known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing pattern in dose deposition, which leads to several memory efficiency issues on the GPU, such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition accumulates dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer in GPU memory, which achieves fully coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on the CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Compared with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method achieved a 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance of MCCS on CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grant CCF-1217906, and also in part by a research contract from the Sandia National Laboratories.
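A minimal sketch of the three-step pipeline, under the assumptions that each chunk of rays fills its deposit buffer exactly and that the host buffers are pinned, might look like this (names are illustrative, not the authors' code):

```cuda
// Sketch of the pipelined deposit scheme: GPU writes (voxel, dose) pairs to
// a buffer, DMA moves the buffer to the host, and the CPU accumulates the
// previous chunk while the GPU traces the next one (double buffering).
#include <cuda_runtime.h>

struct Deposit { int voxel; float dose; };

// Assumed defined elsewhere: each ray writes a fixed number of deposits
// into buf, fully coalesced and without atomics.
__global__ void trace_rays(Deposit* buf, int firstRay, int raysPerChunk);

void run_pipeline(Deposit* d_buf[2], Deposit* h_buf[2], float* doseVolume,
                  int nChunks, int raysPerChunk, int depositsPerChunk)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i <= nChunks; ++i) {
        int b = i & 1;  // GPU and CPU alternate between the two buffers
        if (i < nChunks) {
            trace_rays<<<256, 256, 0, s[b]>>>(d_buf[b], i * raysPerChunk, raysPerChunk);
            cudaMemcpyAsync(h_buf[b], d_buf[b], depositsPerChunk * sizeof(Deposit),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        if (i > 0) {
            int p = (i - 1) & 1;
            cudaStreamSynchronize(s[p]);  // wait for chunk i-1's buffer
            // CPU accumulation of chunk i-1 overlaps GPU work on chunk i.
            for (int k = 0; k < depositsPerChunk; ++k)
                doseVolume[h_buf[p][k].voxel] += h_buf[p][k].dose;
        }
    }
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```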
Nagaoka, Tomoaki; Watanabe, Soichi
2012-01-01
Electromagnetic simulation with an anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapted three-dimensional FDTD code to a multi-GPU cluster environment with the Compute Unified Device Architecture and the Message Passing Interface. Our multi-GPU cluster system consists of three nodes, with seven GPU boards (NVIDIA Tesla C2070) mounted on each node. We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than that on a multi-GPU single workstation, and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform large-scale FDTD calculations because we were able to use over 100 GB of GPU memory.
GPU-Q-J, a fast method for calculating root mean square deviation (RMSD) after optimal superposition
2011-01-01
Background: Calculation of the root mean square deviation (RMSD) between the atomic coordinates of two optimally superposed structures is a basic component of structural comparison techniques. We describe a quaternion-based method, GPU-Q-J, that is stable with single-precision calculations and suitable for graphics processing units (GPUs). The application was implemented on an ATI 4770 graphics card in C/C++ and Brook+ in Linux, where it was 260 to 760 times faster than existing unoptimized CPU methods. Source code is available from the Compbio website http://software.compbio.washington.edu/misc/downloads/st_gpu_fit/ or from the author LHH. Findings: The Nutritious Rice for the World Project (NRW) on World Community Grid predicted, de novo, the structures of over 62,000 small proteins and protein domains, returning a total of 10 billion candidate structures. Clustering ensembles of structures on this scale requires calculation of large similarity matrices consisting of RMSDs between each pair of structures in the set. As a real-world test, we calculated the matrices for 6 different ensembles from NRW. The GPU method was 260 times faster than the fastest existing CPU-based method and over 500 times faster than the method that had been previously used. Conclusions: GPU-Q-J is a significant advance over previous CPU methods. It relieves a major bottleneck in the clustering of large numbers of structures for NRW. It also has applications in structure comparison methods that involve multiple superposition and RMSD determination steps, particularly when such methods are applied on a proteome- and genome-wide scale. PMID:21453553
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tian, Z; Shi, F; Jia, X
Purpose: GPUs have been employed to speed up VMAT optimization from hours to minutes. However, their limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose a multi-GPU-based VMAT optimization to solve this memory issue and make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred to as the pricing problem, PP) and optimizing aperture intensities (referred to as the master problem, MP). The PP requires access to the large DDC matrix, which is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of the beam angles and is only responsible for calculations related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. The MP is a relatively small-scale problem and is implemented on one GPU. One head-and-neck cancer case was used for testing. Three different strategies for VMAT optimization on a single GPU were also implemented for comparison: (S1) truncating the DDC matrix to ignore its small-value entries during optimization; (S2) transferring the DDC matrix part by part to the GPU during optimization whenever needed; (S3) moving the DDC matrix-related calculations onto the CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality was worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as our method. However, the computation times are longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPU-based ultra-fast VMAT planning practical for real clinical use.
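In outline, the pricing-problem data layout amounts to the following sketch: each device holds its DDC sub-matrix and computes a partial result for its share of beam angles, after which an inter-GPU reduction combines them. All names are assumptions for illustration, not the authors' code:

```cuda
// Sketch: one DDC sub-matrix per GPU; each device computes only the
// partial dose contribution of its beam angles. The cross-GPU reduction
// (peer copies or host staging) is indicated but not shown.
#include <cuda_runtime.h>
#include <vector>

// Assumed defined elsewhere: partial sparse/dense matrix-vector product.
__global__ void ddc_times_x(const float* ddcPart, const float* x,
                            float* partialDose, int nVoxels, int nBeamletsPart);

void price_apertures(const std::vector<float*>& d_ddcPart,  // one sub-matrix per GPU
                     const std::vector<float*>& d_x,
                     const std::vector<float*>& d_partial,
                     int nGpus, int nVoxels, int beamletsPerGpu)
{
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);  // each device works on its own beam angles
        ddc_times_x<<<(nVoxels + 255) / 256, 256>>>(
            d_ddcPart[g], d_x[g], d_partial[g], nVoxels, beamletsPerGpu);
    }
    // A broadcast/parallel-reduction step across GPUs would combine the
    // d_partial arrays here, mirroring the scheme the abstract describes.
    for (int g = 0; g < nGpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```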
Zhu, Jinhan; Chen, Lixin; Chen, Along; Luo, Guangwen; Deng, Xiaowu; Liu, Xiaowei
2015-04-11
To use a graphics processing unit (GPU) calculation engine to implement a fast 3D pre-treatment dosimetric verification procedure based on an electronic portal imaging device (EPID). The GPU algorithm includes the deconvolution and convolution method for the fluence-map calculations, the collapsed-cone convolution/superposition (CCCS) algorithm for the 3D dose calculations and the 3D gamma evaluation calculations. The results of the GPU-based CCCS algorithm were compared to those of Monte Carlo simulations. The planned and EPID-based reconstructed dose distributions in overridden-to-water phantoms and the original patients were compared for 6 MV and 10 MV photon beams in intensity-modulated radiation therapy (IMRT) treatment plans based on dose differences and gamma analysis. The total single-field dose computation time was less than 8 s, and the gamma evaluation for a 0.1-cm grid resolution was completed in approximately 1 s. The results of the GPU-based CCCS algorithm exhibited good agreement with those of the Monte Carlo simulations. The gamma analysis indicated good agreement between the planned and reconstructed dose distributions for the treatment plans. For the target volume, the differences in the mean dose were less than 1.8%, and the differences in the maximum dose were less than 2.5%. For the critical organs, minor differences were observed between the reconstructed and planned doses. The GPU calculation engine was used to boost the speed of 3D dose and gamma evaluation calculations, thus offering the possibility of true real-time 3D dosimetric verification.
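As a hedged illustration of the 3D gamma evaluation mentioned above (not the authors' implementation), each voxel's gamma value can be computed by a brute-force search over a small neighborhood, which parallelizes naturally with one thread per voxel:

```cuda
// Sketch: 3D gamma index with dose criterion dd (absolute dose units of
// ref/eval) and distance criterion dta (in voxel units); w is the search
// window radius in voxels, chosen comfortably larger than dta.
#include <cuda_runtime.h>
#include <math.h>

__global__ void gamma3d(const float* ref, const float* eval, float* gamma,
                        int nx, int ny, int nz, float dd, float dta, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    size_t c = ((size_t)z * ny + y) * nx + x;
    float best = 1e30f;
    for (int dz = -w; dz <= w; ++dz)
      for (int dy = -w; dy <= w; ++dy)
        for (int dx = -w; dx <= w; ++dx) {
            int xx = x + dx, yy = y + dy, zz = z + dz;
            if (xx < 0 || yy < 0 || zz < 0 || xx >= nx || yy >= ny || zz >= nz)
                continue;
            float dD = eval[((size_t)zz * ny + yy) * nx + xx] - ref[c];
            float r2 = (float)(dx * dx + dy * dy + dz * dz);
            float g2 = r2 / (dta * dta) + dD * dD / (dd * dd);
            best = fminf(best, g2);  // squared gamma for this candidate point
        }
    gamma[c] = sqrtf(best);
}
```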
SU-E-T-29: A Web Application for GPU-Based Monte Carlo IMRT/VMAT QA with Delivered Dose Verification
DOE Office of Scientific and Technical Information (OSTI.GOV)
Folkerts, M (University of California, San Diego, La Jolla, CA); Graves, Y
Purpose: To enable an existing web application for GPU-based Monte Carlo (MC) 3D dosimetry quality assurance (QA) to compute “delivered dose” from linac logfile data. Methods: We added significant features to an IMRT/VMAT QA web application which is based on existing technologies (HTML5, Python, and Django). This tool interfaces with Python, C-code libraries, and command-line-based GPU applications to perform MC-based IMRT/VMAT QA. The web app automates many complicated aspects of interfacing clinical DICOM and logfile data with cutting-edge GPU software to run an MC dose calculation. The resultant web app is powerful, easy to use, and able to re-compute both plan dose (from DICOM data) and delivered dose (from logfile data). Both dynalog and trajectorylog file formats are supported. Users upload zipped DICOM RP, CT, and RD data and set the expected statistical uncertainty for the MC dose calculation. A 3D gamma index map, 3D dose distribution, gamma histogram, dosimetric statistics, and DVH curves are displayed to the user. Additionally, the user may upload the delivery logfile data from the linac to compute a “delivered dose” calculation and corresponding gamma tests. A comprehensive PDF QA report summarizing the results can also be downloaded. Results: We successfully improved a web app for a GPU-based QA tool that consists of logfile parsing, fluence map generation, CT image processing, GPU-based MC dose calculation, gamma index calculation, and DVH calculation. The result is an IMRT and VMAT QA tool that conducts an independent dose calculation for a given treatment plan and delivery log file. The system takes both DICOM data and logfile data to compute plan dose and delivered dose, respectively. Conclusion: We successfully improved a GPU-based MC QA tool to allow for logfile dose calculation. The high efficiency and accessibility will greatly facilitate IMRT and VMAT QA.
Li, Yongbao; Tian, Zhen; Song, Ting; Wu, Zhaoxia; Liu, Yaqiang; Jiang, Steve; Jia, Xun
2017-01-07
Monte Carlo (MC)-based spot dose calculation is highly desired for inverse treatment planning in proton therapy because of its accuracy. Recent studies on biological optimization have also indicated the use of MC methods to compute relevant quantities of interest, e.g. linear energy transfer. Although GPU-based MC engines have been developed to address inverse optimization problems, their efficiency still needs to be improved. Also, the use of a large number of GPUs in MC calculation is not favorable for clinical applications. The previously proposed adaptive particle sampling (APS) method can improve the efficiency of MC-based inverse optimization by using the computationally expensive MC simulation more effectively. This method is more efficient than the conventional approach that performs spot dose calculation and optimization in two sequential steps. In this paper, we propose a computational library to perform MC-based spot dose calculation on GPU with the APS scheme. The implemented APS method performs a non-uniform sampling of the particles from pencil beam spots during the optimization process, favoring those from the high intensity spots. The library also conducts two computationally intensive matrix-vector operations frequently used when solving an optimization problem. This library design allows a streamlined integration of the MC-based spot dose calculation into an existing proton therapy inverse planning process. We tested the developed library in a typical inverse optimization system with four patient cases. The library achieved the targeted functions by supporting inverse planning in various proton therapy schemes, e.g. single field uniform dose, 3D intensity modulated proton therapy, and distal edge tracking. The efficiency was 41.6 ± 15.3% higher than the use of a GPU-based MC package in a conventional calculation scheme. The total computation time ranged between 2 and 50 min on a single GPU card depending on the problem size.
An Approach in Radiation Therapy Treatment Planning: A Fast, GPU-Based Monte Carlo Method.
Karbalaee, Mojtaba; Shahbazi-Gahrouei, Daryoush; Tavakoli, Mohammad B
2017-01-01
An accurate and fast radiation dose calculation is essential for successful radiotherapy. The aim of this study was to implement a new graphics processing unit (GPU)-based radiation therapy treatment planning system for accurate and fast dose calculation in radiotherapy centers. A program was written to run in parallel on the GPU. The code was validated against EGSnrc/DOSXYZnrc. Moreover, a semi-automatic, rotary, asymmetric phantom was designed and produced using bone, lung, and soft-tissue equivalent materials. All measurements were performed using a MapCHECK dosimeter. The accuracy of the code was validated using the experimental data obtained from the anthropomorphic phantom as the gold standard. The findings showed that, compared with DOSXYZnrc in the virtual phantom, most of the voxels (>95%) met a <3% dose-difference or 3 mm distance-to-agreement (DTA) criterion. Moreover, for the anthropomorphic phantom, compared to the MapCHECK dose measurements, a <5% dose-difference or 5 mm DTA was observed. The fast calculation speed and high accuracy of the GPU-based Monte Carlo method in dose calculation may be useful in routine radiation therapy centers as the core and main component of a treatment plan verification system.
SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
Liu, T; Ding, A; Xu, X
2012-06-01
To develop a graphics processing unit (GPU)-based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on an Intel Xeon X5660 2.8 GHz CPU; it was then translated into a GPU version using CUDA C 4.1 and run on a dual Tesla M2090 GPU system. The code featured automatic assignment of simulation tasks to multiple GPUs, as well as accurate calculation of energy- and material-dependent cross-sections. Double-precision floating-point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the GPU MC code was found to be 19 times faster than the CPU code and 42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose results were benchmarked against MCNPX, and a maximum difference of 1% was observed when the relative error was kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system.
Rapid Parallel Calculation of Shell Elements Based on GPU
NASA Astrophysics Data System (ADS)
Wang, Jian Hua; Li, Guang Yao; Li, Sheng
2010-06-01
Long computing times have bottlenecked the application of the finite element method. In this paper, an effective method to speed up FEM calculations using a modern graphics processing unit and programmable colored rendering tools is put forward. The representation of element information was devised in accordance with the features of the GPU, all element calculations were converted into a film rendering process, the internal force calculation for all elements was simulated, and the low level of parallelism previously achieved on a single computer was overcome. Studies show that this method can improve efficiency and greatly shorten calculation time. The results of simulation calculations on the elasticity problem of a large number of shell elements in sheet metal prove that GPU-based parallel simulation is faster than the CPU-based one. This approach is useful and efficient for solving engineering problems.
Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card
NASA Astrophysics Data System (ADS)
Jiang, Jinpeng; Zhu, Peimin
2018-05-01
Full waveform inversion (FWI) is a challenging procedure due to the high computational cost of the modeling, especially in the elastic case. The graphics processing unit (GPU) has become a popular device for high-performance computing (HPC). To reduce the long computation time, we design and implement a GPU-based 2D elastic FWI (EFWI) in the time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of the relatively small global memory on the GPU, a boundary-saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion improves the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain accurate inversion results. In our tests, the GPU-based implementation using a single GPU device achieves a >15-fold speedup in forward modeling, and about a 12-fold speedup in the gradient calculation, compared with eight-core CPU implementations optimized by OpenMP. The results from the GPU implementation are verified to be sufficiently accurate by comparison with the results obtained from the CPU implementations.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)
NASA Astrophysics Data System (ADS)
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun
2015-09-01
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphics processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in the EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogeneous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by successfully running it on a variety of different computing devices including an NVidia GPU card, two AMD GPU cards and an Intel CPU processor. Computational efficiency among these platforms was compared.
SU-E-T-37: A GPU-Based Pencil Beam Algorithm for Dose Calculations in Proton Radiation Therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kalantzis, G; Leventouri, T; Tachibana, H
Purpose: Recent developments in radiation therapy have focused on applications of charged particles, especially protons. Over the years, several dose calculation methods have been proposed in proton therapy. A common characteristic of all these methods is their extensive computational burden. In the current study we present, to the best of our knowledge for the first time, a GPU-based pencil beam algorithm (PBA) for proton dose calculations in Matlab. Methods: We employed an analytical expression for the proton depth-dose distribution. The central-axis term is taken from the broad-beam central-axis depth dose in water, modified by an inverse square correction, while the distribution of the off-axis term was considered Gaussian. The serial code was implemented in MATLAB and was launched on a desktop with a quad-core Intel Xeon X5550 at 2.67 GHz with 8 GB of RAM. For the parallelization on the GPU, the parallel computing toolbox was employed and the code was launched on a GTX 770 with Kepler architecture. The performance comparison was based on speedup factors. Results: The performance of the GPU code was evaluated for three different energies: low (50 MeV), medium (100 MeV) and high (150 MeV). Four square fields were selected for each energy, and the dose calculations were performed with both the serial and parallel codes for a homogeneous water phantom with size 300×300×300 mm³. The resolution of the PBs was set to 1.0 mm. A maximum speedup of ~127 was achieved for the highest energy and the largest field size. Conclusion: A GPU-based PB algorithm for proton dose calculations in Matlab was presented. A maximum speedup of ~127 was achieved. Future directions of the current work include the extension of our method to dose calculation in heterogeneous phantoms.
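The dose model described above is compact enough to show directly. The sketch below expresses it as a CUDA kernel rather than the paper's Matlab, with assumed lookup tables for the central-axis depth dose DD(z) and the lateral spread sigma(z):

```cuda
// Sketch of the analytical pencil-beam model: central-axis depth dose with
// an inverse-square correction, times a normalized 2D lateral Gaussian.
// depthDose[k] and sigma[k] are assumed per-depth lookup tables.
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void pb_dose(float* dose, const float* depthDose, const float* sigma,
                        int nx, int ny, int nz, float dxy, float dz,
                        float x0, float y0, float ssd)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || k >= nz) return;

    float z  = (k + 0.5f) * dz;
    float r2 = (i * dxy - x0) * (i * dxy - x0) + (j * dxy - y0) * (j * dxy - y0);
    float s2 = sigma[k] * sigma[k];

    float invsq   = (ssd * ssd) / ((ssd + z) * (ssd + z));              // inverse square
    float lateral = expf(-0.5f * r2 / s2) / (2.f * CUDART_PI_F * s2);   // off-axis Gaussian

    dose[((size_t)k * ny + j) * nx + i] += depthDose[k] * invsq * lateral;
}
```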
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach of using the graphics processing unit (GPU) to accelerate the rescaling of single Monte Carlo runs so as to rapidly calculate diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08 ms. The transfer time from CPU to GPU memory is currently a limiting factor for GPU-based calculations. However, for the calculation of multiple diffuse reflectance values, our GPU-based approach can still lead to processing that is ~3400 times faster than other GPU-based approaches.
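As a minimal illustration of the rescaling idea (absorption only, which is one common variant; the authors' full scheme also handles scattering changes), diffuse reflectance for a new absorption coefficient can be recovered from a single baseline run by reweighting each stored photon path:

```cuda
// Sketch: "white Monte Carlo" absorption rescaling. Each exiting photon's
// stored total path length is reweighted by exp(-mua * L); a block-level
// reduction produces per-block partial sums that the host adds up.
// Launch with blockDim.x == 256 to match the shared-memory array.
#include <cuda_runtime.h>
#include <math.h>

__global__ void rescale_reflectance(const float* pathLen,  // per photon [cm]
                                    float* partial, int nPhotons,
                                    float mua /* 1/cm */)
{
    __shared__ float cache[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float w = (tid < nPhotons) ? expf(-mua * pathLen[tid]) : 0.f;
    cache[threadIdx.x] = w;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}
```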
NASA Astrophysics Data System (ADS)
Jia, Weile; Wang, Jue; Chi, Xuebin; Wang, Lin-Wang
2017-02-01
LS3DF, the linear scaling three-dimensional fragment method, is an efficient linear-scaling ab initio total energy electronic structure calculation code based on a divide-and-conquer strategy. In this paper, we present our GPU implementation of the LS3DF code. Our test results show that the GPU code can calculate systems with about ten thousand atoms fully self-consistently on the order of 10 min using thousands of computing nodes. This makes electronic structure calculations of 10,000-atom nanosystems routine work. This speed is 4.5-6 times faster than the CPU calculations using the same number of nodes on the Titan machine at the Oak Ridge Leadership Computing Facility (OLCF). Such speedup is achieved by (a) careful redesign of the computationally heavy kernels and (b) redesign of the communication pattern for heterogeneous supercomputers.
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations.
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B; Jia, Xun
2015-10-21
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency. In principle, an analytical source model is preferable for GPU-based MC dose engines over a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we present an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called the phase-space ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy, and resident in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction, and energy for each primary photon PSR, scattered photon PSR, and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive the corresponding model parameters. To efficiently use our model in MC dose calculations on the GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy, to alleviate GPU thread divergence. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum dose difference within 1.7%. The maximum relative difference in output factors was within 0.5%. An over 98.5% passing rate was achieved in 3D gamma-index tests with 2%/2 mm criteria in both an IMRT prostate patient case and a head-and-neck case. These results demonstrate the efficacy of our model in accurately representing a reference phase-space file. We also tested the efficiency gain of our source model over our previously developed phase-space-let file source model. The overall efficiency of dose calculation was found to be improved by ~1.3-2.2 times in water and patient cases using our analytical model.
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.
Nagaoka, Tomoaki; Watanabe, Soichi
2011-01-01
Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapted three-dimensional FDTD code to a multi-GPU environment using the Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 cards as GPGPU boards. The performance of the multi-GPU configuration was evaluated in comparison with that of a single GPU and a vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. The calculation speed of the three-dimensional FDTD method using GPUs can improve significantly as the number of GPUs increases.
Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long
2016-04-01
Faster and more accurate image registration methods are important both for research conducting population-based studies that utilize medical imaging and for clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on the GPU. The framework consists of a hierarchy of B-spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD requires calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The DMTC method enabled reduced computation and memory storage of variables, with minimal communication between the GPU and the central processing unit (CPU), due to the ability to pre-compute values. The method was assessed on six healthy human subjects. The resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup were compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. The best speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost-gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration, using an Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time was reduced to 2.9 min with the GPU version, compared to 12.8 min with the twelve-threaded CPU version and 112.5 min with a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use with other cost functions that require calculation of the first derivatives.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wan Chan Tseung, H; Beltran, C
Purpose: To demonstrate fast and accurate Monte Carlo (MC) calculations of proton dose-averaged linear energy transfer (LETd) and biological dose (BD) on a graphics processing unit (GPU) card. Methods: A previously validated GPU-based MC simulation of proton transport was used to rapidly generate LETd distributions for proton treatment plans. Since this MC handles proton-nuclei interactions on an event-by-event basis using a Bertini intranuclear cascade-evaporation model, secondary protons were taken into account. The smaller contributions of secondary neutrons and recoil nuclei were ignored. Recent work has shown that LETd values are sensitive to the scoring method. The GPU-based LETd calculations were verified by comparing with a TOPAS custom scorer that uses tabulated stopping powers, following recommendations by other authors. Comparisons were made for prostate and head-and-neck patients. A Python script is used to convert the MC-generated LETd distributions to BD using a variety of published linear-quadratic models, and to export the BD in DICOM format for subsequent evaluation. Results: Very good agreement is obtained between TOPAS and our GPU MC. Given a complex head-and-neck plan with 1 mm voxel spacing, the physical dose, LETd and BD calculations for 10⁸ proton histories can be completed in ~5 minutes using an NVIDIA Titan X card. The rapid turnover means that MC feedback can be obtained on dosimetric plan accuracy as well as BD hotspot locations, particularly in regard to their proximity to critical structures. In our institution, the GPU MC-generated dose, LETd and BD maps are used to assess plan quality for all patients undergoing treatment. Conclusion: Fast and accurate MC-based LETd calculations can be performed on the GPU. The resulting BD maps provide valuable feedback during treatment plan review. Partially funded by Varian Medical Systems.
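For reference, dose-averaged LET in a voxel is LETd = sum_i(e_i * L_i) / sum_i(e_i), over energy depositions e_i with stopping power L_i. A sketch of one common scoring convention (names assumed, not the authors' code, and only one of the several conventions the abstract alludes to) is:

```cuda
// Sketch: accumulate the LET-weighted energy and the energy per voxel
// during transport, then divide once at the end.
#include <cuda_runtime.h>

__device__ void score_step(float* sumEL, float* sumE, int voxel,
                           float edep /* MeV */, float let /* keV/um */)
{
    atomicAdd(&sumEL[voxel], edep * let);
    atomicAdd(&sumE[voxel], edep);
}

__global__ void finalize_letd(const float* sumEL, const float* sumE,
                              float* letd, int nVoxels)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < nVoxels) letd[v] = (sumE[v] > 0.f) ? sumEL[v] / sumE[v] : 0.f;
}
```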
Kalantzis, Georgios; Tachibana, Hidenobu
2014-01-01
For microdosimetric calculations, event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of these methods is their extensive computational time requirement. In this work we present an event-by-event MC code for low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphics processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both the GPU and the multi-core CPU were utilized simultaneously. The two implementation schemes were tested and compared with the sequential single-threaded MC code on the CPU. The performance comparison was established on the speedup for a set of benchmark cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of up to 20% in speedup was achieved with the hybrid approach. The results indicate the capability of our CPU-GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison
Ma, Chao; Wang, Lirong; Xie, Xiang-Qun
2012-01-01
Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of the multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times faster than existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than an existing GPU-accelerated sparse-vector algorithm when Unity fingerprints are used for the Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
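The intrinsic-instruction trick the abstract credits for the extra speedup is population count: with binary fingerprints packed into 32-bit words, the Tanimoto coefficient T = |A∩B| / (|A| + |B| - |A∩B|) reduces to bitwise AND plus __popc. A hedged sketch with assumed names:

```cuda
// Sketch: Tanimoto similarity of one query fingerprint against a library,
// one thread per library compound, using the hardware popcount intrinsic.
#include <cuda_runtime.h>

__global__ void tanimoto_row(const unsigned* fps,  // nMols x words, bit-packed
                             int words, int nMols, int query, float* out)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= nMols) return;

    int common = 0, aBits = 0, bBits = 0;
    for (int w = 0; w < words; ++w) {
        unsigned a = fps[(size_t)query * words + w];
        unsigned b = fps[(size_t)m * words + w];
        common += __popc(a & b);   // bits set in both fingerprints
        aBits  += __popc(a);
        bBits  += __popc(b);
    }
    int uni = aBits + bBits - common;
    out[m] = (uni > 0) ? (float)common / uni : 0.f;
}
```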
Implementing a GPU-based numerical algorithm for modelling dynamics of a high-speed train
NASA Astrophysics Data System (ADS)
Sytov, E. S.; Bratus, A. S.; Yurchenko, D.
2018-04-01
This paper discusses the initiative of implementing a GPU-based numerical algorithm for studying various phenomena associated with the dynamics of high-speed railway transport. The proposed numerical algorithm for calculating the critical speed of the bogie is based on the first Lyapunov number. The numerical algorithm is validated against analytical results derived for a simple model. A dynamic model of a carriage connected to a new dual-wheelset flexible bogie is studied for linear and dry friction damping. Numerical results obtained by the CPU, MPU and GPU approaches are compared, and the appropriateness of these methods is discussed.
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.
Ma, Jiasen; Beltran, Chris; Wan Chan Tseung, Hok Seum; Herman, Michael G
2014-12-01
Conventional spot-scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC-generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ~1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was completed in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans, based on DVH comparisons. An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around $45,000. The fast calculation and optimization make the system easily expandable to robust and multi-criteria optimization.
NASA Astrophysics Data System (ADS)
Zhang, Jilin; Sha, Chaoqun; Wu, Yusen; Wan, Jian; Zhou, Li; Ren, Yongjian; Si, Huayou; Yin, Yuyu; Jing, Ya
2017-02-01
The GPU is not only used in the field of graphics but has also been widely adopted in areas requiring large numbers of numerical calculations. In the energy industry, because of its low carbon emissions, high energy density, long duration and other characteristics, nuclear energy cannot easily be replaced by other energy sources. Management of core fuel is one of the major areas of concern in a nuclear power plant, and it is directly related to the economic benefits and cost of nuclear power. The equations describing a large-scale reactor core are large and complicated, so the calculation of the diffusion equation is crucial in the core fuel management process. In this paper, we use CUDA programming technology on a GPU cluster to run the LU-SGS parallel iterative calculation against the background of the reactor diffusion equation. We divide the one-dimensional and two-dimensional meshes into a number of domains, with the domains evenly distributed over the GPU blocks. A parallel collision scheme is put forward that defines virtual boundaries where the grids exchange information and transmit data through successive collisions. Compared with the serial program, the experiments show that the GPU greatly improves the efficiency of program execution and verify that the GPU is playing a much more important role in the field of numerical calculations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Renaud, M; Seuntjens, J; Roberge, D
Purpose: Assessing the performance and uncertainty of a pre-calculated Monte Carlo (PMC) algorithm for proton and electron transport running on graphics processing units (GPU). While PMC methods have been described in the past, an explicit quantification of the latent uncertainty arising from recycling a limited number of tracks in the pre-generated track bank is missing from the literature. With a proper uncertainty analysis, an optimal pre-generated track bank size can be selected for a desired dose calculation uncertainty. Methods: Particle tracks were pre-generated for electrons and protons using EGSnrc and GEANT4, respectively. The PMC algorithm for track transport was implemented on the CUDA programming framework. GPU-PMC dose distributions were compared to benchmark dose distributions simulated using general-purpose MC codes in the same conditions. A latent uncertainty analysis was performed by comparing GPU-PMC dose values to a “ground truth” benchmark while varying the track bank size and primary particle histories. Results: GPU-PMC dose distributions and benchmark doses were within 1% of each other in voxels with dose greater than 50% of Dmax. In proton calculations, a submillimeter distance-to-agreement error was observed at the Bragg peak. Latent uncertainty followed a Poisson distribution with the number of tracks per energy (TPE); a track bank of 20,000 TPE produced a latent uncertainty of approximately 1%. Efficiency analysis showed a 937× and 508× gain over a single processor core running DOSXYZnrc for 16 MeV electrons in water and bone, respectively. Conclusion: The GPU-PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty below 1%. The track bank size necessary to achieve an optimal efficiency can be tuned based on the desired uncertainty. Coupled with a model to calculate dose contributions from uncharged particles, GPU-PMC is a candidate for inverse planning of modulated electron radiotherapy and scanned proton beams. This work was supported in part by FRSQ-MSSS (Grant No. 22090), NSERC RG (Grant No. 432290) and CIHR MOP (Grant No. MOP-211360).
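The reported Poisson-like behavior implies the usual inverse-square-root scaling of latent uncertainty with track bank size. The constant below is inferred from the quoted numbers (20,000 TPE at roughly 1%), not stated explicitly in the abstract:

```latex
\sigma_{\text{latent}}(N_{\text{TPE}}) \approx \frac{k}{\sqrt{N_{\text{TPE}}}}, \qquad k \approx 0.01 \times \sqrt{20\,000} \approx 1.4
```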
Yang, Yiqun; Urban, Matthew W; McGough, Robert J
2018-05-15
Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method and the angular spectrum approach, and then the shear waves are calculated in parallel with Green's functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high-performance nVidia GPU reduces the time required for these calculations by factors of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations were also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green's function approach are ideally suited for high-performance GPUs.
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moriya, S; Sato, M; Tachibana, H
Purpose: Calculation time is the trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on graphics processing units (GPU). Methods: The calculation was performed on AMD Dual FirePro D700 graphics hardware, and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation process was separated into TERMA and KERMA steps, and the dose deposited at each coordinate (x, y, z) was determined in the process. In the dose calculation running on an Intel Xeon E5 central processing unit (CPU), the calculation loops were performed over all calculation points. In the GPU computation, all of the calculation processes for the points were sent to the GPU and multi-thread computation was done. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150³ voxels (2 mm calculation grid), and the calculation speed on the GPU was compared to that on the CPU along with the accuracy of the PDD. Results: The calculation times for the GPU and the CPU were 3.3 s and 4.4 h, respectively; the GPU was thus 4800 times faster than the CPU. The PDD curve for the GPU matched that for the CPU perfectly. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable in terms of time and may be more accurate in inhomogeneous regions. Intensity-modulated arc therapy needs dose calculations for different gantry angles at many control points, so it would be more practical for the kernel to use a coarse-spacing technique if the calculation is faster while keeping accuracy similar to a current treatment planning system.
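The point-by-point GPU strategy described above amounts to a gather-style convolution, one thread per dose voxel. A self-contained sketch under that assumption (the kernel lookup is a toy placeholder, not the study's ideal KERMA kernel; sketched in CUDA C++ rather than the authors' Aparapi/OpenCL stack):

```cuda
// Placeholder point-spread kernel: inverse-square-like falloff with a cap.
__device__ inline float kernelValue(int dx, int dy, int dz)
{
    float r2 = (float)(dx * dx + dy * dy + dz * dz);
    return 1.0f / (1.0f + r2);
}

// Naive dense gather: one thread per dose voxel sums kernel-weighted TERMA
// contributions from every interaction voxel (N = grid size per axis).
__global__ void convolveDose(const float* terma, float* dose, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N * N * N) return;
    int z = idx / (N * N), y = (idx / N) % N, x = idx % N;

    float d = 0.0f;
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                d += terma[(k * N + j) * N + i] * kernelValue(x - i, y - j, z - k);
    dose[idx] = d;
}
```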
DOE Office of Scientific and Technical Information (OSTI.GOV)
Setiani, Tia Dwi, E-mail: tiadwisetiani@gmail.com; Suprijadi; Nuclear Physics and Biophysics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Jalan Ganesha 10, Bandung 40132
Monte Carlo (MC) is one of the most powerful techniques for simulation in x-ray imaging. The MC method can simulate radiation transport within matter with high accuracy and provides a natural way to simulate radiation transport in complex systems. One of the MC-based codes widely used for radiographic image simulation is MC-GPU, a code developed by Andreu Badal. This study aimed to investigate the computation time of x-ray imaging simulation on a GPU (Graphics Processing Unit) compared to a standard CPU (Central Processing Unit). Furthermore, the effect of physical parameters on the quality of radiographic images and the comparison of image quality resulting from simulation on the GPU and CPU are evaluated in this paper. The simulations were run on a CPU in serial condition, and on two GPUs with 384 cores and 2304 cores. In the GPU simulations, each core calculates one photon, so a large number of photons are calculated simultaneously. Results show that the simulation times on the GPU were significantly shorter than on the CPU. The simulations on the 2304-core GPU were performed about 64–114 times faster than on the CPU, while the simulations on the 384-core GPU were performed about 20–31 times faster than on a single CPU core. Another result shows that optimum image quality was obtained from 10⁸ histories upward and photon energies from 60 keV to 90 keV. Analyzed statistically, the quality of the GPU and CPU images is essentially the same.
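The one-photon-per-core (in CUDA terms, per-thread) pattern is easy to sketch with cuRAND. The transport physics below is reduced to a single exponential free-path sample, a stand-in for the full interaction sampling of a real code such as MC-GPU; all names are illustrative:

```cuda
#include <curand_kernel.h>

// Each thread seeds its own RNG stream and tracks one photon history.
__global__ void photonHistories(unsigned long long seed, int nPhotons,
                                float mu, float* pathLengthOut)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= nPhotons) return;

    curandState rng;
    curand_init(seed, id, 0, &rng);      // independent stream per thread

    // Sample a free path from p(s) = mu * exp(-mu * s).
    float s = -logf(curand_uniform(&rng)) / mu;
    pathLengthOut[id] = s;               // stand-in for scoring a tally
}
```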
Monte Carlo dose calculations for high-dose-rate brachytherapy using GPU-accelerated processing.
Tian, Z; Zhang, M; Hrycushko, B; Albuquerque, K; Jiang, S B; Jia, X
2016-01-01
Current clinical brachytherapy dose calculations are typically based on the Association of American Physicists in Medicine Task Group report 43 (TG-43) guidelines, which approximate patient geometry as an infinitely large water phantom. This ignores patient and applicator geometries and heterogeneities, causing dosimetric errors. Although Monte Carlo (MC) dose calculation is commonly recognized as the most accurate method, its long computational time is a major bottleneck for routine clinical applications. This article presents our recent development of a fast MC dose calculation package for high-dose-rate (HDR) brachytherapy, gBMC, built on a graphics processing unit (GPU) platform. gBMC simulates photon transport in voxelized geometry, with the physics relevant to the ¹⁹²Ir HDR brachytherapy energy range considered. A phase-space file was used as the source model. GPU-based parallel computation was used to transport multiple photons simultaneously, one per GPU thread. We validated gBMC by comparing dose calculation results in water with those computed with TG-43. We also studied heterogeneous phantom cases and a patient case, comparing gBMC results with Acuros BV results. The radial dose function in water calculated by gBMC showed <0.6% relative difference from the TG-43 data, and the difference in the anisotropy function was <1%. In two heterogeneous slab phantoms and one shielded cylinder applicator case, the average dose discrepancy between gBMC and Acuros BV was <0.87%. For a tandem and ovoid patient case, good agreement between gBMC and Acuros BV results was observed in both isodose lines and dose-volume histograms. In terms of efficiency, it took ∼47.5 seconds for gBMC to reach 0.15% statistical uncertainty within the 5% isodose line for the patient case. The accuracy and efficiency of this new GPU-based MC dose calculation package, gBMC, for HDR brachytherapy make it attractive for clinical applications. Copyright © 2016 American Brachytherapy Society. Published by Elsevier Inc. All rights reserved.
Fast generation of computer-generated hologram by graphics processing unit
NASA Astrophysics Data System (ADS)
Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi
2009-02-01
A cylindrical hologram is well known to be viewable over 360 deg. Such a hologram requires high pixel resolution, so a Computer-Generated Cylindrical Hologram (CGCH) requires a huge amount of calculation. In our previous research, we used a look-up table method for fast calculation with an Intel Pentium 4 at 2.8 GHz. It took 480 hours to calculate a high-resolution CGCH (504,000 x 63,000 pixels with an average of 27,000 object points). To improve the quality of the CGCH reconstructed image, the fringe pattern requires higher spatial frequency and resolution; therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of a CGCH (912,000 x 108,000 pixels), we employ a Graphics Processing Unit (GPU); it took 4,406 hours to calculate this high-resolution CGCH on a Xeon at 3.4 GHz. Since the GPU has many streaming processors and a parallel processing structure, it works as a high-performance parallel processor, giving maximum performance on two-dimensional and streaming data. Recently, GPUs have been utilized for general-purpose computation (GPGPU). For example, NVIDIA's GeForce 7 series became a programmable processor with the Cg programming language, and the subsequent GeForce 8 series supports CUDA, a software development kit made by NVIDIA. Theoretically, the calculation capability of the GPU is quoted as 500 GFLOPS. From the experimental results, we achieved a calculation 47 times faster than our previous CPU-based work; the CGCH can therefore be generated in 95 hours, and the total time to calculate and print the CGCH is 110 hours.
NASA Astrophysics Data System (ADS)
Cai, Yong; Cui, Xiangyang; Li, Guangyao; Liu, Wenyang
2018-04-01
The edge-smooth finite element method (ES-FEM) can improve the computational accuracy of triangular shell elements and the mesh partition efficiency of complex models. In this paper, an approach is developed to perform explicit finite element simulations of contact-impact problems with a graphical processing unit (GPU) using a special edge-smooth triangular shell element based on ES-FEM. Of critical importance for this problem is achieving finer-grained parallelism to enable efficient data loading and to minimize communication between the device and host. Four kinds of parallel strategies are then developed to efficiently solve these ES-FEM based shell element formulas, and various optimization methods are adopted to ensure aligned memory access. Special focus is dedicated to developing an approach for the parallel construction of edge systems. A parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty function calculation method are embedded in this parallel explicit algorithm. Finally, the program flow is well designed, and a GPU-based simulation system is developed, using Nvidia's CUDA. Several numerical examples are presented to illustrate the high quality of the results obtained with the proposed methods. In addition, the GPU-based parallel computation is shown to significantly reduce the computing time.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Y; Liu, B; Liang, B
Purpose: The current CyberKnife treatment planning system (TPS) provides two dose calculation algorithms: ray-tracing and Monte Carlo. The ray-tracing algorithm is fast but less accurate, and it cannot handle the irregular fields enabled by the multi-leaf collimator recently introduced with the CyberKnife M6 system. The Monte Carlo method has well-known accuracy, but the current version still takes a long time to finish dose calculations. The purpose of this paper is to develop a GPU-based fast C/S dose engine for the CyberKnife system that achieves both accuracy and efficiency. Methods: The TERMA distribution from a poly-energetic source was calculated in the beam's-eye-view coordinate system, which is GPU friendly and has linear complexity. The dose distribution was then computed by inversely collecting the energy depositions from all TERMA points along 192 collapsed-cone directions. EGSnrc user code was used to pre-calculate energy deposition kernels (EDKs) for a series of mono-energetic photons. The energy spectrum was reconstructed from the measured tissue maximum ratio (TMR) curve, and the TERMA-averaged cumulative kernels were then calculated. Beam-hardening parameters and intensity profiles were optimized based on measurement data from the CyberKnife system. Results: The differences between measured and calculated TMR are less than 1% for all collimators except in the build-up regions. The calculated profiles also showed good agreement with the measured doses, within 1% except in the penumbra regions. The developed C/S dose engine was also used to evaluate four clinical CyberKnife treatment plans; the results showed better dose calculation accuracy than the ray-tracing algorithm when compared against the Monte Carlo method for heterogeneous cases. Regarding calculation time, one beam takes several seconds, depending on collimator size and dose calculation grid. Conclusion: A GPU-based C/S dose engine has been developed for the CyberKnife system; it was shown to be efficient and accurate for clinical purposes and can be easily implemented in a TPS.
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Giantsoudi, D; Schuemann, J; Dowdell, S
Purpose: For proton radiation therapy, Monte Carlo simulation (MCS) methods are recognized as the gold-standard dose calculation approach. Although previously unrealistic due to limitations in available computing power, GPU-based applications allow MCS of proton treatment fields to be performed in routine clinical use, on time scales comparable to those of conventional pencil-beam algorithms. This study focuses on validating the results of our GPU-based code (gPMC) against a fully implemented proton therapy MCS code (TOPAS) for clinical patient cases. Methods: Two treatment sites were selected to provide clinical cases for this study: head-and-neck cases due to anatomical geometric complexity (air cavities and density heterogeneities), which makes dose calculation very challenging, and prostate cases due to the higher proton energies used and the close proximity of the treatment target to sensitive organs at risk. Both gPMC and TOPAS were used to calculate three-dimensional dose distributions for all patients in this study. Comparisons were performed based on target coverage indices (mean dose, V90 and D90) and gamma index distributions for 2% of the prescription dose and 2 mm. Results: For seven out of eight studied cases, mean target dose, V90 and D90 differed by less than 2% between the TOPAS and gPMC dose distributions. Gamma index analysis for all prostate patients resulted in passing rates of more than 99% of voxels in the target. Four out of five head-and-neck cases showed gamma index passing rates for the target of more than 99%, the fifth having a passing rate of 93%. Conclusion: Our current work showed excellent agreement between our GPU-based MCS code and a fully implemented proton therapy MCS code for a group of dosimetrically challenging patient cases.
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Xeon Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to GPU and Phi systems. Double-precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations occurring on the accelerator and/or the host. For data transfers over PCIe, the GPU provides the best overall performance for data sizes up to 4096 MB, with consistent upload and download rates of 5–5.6 GB/s and 5.4–6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
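For reference, a host-side sketch of the offloaded double-precision GEMM that dominates these profiles, using cuBLAS (the library calls are real cuBLAS API; the wrapper itself is illustrative, and error checking is omitted). A size-based dispatcher of the kind the paper explores would simply fall back to a host BLAS call below some problem-size threshold:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Offload C = A * B for column-major n x n matrices.
void gemmOnGpu(const double* A, const double* B, double* C, int n)
{
    double *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(double);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);   // upload over PCIe
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);    // DGEMM on the device

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);   // download result
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```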
NASA Astrophysics Data System (ADS)
Qin, Cheng-Zhi; Zhan, Lijun
2012-06-01
As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in the hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction (SFD) algorithm. However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first strategy, which has been used in the existing parallel SFD algorithm on a GPU, suffers from redundant computation. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.
GPU based contouring method on grid DEM data
NASA Astrophysics Data System (ADS)
Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong
2017-08-01
This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. Previous contouring approaches often use the CPU to construct a finite element mesh from the raw DEM data and then extract contour segments from the elements; they also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be CPU-intensive and time-consuming, and the generated contours can be unsmooth if the raw data are sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then, in the same frame, segments are extracted from the triangles by the geometry shader and transferred to the CPU side, with an internal ordering, in the GPU's transform feedback stage. Finally, we propose a "Grid Sorting" algorithm that achieves continuous contour lines by traversing the segments only once. Our method makes use of multiple stages of the GPU pipeline for computation, generates smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with the OpenGL 3.3 API or higher on consumer-level PCs.
Validation of GPU based TomoTherapy dose calculation engine.
Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond
2012-04-01
The graphics processing unit (GPU) based TomoTherapy convolution/superposition (C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes make the GPU dose slightly different from the CPU-cluster dose. Before the commercial release of the GPU dose engine, its accuracy had to be validated. Thirty-eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified against absolute point dose measurements with an ion chamber and against film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantoms and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) < 1. The worst case observed in the phantoms had 0.22% of voxels violating the criterion; in patient cases, the worst percentage of voxels violating the criterion was 0.57%. For absolute point dose verification, all cases agreed with measurement to within ±3%, with average error magnitude within 1%. All cases passed the acceptance criterion that more than 95% of the pixels have Γ(3%, 3 mm) < 1 in film measurement, and the average passing pixel percentage was 98.5%–99%. The GPU dose engine also showed a similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, T; Lin, H; Xu, X
Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore software micro-optimization methods. Methods: The patient-specific source of the Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analytically derived them from the raw patient-independent PS data in IAEA's database, partial geometry information of the jaw and MLC, as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by the ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, focusing on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double precision, the total wall time of the multithreaded CPU code on a X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam case, while on three 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. Conclusion: We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine, to Tomotherapy and TrueBeam dose calculations.
Cost-effective GPU-grid for genome-wide epistasis calculations.
Pütz, B; Kam-Thong, T; Karbalai, N; Altmann, A; Müller-Myhsok, B
2013-01-01
Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but can only be understood when epistasis is taken into account [1]. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become computationally feasible. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house, and aimed to build a local GPU-based grid structure. Sequential epistasis calculations were ported to GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regard to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards carrying highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows performance comparable to up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available, and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
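A hedged sketch of how such a pairwise scan parallelizes: one GPU thread per SNP pair, with a deliberately simplified interaction score (real epistasis tools use regression- or contingency-table-based statistics; all names here are invented):

```cuda
// Thread (i, j) with i < j scores the interaction of SNPs i and j across all
// samples; the score is just the covariance between the genotype product and
// a mean-centred phenotype, used as a toy stand-in for an epistasis statistic.
__global__ void pairScan(const float* geno,   // [nSnps x nSamples], 0/1/2 coded
                         const float* pheno,  // [nSamples], mean-centred
                         float* score,        // [nSnps x nSnps], upper triangle
                         int nSnps, int nSamples)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSnps || j >= nSnps || i >= j) return;

    float acc = 0.0f;
    for (int s = 0; s < nSamples; ++s)
        acc += geno[i * nSamples + s] * geno[j * nSamples + s] * pheno[s];
    score[i * nSnps + j] = acc / nSamples;
}
```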
High Performance Computing of Meshless Time Domain Method on Multi-GPU Cluster
NASA Astrophysics Data System (ADS)
Ikuno, Soichiro; Nakata, Susumu; Hirokawa, Yuta; Itoh, Taku
2015-01-01
High performance computing of the Meshless Time Domain Method (MTDM) on multiple GPUs using the supercomputer HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) at the University of Tsukuba is investigated. Generally, the finite difference time domain (FDTD) method is adopted for numerical simulation of electromagnetic wave propagation phenomena. However, the numerical domain must be divided into rectangular meshes, and it is difficult to apply the method to problems in complex domains. On the other hand, MTDM can easily handle such problems because it does not require meshes. In the present study, we implement MTDM on a multi-GPU cluster to speed up the method and numerically investigate its performance. To reduce the computation time, the communication time between the decomposed domains is hidden behind the perfectly matched layer (PML) calculation procedure. The results show that MTDM on 128 GPUs is 173 times faster than a single-CPU calculation.
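The communication-hiding idea (overlapping the inter-domain exchange with the PML/interior work) is commonly expressed with CUDA streams. A self-contained sketch with a trivial placeholder stencil instead of the MTDM update; function and buffer names are illustrative:

```cuda
#include <cuda_runtime.h>

// Placeholder interior update: touches only cells [2, n-3], so it cannot
// race with the ghost-cell traffic at f[0] on the other stream.
__global__ void updateInterior(float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 1 && i < n - 2)
        f[i] += 0.25f * (f[i - 1] - 2.0f * f[i] + f[i + 1]);
}

// One time step: bulk update runs on `comp` while the boundary value is
// exported, exchanged, and re-imported as a ghost on `comm`.
void step(float* fD, float* haloHost, int n,
          cudaStream_t comp, cudaStream_t comm)
{
    updateInterior<<<(n + 255) / 256, 256, 0, comp>>>(fD, n);
    cudaMemcpyAsync(haloHost, fD, sizeof(float),
                    cudaMemcpyDeviceToHost, comm);   // export boundary f[0]
    // ... exchange haloHost with the neighbouring GPU/rank here (e.g. MPI) ...
    cudaMemcpyAsync(fD, haloHost, sizeof(float),
                    cudaMemcpyHostToDevice, comm);   // import ghost value
    cudaStreamSynchronize(comm);
    cudaStreamSynchronize(comp);
}
```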
Real-time liquid-crystal atmosphere turbulence simulator with graphic processing unit.
Hu, Lifa; Xuan, Li; Li, Dayu; Cao, Zhaoliang; Mu, Quanquan; Liu, Yonggang; Peng, Zenghui; Lu, Xinghai
2009-04-27
To generate time-evolving atmosphere turbulence in real time, a phase-generating method for our liquid-crystal (LC) atmosphere turbulence simulator (ATS) is derived based on the Fourier series (FS) method. A real matrix expression for generating turbulence phases is given and calculated with a graphic processing unit (GPU), the GeForce 8800 Ultra. A liquid crystal on silicon (LCOS) with 256x256 pixels is used as the turbulence simulator. The total time to generate a turbulence phase is about 7.8 ms for calculation and readout with the GPU. A parallel processing method of calculating and sending a picture to the LCOS is used to improve the simulating speed of our LC ATS. Therefore, the real-time turbulence phase-generation frequency of our LC ATS is up to 128 Hz. To our knowledge, it is the highest speed used to generate a turbulence phase in real time.
cellGPU: Massively parallel simulations of dynamic vertex models
NASA Astrophysics Data System (ADS)
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
Porting ONETEP to graphical processing unit-based coprocessors. 1. FFT box operations.
Wilkinson, Karl; Skylaris, Chris-Kriton
2013-10-30
We present the first graphical processing unit (GPU) coprocessor-enabled version of the Order-N Electronic Total Energy Package (ONETEP) code for linear-scaling first principles quantum mechanical calculations on materials. This work focuses on porting to the GPU the parts of the code that involve atom-localized fast Fourier transform (FFT) operations. These are among the most computationally intensive parts of the code and are used in core algorithms such as the calculation of the charge density, the local potential integrals, the kinetic energy integrals, and the nonorthogonal generalized Wannier function gradient. We have found that direct porting of the isolated FFT operations did not provide any benefit. Instead, it was necessary to tailor the port to each of the aforementioned algorithms to optimize data transfer to and from the GPU. A detailed discussion of the methods used and tests of the resulting performance are presented, which show that individual steps in the relevant algorithms are accelerated by a significant amount. However, the transfer of data between the GPU and host machine is a significant bottleneck in the reported version of the code. In addition, an initial investigation into a dynamic precision scheme for the ONETEP energy calculation has been performed to take advantage of the enhanced single precision capabilities of GPUs. The methods used here result in no disruption to the existing code base. Furthermore, as the developments reported here concern the core algorithms, they will benefit the full range of ONETEP functionality. Our use of a directive-based programming model ensures portability to other forms of coprocessors and will allow this work to form the basis of future developments to the code designed to support emerging high-performance computing platforms. Copyright © 2013 Wiley Periodicals, Inc.
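A sketch of the batched small-box FFTs that this kind of port revolves around, using the cuFFT batched-plan API (the library calls are real; box sizes and batching are illustrative, not ONETEP's actual values):

```cuda
#include <cufft.h>

// Transform nBoxes identical nx x ny x nz boxes in one batched call so the
// GPU stays busy even though each individual FFT box is small.
void fftBoxForward(cufftDoubleComplex* d_boxes,
                   int nx, int ny, int nz, int nBoxes)
{
    cufftHandle plan;
    int dims[3] = { nx, ny, nz };
    cufftPlanMany(&plan, 3, dims,
                  NULL, 1, nx * ny * nz,     // input: packed, contiguous boxes
                  NULL, 1, nx * ny * nz,     // output: same layout
                  CUFFT_Z2Z, nBoxes);        // double-complex, batched
    cufftExecZ2Z(plan, d_boxes, d_boxes, CUFFT_FORWARD);  // in-place
    cufftDestroy(plan);
}
```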
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.
Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart
2011-06-01
The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (a two-order-of-magnitude speedup) at a PC-level cost. Specific details of implementing the FDTD method on a GPU architecture are presented, and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity, and 20% reduction in loop coupling compared with conventional array structures of the same size) for small animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied for both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
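As a minimal illustration of why FDTD parallelizes so well, here is a 2D (TMz) Yee-style field update with one thread per cell, using simplified uniform-grid coefficients and no absorbing boundaries (not the authors' code):

```cuda
// Update Ez from the discrete curl of H; Hx and Hy are the magnetic field
// components on the staggered Yee grid, and c lumps dt/(eps*dx) together.
__global__ void updateEz(float* Ez, const float* Hx, const float* Hy,
                         int nx, int ny, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx || j >= ny) return;
    int id = j * nx + i;
    Ez[id] += c * ((Hy[id] - Hy[id - 1])       // dHy/dx
                 - (Hx[id] - Hx[id - nx]));    // dHx/dy
}
```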
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Y; Mazur, T; Green, O
Purpose: To build a fast, accurate and easily-deployable research platform for Monte Carlo dose calculations. We port the dose calculation engine PENELOPE to C++ and accelerate calculations on a GPU. Simulations of a Co-60 beam model provided by ViewRay demonstrate the capabilities of the platform. Methods: We built software that incorporates a beam model interface, a CT-phantom model, a GPU-accelerated PENELOPE engine, and a GUI front-end. We rewrote the PENELOPE kernel in C++ (from Fortran) and accelerated the code on a GPU. We seamlessly integrated a Co-60 beam model (obtained from ViewRay) into our platform. Simulations of various field sizes and SSDs using a homogeneous water phantom generated PDDs, dose profiles, and output factors that were compared to experimental data. Results: With GPU acceleration on a dated graphics card (Nvidia Tesla C2050), a highly accurate simulation – including a 100×100×100 grid, 3×3×3 mm³ voxels, <1% uncertainty, and a 4.2×4.2 cm² field size – runs 24 times faster (20 minutes versus 8 hours) than when parallelizing on 8 threads across a new CPU (Intel i7-4770). Simulated PDDs, profiles and output ratios for the commercial system agree well with experimental data measured using radiographic film or an ionization chamber. Based on our analysis, this beam model is precise enough for general applications. Conclusions: Using a beam model for a Co-60 system provided by ViewRay, we evaluated a dose calculation platform that we developed. Comparison to measurements demonstrates the promise of our software as a research platform for dose calculations, with applications including quality assurance and treatment plan verification.
Nalichowski, Adrian; Burmeister, Jay
2013-07-01
To compare optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans using the new TomoTherapy graphics processing unit (GPU) based dose engine and the CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning treatment volume (PTV) included all the bones from head to mid femur except for the upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities, and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two modalities. The optimization and dose calculation times were compared between the two systems, and the treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for all of the evaluated planning parameters, indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose, with an average difference of 0.8%. A gamma analysis of the GPU plan resulted in over 90% of calculated voxels satisfying the Γ(3%, 3 mm) < 1 criterion compared with the baseline CPU plan; the average percentage of voxels meeting the criterion across all plans was 97%. In terms of dose optimization/calculation efficiency, there was a 20-fold reduction in planning time with the new GPU system: the average optimization/dose calculation time was 579 min with the traditional CPU/cluster based system versus 26.8 min with the GPU based system. There was no difference in the calculated treatment delivery time per fraction; beam-on time varied based on field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-21
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
Persoon, Lucas C G G; Podesta, Mark; van Elmpt, Wouter J C; Nijsten, Sebastiaan M J J G; Verhaegen, Frank
2011-07-01
A widely accepted method to quantify differences in dose distributions is the gamma (γ) evaluation. Currently, almost all gamma implementations utilize the central processing unit (CPU). Recently, the graphics processing unit (GPU) has become a powerful platform for specific computing tasks. In this study, we describe the implementation of a 3D gamma evaluation on a GPU to improve calculation time. The gamma evaluation algorithm was implemented on an NVIDIA Tesla C2050 GPU using the compute unified device architecture (CUDA). First, several cubic virtual phantoms were simulated. These phantoms were tested with varying dose cube sizes and set-ups, introducing artificial dose differences. Second, to show applicability in clinical practice, five patient cases were evaluated using the 3D dose distribution from a treatment planning system as the reference and the delivered dose determined during treatment as the comparison. A calculation time comparison between the CPU and GPU was made with varying thread-block sizes, including the option of using texture or global memory. A GPU over CPU speed-up of 66 ± 12 was achieved for the virtual phantoms. For the patient cases, a speed-up of 57 ± 15 using the GPU was obtained. A thread-block size of 16 × 16 performed best in all cases. The use of texture memory improved the total calculation time, especially when interpolation was applied. Differences between the CPU and GPU gammas were negligible. The GPU and its features, such as texture memory, decreased the calculation time for gamma evaluations considerably without loss of accuracy.
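For reference, the quantity being parallelized is the standard gamma index of Low et al.: each reference point takes the minimum, over evaluated points, of a combined dose-difference and distance-to-agreement measure, and passes when γ ≤ 1:

```latex
\gamma(\mathbf{r}_{\mathrm{ref}}) =
\min_{\mathbf{r}_{\mathrm{eval}}}
\sqrt{\frac{\lVert \mathbf{r}_{\mathrm{eval}} - \mathbf{r}_{\mathrm{ref}} \rVert^{2}}{\Delta d^{2}}
    + \frac{\big( D_{\mathrm{eval}}(\mathbf{r}_{\mathrm{eval}}) - D_{\mathrm{ref}}(\mathbf{r}_{\mathrm{ref}}) \big)^{2}}{\Delta D^{2}}}
```

with \(\Delta d\) the distance-to-agreement criterion (e.g., 3 mm) and \(\Delta D\) the dose-difference criterion (e.g., 3%).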
High performance transcription factor-DNA docking with GPU computing
2012-01-01
Background Protein-DNA docking is a very challenging problem in structural bioinformatics and has important implications in a number of applications, such as structure-based prediction of transcription factor binding sites and rational drug design. Protein-DNA docking is very computationally demanding due to the high cost of energy calculation and the statistical nature of conformational sampling algorithms. More importantly, experiments show that the docking quality depends on the coverage of the conformational sampling space. It is therefore desirable to accelerate the computation of the docking algorithm, not only to reduce computing time, but also to improve docking quality. Methods In an attempt to accelerate the sampling process and to improve the docking performance, we developed a graphics processing unit (GPU)-based protein-DNA docking algorithm. The algorithm employs a potential-based energy function to describe the binding affinity of a protein-DNA pair, and integrates Monte-Carlo simulation and a simulated annealing method to search through the conformational space. Algorithmic techniques were developed to improve the computation efficiency and scalability on GPU-based high performance computing systems. Results The effectiveness of our approach is tested on a non-redundant set of 75 TF-DNA complexes and a newly developed TF-DNA docking benchmark. We demonstrated that the GPU-based docking algorithm can significantly accelerate the simulation process and thereby improve the chance of finding near-native TF-DNA complex structures. This study also suggests that further improvement in protein-DNA docking research would require efforts from two integral aspects: improvement in computation efficiency and energy function design. Conclusions We present a high performance computing approach for improving the prediction accuracy of protein-DNA docking. The GPU-based docking algorithm accelerates the search of the conformational space and thus increases the chance of finding more near-native structures. To the best of our knowledge, this is the first effort at applying GPUs or GPU clusters to the protein-DNA docking problem. PMID:22759575
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wan Chan Tseung, H; Ma, J; Ma, D
2015-06-15
Purpose: To demonstrate the feasibility of fast Monte Carlo (MC) based biological planning for the treatment of thyroid tumors in spot-scanning proton therapy. Methods: Recently, we developed a fast and accurate GPU-based MC simulation of proton transport that was benchmarked against Geant4.9.6 and used as the dose calculation engine in a clinically-applicable GPU-accelerated IMPT optimizer. Besides dose, it can simultaneously score the dose-averaged LET (LETd), which makes fast biological dose (BD) estimates possible. To convert from LETd to BD, we used a linear relation based on cellular irradiation data. Given a thyroid patient with a 93cc tumor volume, we created a 2-field IMPT plan in Eclipse (Varian Medical Systems). This plan was re-calculated with our MC to obtain the BD distribution. A second 5-field plan was made with our in-house optimizer, using pre-generated MC dose and LETd maps. Constraints were placed to maintain the target dose to within 25% of the prescription, while maximizing the BD. The plan optimization and calculation of dose and LETd maps were performed on a GPU cluster. The conventional IMPT and biologically-optimized plans were compared. Results: The mean target physical and biological doses from our biologically-optimized plan were, respectively, 5% and 14% higher than those from the MC re-calculation of the IMPT plan. Dose sparing to critical structures in our plan was also improved. The biological optimization, including the initial dose and LETd map calculations, can be completed in a clinically viable time (∼30 minutes) on a cluster of 25 GPUs. Conclusion: Taking advantage of GPU acceleration, we created an MC-based, biologically optimized treatment plan for a thyroid patient. Compared to a standard IMPT plan, a 5% increase in the target's physical dose resulted in ∼3 times as much increase in the BD. Biological planning was thus effective in escalating the target BD.
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing
NASA Astrophysics Data System (ADS)
Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.
2014-12-01
After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous with the progress of high-accuracy observation networks, and many parameters (e.g., positional information, origin time, magnitude, etc.) must be handled to display seismic information efficiently. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, the GPU (Graphic Processing Unit) has been used as an acceleration tool for data processing and calculation in various study fields; this movement is called GPGPU (General Purpose computing on GPUs). In the last few years the performance of GPUs has kept improving rapidly, so GPU computing offers a high-performance computing environment at a lower cost than before. Moreover, use of the GPU has an advantage for visualization of processed data, because the GPU was originally designed for graphics processing: in GPU computing, the processed data are always stored in video memory. Therefore, we can directly write drawing information to the VRAM on the video card by combining CUDA and a graphics API. In this study, we employ CUDA together with OpenGL and/or DirectX to realize a full-GPU implementation. This approach makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer, enabling high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and its feasibility for a high-speed visualization system for hypocenter data.
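Writing drawing data straight into VRAM, as described above, is what the CUDA-OpenGL interop API provides: register a GL buffer once, then map it each frame so a kernel can fill it without a PCIe round trip. A sketch with an invented placeholder kernel (the interop calls are real API; everything else is illustrative):

```cuda
#include <GL/gl.h>
#include <cuda_gl_interop.h>

__global__ void writeHypocenters(float4* v, int n)   // placeholder vertex writer
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = make_float4(0.f, 0.f, 0.f, 1.f);  // real code: lon/lat/depth/mag
}

void drawFrame(GLuint vbo, int nEvents)
{
    static cudaGraphicsResource* res = nullptr;
    if (!res)   // register the GL vertex buffer with CUDA once
        cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsMapFlagsWriteDiscard);

    float4* verts; size_t bytes;
    cudaGraphicsMapResources(1, &res, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&verts, &bytes, res);
    writeHypocenters<<<(nEvents + 255) / 256, 256>>>(verts, nEvents);
    cudaGraphicsUnmapResources(1, &res, 0);
    // OpenGL then renders from the same buffer,
    // e.g. glDrawArrays(GL_POINTS, 0, nEvents);
}
```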
Mitra, Ayan; Politte, David G; Whiting, Bruce R; Williamson, Jeffrey F; O'Sullivan, Joseph A
2017-01-01
Model-based image reconstruction (MBIR) techniques have the potential to generate high quality images from noisy measurements and a small number of projections, which can reduce the x-ray dose to patients. These MBIR techniques rely on projection and backprojection to refine an image estimate. One of the widely used projectors for these modern MBIR-based techniques is the branchless distance-driven (DD) projector and backprojector. While this method produces superior quality images, the computational cost of iterative updates keeps it from being ubiquitous in clinical applications. In this paper, we provide several new parallelization ideas for concurrent execution of the DD projectors on multi-GPU systems using CUDA programming tools. We have introduced some novel schemes for dividing the projection data and image voxels over multiple GPUs to avoid runtime overhead and inter-device synchronization issues. We have also reduced the complexity of the overlap calculation of the algorithm by eliminating the common projection plane and directly projecting the detector boundaries onto image voxel boundaries. To reduce the time required for calculating the overlap between the detector edges and image voxel boundaries, we have proposed a pre-accumulation technique to accumulate image intensities in perpendicular 2D image slabs (from a 3D image) before projection and after backprojection to ensure our DD kernels run faster in parallel GPU threads. For the implementation of our iterative MBIR technique we use a parallel multi-GPU version of the alternating minimization (AM) algorithm with penalized likelihood update. The time performance using our proposed reconstruction method with Siemens Sensation 16 patient scan data shows an average speedup of 24 times using a single TITAN X GPU and 74 times using 3 TITAN X GPUs in parallel, for combined projection and backprojection.
A GPU-accelerated implicit meshless method for compressible flows
NASA Astrophysics Data System (ADS)
Zhang, Jia-Le; Ma, Zhi-Hua; Chen, Hong-Quan; Cao, Cheng
2018-05-01
This paper develops a recently proposed GPU-based two-dimensional explicit meshless method (Ma et al., 2014) by devising and implementing an efficient parallel LU-SGS implicit algorithm to further improve the computational efficiency. The capability of the original 2D meshless code is extended to deal with 3D complex compressible flow problems. To resolve the inherent data dependency of the standard LU-SGS method, which causes race conditions among threads that destabilize the numerical computation, a generic rainbow coloring method is presented and applied to organize the computational points into different groups by painting neighboring points with different colors. The original LU-SGS method is modified and parallelized accordingly to perform calculations in a color-by-color manner. The CUDA Fortran programming model is employed to develop the key kernel functions that apply boundary conditions, calculate time steps, evaluate residuals, and advance and update the solution in time. A series of two- and three-dimensional test cases, including compressible flows over single- and multi-element airfoils and an M6 wing, are carried out to verify the developed code. The obtained solutions agree well with experimental data and other computational results reported in the literature. Detailed analysis of the performance of the developed code reveals that the CPU-based implicit meshless method is at least four to eight times faster than its explicit counterpart, and the computational efficiency of the implicit method can be further improved by ten to fifteen times on the GPU.
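The color-by-color sweep structure is easy to sketch: points sharing a color have no neighbors in common, so each color's relaxation can run fully in parallel, while the colors themselves are processed in sequence to respect the LU-SGS dependency. A hedged sketch in CUDA C++ (the paper uses CUDA Fortran; the update body here is a trivial placeholder):

```cuda
// Relax all points of one color concurrently; `pointsOfColor` lists their
// indices on the device.
__global__ void relaxColor(const int* pointsOfColor, int nPts,
                           float* u, const float* rhs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nPts) return;
    int p = pointsOfColor[t];
    u[p] = rhs[p];   // placeholder for the actual LU-SGS point update
}

// Colors run in order; within a color everything is parallel.
void lusgsSweep(int nColors, const int* const* colorLists,
                const int* colorSizes, float* u, const float* rhs)
{
    for (int c = 0; c < nColors; ++c) {
        int n = colorSizes[c];
        relaxColor<<<(n + 255) / 256, 256>>>(colorLists[c], n, u, rhs);
    }
}
```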
GPU Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularity in scientific computing over the past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculation, so a recent GPU card, whose peak computing performance and memory bandwidth far exceed those of a contemporary high-end CPU, is preferable. We herein focus on the detailed implementation of our GPU-targeted Reynolds-averaged Navier-Stokes solver based on the finite-volume method. The solver employs a vertex-centered scheme on unstructured grids so as to be capable of handling complex topologies. Multiple optimizations are carried out to improve the memory access performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using an explicit Runge-Kutta scheme. The solver with GPU acceleration in this paper is demonstrated to have competitive advantages over the CPU-targeted one.
A multi-GPU real-time dose simulation software framework for lung radiotherapy.
Santhanam, A P; Min, Y; Neelakkantan, H; Papp, N; Meeks, S L; Kupelian, P A
2012-09-01
Medical simulation frameworks facilitate both the preoperative and postoperative analysis of the patient's pathophysical condition. Of particular importance is the simulation of radiation dose delivery for real-time radiotherapy monitoring and retrospective analyses of the patient's treatment. In this paper, a software framework tailored for the development of simulation-based real-time radiation dose monitoring medical applications is discussed. A multi-GPU-based computational framework coupled with inter-process communication methods is introduced for simulating the radiation dose delivery on a deformable 3D volumetric lung model and its real-time visualization. The model deformation and the corresponding dose calculation are allocated among the GPUs in a task-specific manner and executed in a pipelined fashion. Radiation dose calculations are computed on two different GPU hardware architectures. The integration of this computational framework with a front-end software layer and back-end patient database repository is also discussed. Real-time simulation of the dose delivered is achieved once every 120 ms using the proposed framework. With a linear increase in the number of GPU cores, the computational time of the simulation decreased linearly. The inter-process communication time also improved with an increase in the hardware memory. Variations in the delivered dose and computational speedup for variations in the data dimensions are investigated using D70 and D90 as well as gEUD as metrics for a set of 14 patients. Computational speed-up increased with an increase in the beam dimensions when compared with commercial CPU-based software, while the error in the dose calculation was <1%. Our analyses show that the framework applied to deformable lung model-based radiotherapy is an effective tool for performing both real-time and retrospective analyses.
Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures
2015-01-01
Background: Hardware accelerators can provide solutions to computationally complex problems in bioinformatics; however, the effect of acceleration depends on the nature of the application, so selecting an appropriate accelerator requires some consideration. Results: In the present study, we compared the effects of acceleration using a graphics processing unit (GPU) and many integrated core (MIC) on the speed of fast Fourier transform (FFT)-based protein-protein docking calculations. The GPU implementation performed the protein-protein docking calculations approximately five times faster than the MIC offload-mode implementation. The MIC native-mode implementation has an advantage in implementation cost; however, its performance was worse with larger protein pairs because of memory limitations. Conclusion: The results suggest that the GPU is more suitable than the MIC for accelerating FFT-based protein-protein docking applications. PMID:25707855
Giantsoudi, Drosoula; Schuemann, Jan; Jia, Xun; Dowdell, Stephen; Jiang, Steve; Paganetti, Harald
2015-03-21
Monte Carlo (MC) methods are recognized as the gold standard for dose calculation; however, they have not yet replaced analytical methods due to their lengthy calculation times. GPU-based applications allow MC dose calculations to be performed on time scales comparable to conventional analytical algorithms. This study focuses on validating our GPU-based MC code for proton dose calculation (gPMC) against an experimentally validated multi-purpose MC code (TOPAS) and comparing their performance for clinical patient cases. Clinical cases from five treatment sites were selected, covering the full range from very homogeneous patient geometries (liver) to patients with high geometrical complexity (air cavities and density heterogeneities in head-and-neck and lung patients) and from short beam range (breast) to large beam range (prostate). Both gPMC and TOPAS were used to calculate 3D dose distributions for all patients. Comparisons were performed based on target coverage indices (mean dose, V95, D98, D50, D02) and gamma index distributions. Dosimetric indices differed by less than 2% between TOPAS and gPMC dose distributions for most cases. Gamma index analysis with a 1%/1 mm criterion resulted in a passing rate of more than 94% of all patient voxels receiving more than 10% of the mean target dose, for all patients except the prostate cases. Although clinically insignificant, gPMC systematically underestimated the target dose for prostate cases by 1-2% compared to TOPAS. Correspondingly, the gamma index analysis with a 1%/1 mm criterion failed for most beams for this site, while for a 2%/1 mm criterion passing rates of more than 94.6% of all patient voxels were observed. For the same initial number of simulated particles, the calculation time for a single beam of a typical head and neck patient plan decreased from 4 CPU hours per million particles (2.8-2.9 GHz Intel X5600) for TOPAS to 2.4 s per million particles (NVIDIA TESLA C2075) for gPMC. Excellent agreement was demonstrated between our fast GPU-based MC code (gPMC) and a previously extensively validated multi-purpose MC code (TOPAS) for a comprehensive set of clinical patient cases. This shows that MC dose calculations in proton therapy can be performed on time scales comparable to analytical algorithms with accuracy comparable to state-of-the-art CPU-based MC codes.
3D brain tumor localization and parameter estimation using thermographic approach on GPU.
Bousselham, Abdelmajid; Bouattane, Omar; Youssfi, Mohamed; Raihani, Abdelhadi
2018-01-01
The aim of this paper is to present a parallel GPU algorithm for brain tumor detection that estimates tumor size and location from the surface temperature distribution obtained by thermography. The normal brain tissue is modeled as a rectangular cube containing a spherical tumor. The temperature distribution is calculated using the forward three-dimensional Pennes bioheat transfer equation, which is solved using a massively parallel finite difference method (FDM) implemented on a graphics processing unit (GPU). A genetic algorithm (GA) is used to solve the inverse problem and estimate the tumor size and location by minimizing an objective function comparing measured surface temperatures to those obtained by numerical simulation. The parallel implementation of the FDM significantly reduces the bioheat transfer simulation time and greatly accelerates the inverse identification of the thermophysical and geometrical properties of the brain tumor. Experimental results show significant gains in computational speed on the GPU, achieving a speedup of around 41 compared to the CPU. An analysis of estimation performance as a function of tumor size inside the brain tissue is also presented.
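For reference, a single explicit finite-difference update of the Pennes bioheat equation, rho*c dT/dt = k lap(T) + w_b rho_b c_b (T_a - T) + Q_m, of the kind parallelized in this work, might look as follows on the GPU. Uniform grid spacing, homogeneous coefficients, forward Euler time stepping, and all names are our simplifying assumptions.

    // Minimal explicit FDM step for the Pennes bioheat equation (sketch).
    #include <cuda_runtime.h>

    __global__ void pennesStep(const float* T, float* Tnew,
                               int nx, int ny, int nz,
                               float h, float dt, float rhoC, float k,
                               float perf /* w_b*rho_b*c_b */, float Ta, float Qm)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x <= 0 || y <= 0 || z <= 0 || x >= nx-1 || y >= ny-1 || z >= nz-1)
            return;                                   // skip boundary nodes
        long i = ((long)z * ny + y) * nx + x;
        float lap = (T[i-1] + T[i+1] + T[i-nx] + T[i+nx]
                   + T[i-(long)nx*ny] + T[i+(long)nx*ny] - 6.0f * T[i]) / (h*h);
        float dTdt = (k * lap + perf * (Ta - T[i]) + Qm) / rhoC;
        Tnew[i] = T[i] + dt * dTdt;                   // forward Euler step
    }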
GPU Accelerated DG-FDF Large Eddy Simulator
NASA Astrophysics Data System (ADS)
Inkarbekov, Medet; Aitzhan, Aidyn; Sammak, Shervin; Givi, Peyman; Kaltayev, Aidarkhan
2017-11-01
A GPU-accelerated simulator is developed and implemented for large eddy simulation (LES) of turbulent flows. The filtered density function (FDF) is utilized for modeling of the subgrid-scale quantities. The filtered transport equations are solved via a discontinuous Galerkin (DG) method and the FDF is simulated via a particle-based Lagrangian Monte Carlo (MC) method. It is demonstrated that the GPU simulations are of the order of 100 times faster than the CPU-based calculations. This brings LES of turbulent flows to a new level, facilitating efficient simulation of more complex problems. The work at Al-Farabi Kazakh National University is sponsored by MoES of RK under Grant 3298/GF-4.
Rana, Vijay; Rudin, Stephen; Bednarek, Daniel R.
2012-01-01
We have developed a dose-tracking system (DTS) that calculates the radiation dose to the patient's skin in real-time by acquiring exposure parameters and imaging-system geometry from the digital bus on a Toshiba Infinix C-arm unit. The cumulative dose values are then displayed as a color map on an OpenGL-based 3D graphic of the patient for immediate feedback to the interventionalist. Determination of those elements on the surface of the patient 3D graphic that intersect the beam, and calculation of the dose for these elements in real time, demand fast computation. Reducing the size of the elements places a greater computational load on the processor, and therefore a tradeoff occurs between the resolution of the patient graphic and the real-time performance of the DTS. The speed of the DTS for calculating dose to the skin is limited by the central processing unit (CPU) and can be improved by using the parallel processing power of a graphics processing unit (GPU). Here, we compare the performance speed of GPU-based DTS software to that of the current CPU-based software as a function of the resolution of the patient graphics. Results show a tremendous improvement in speed using the GPU. While an increase in the spatial resolution of the patient graphics slowed the computational speed of the DTS on the CPU, the speed of the GPU-based DTS was hardly affected. This GPU-based DTS can be a powerful tool for providing accurate, real-time feedback about patient skin dose to physicians while performing interventional procedures. PMID:24027616
Fast CPU-based Monte Carlo simulation for radiotherapy dose calculation.
Ziegenhein, Peter; Pirner, Sven; Ph Kamerling, Cornelis; Oelfke, Uwe
2015-08-07
Monte Carlo (MC) simulations are considered the most accurate method for calculating dose distributions in radiotherapy. Their clinical application, however, is still limited by the long runtimes that conventional implementations of MC algorithms require to deliver sufficiently accurate results on high-resolution imaging data. In order to overcome this obstacle we developed the software package PhiMC, which is capable of computing precise dose distributions in a sub-minute time frame by leveraging the potential of modern many- and multi-core CPU-based computers. PhiMC is based on the well-verified dose planning method (DPM). We demonstrate that PhiMC delivers dose distributions in excellent agreement with DPM. The multi-core implementation of PhiMC scales well across different computer architectures and achieves a speed-up of up to 37× compared to the original DPM code executed on a modern system. Furthermore, our CPU-based implementation on a modern workstation is between 1.25× and 1.95× faster than a well-known GPU implementation of the same simulation method on an NVIDIA Tesla C2050. Since CPUs can work with several hundred GB of RAM, the typical GPU memory limitation does not apply to our implementation, and high-resolution clinical plans can be calculated.
A method of gravity and seismic sequential inversion and its GPU implementation
NASA Astrophysics Data System (ADS)
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical formula called the Gardner equation, and for large volumes of data we use the GPU to accelerate the computation. The gravity inversion method is iterative: we first calculate the correlation imaging of the observed gravity anomaly, which takes values between -1 and +1, and multiply this value by a small density increment to form the initial density model. We then compute a forward result with this initial model, calculate the correlation imaging of the misfit between the observed and forward data, multiply it by a small density increment, add it to the current model, and repeat the procedure until we obtain the final inverted density model. The seismic inversion method is based on the linearity of the acoustic wave equation written in the frequency domain; starting from an initial velocity model, we can obtain a good velocity result. In the sequential inversion of gravity and seismic data, we need a link formula to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphics Processing Unit (GPU) has been developed as a co-processor of the CPU for high-performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of traditional general-purpose GPU programming while maintaining a low learning curve for programmers familiar with standard languages such as C. In our processing, we use the GPU to accelerate both the gravity and seismic inversions. Taking the gravity inversion as an example, its kernels are the gravity forward simulation and the correlation imaging; after parallelization on the GPU in the 3D case, the original five CPU loops are reduced to three in the inversion module and to two in the forward module. Acknowledgments: We acknowledge the financial support of the Sinoprobe project (201011039 and 201011049-03), the Fundamental Research Funds for the Central Universities (2010ZY26 and 2011PY0183), the National Natural Science Foundation of China (41074095) and the Open Project of the State Key Laboratory of Geological Processes and Mineral Resources (GPMR0945).
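The kernel that dominates such an inversion loop is the gravity forward simulation. A minimal CUDA sketch, assuming a point-mass approximation for each density cell and observations on the z = 0 surface, is given below; all names are illustrative, not the authors' code.

    // Gravity forward modeling: one thread per observation point (sketch).
    #include <cuda_runtime.h>
    #include <math.h>

    #define G_CONST 6.674e-11f

    __global__ void gravityForward(const float* rho, const float* cx,
                                   const float* cy, const float* cz,
                                   float cellVol, int nCells,
                                   const float* ox, const float* oy,
                                   float* gz, int nObs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nObs) return;
        float sum = 0.0f;
        for (int j = 0; j < nCells; ++j) {            // every cell affects obs i
            float dx = cx[j] - ox[i], dy = cy[j] - oy[i], dz = cz[j]; // obs at z=0
            float r2 = dx*dx + dy*dy + dz*dz;
            sum += rho[j] * cellVol * dz / (r2 * sqrtf(r2));
        }
        gz[i] = G_CONST * sum;                        // vertical component
    }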
Advantages of GPU technology in DFT calculations of intercalated graphene
NASA Astrophysics Data System (ADS)
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphics-processing-unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). The use of GPGPUs as a way to increase computational power in the materials sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain (FDTD) algorithm and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum ESPRESSO (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nano-scale, based on DFT, a plane-wave basis and a pseudopotential approach. Since version 5.0, GPU support has been implemented as a plug-in component for standard QE packages, allowing the capabilities of NVIDIA GPU cards to be exploited (www.qe-forge.org/gf/proj). In this study, we have examined the impact of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene using the QE package PHonon, which employs the GPU. The term 'intercalation' refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs: we reached an acceleration of several times compared to standard CPU calculations.
A real-time spike sorting method based on the embedded GPU.
Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng
2017-07-01
Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. The graphics processing unit (GPU), with its high parallel computing capability, might provide an alternative solution to the increasing real-time computational demands of spike sorting. This study reports a method of real-time spike sorting through the compute unified device architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized within the GPU thread memory model. Our results showed that the GPU-based classifier on the TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while their accuracies were identical. The high-performance computing features of the embedded GPU demonstrated in our studies suggest that the embedded GPU provides a promising platform for real-time neural signal processing.
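As an illustration of the parallelism exploited here, the K-means assignment step after PCA maps naturally to one thread per spike. The sketch below is our own minimal version, not the authors' implementation; dimensions and names are assumed.

    // Nearest-centroid assignment, one spike per thread (sketch).
    #include <cuda_runtime.h>
    #include <float.h>

    __global__ void assignClusters(const float* feat,  // nSpikes x dim (row-major)
                                   const float* cent,  // k x dim centroids
                                   int* label, int nSpikes, int dim, int k)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSpikes) return;
        float best = FLT_MAX; int bestC = 0;
        for (int c = 0; c < k; ++c) {
            float d2 = 0.0f;
            for (int j = 0; j < dim; ++j) {           // squared Euclidean distance
                float d = feat[s * dim + j] - cent[c * dim + j];
                d2 += d * d;
            }
            if (d2 < best) { best = d2; bestC = c; }
        }
        label[s] = bestC;                             // nearest-centroid label
    }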
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-01-01
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high-performance computing (HPC) methods have been presented to accelerate SAR imaging, especially GPU-based methods. In the classical GPU-based imaging algorithm, the GPU is employed to accelerate image processing by massively parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). However, the computing capability of the CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPUs/GPUs is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multi-CPU/GPU computing. For the CPU parallel imaging part, the advanced vector extension (AVX) method is introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfers broken, but several optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging by 270 times over a single-core CPU and realizes real-time imaging, in that the imaging rate outperforms the raw data generation rate. PMID:27070606
Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2015-12-01
We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system for the reference light. To facilitate real-time time-division color electroholography, we developed a light-emitting diode (LED) controller with a universal serial bus (USB) module and a drive circuit for the reference light. A one-chip RGB LED connected to a personal computer via the LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs), for red, green, and blue, in each frame of a three-dimensional (3D) movie. After CGH calculation on the GPU, the CPU synchronizes the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we achieved real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method on various GPUs. The experimental results showed that the proposed method was effective on various GPUs.
NASA Astrophysics Data System (ADS)
Murni; Bustamam, A.; Ernastuti; Handhika, T.; Kerami, D.
2017-07-01
Calculation of matrix-vector multiplication in real-world problems often involves large matrices of arbitrary size. Therefore, parallelization is needed to speed up the calculation process, which usually takes a long time. The graph partitioning techniques discussed in previous studies cannot be used to parallelize matrix-vector multiplication of arbitrary size, because they assume square, symmetric matrices. Hypergraph partitioning techniques overcome this shortcoming of graph partitioning. This paper addresses the efficient parallelization of matrix-vector multiplication through hypergraph partitioning techniques using CUDA GPU-based parallel computing. CUDA (compute unified device architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on the GPU (graphics processing unit).
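After partitioning distributes rows and vector entries, each GPU runs an ordinary sparse matrix-vector kernel on its local block. A basic CSR sketch, with our own (hypothetical) names, is shown below.

    // Row-per-thread sparse matrix-vector product in CSR format (sketch).
    #include <cuda_runtime.h>

    __global__ void spmvCsr(const int* rowPtr, const int* colIdx,
                            const float* val, const float* x,
                            float* y, int nRows)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r >= nRows) return;
        float sum = 0.0f;
        for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)  // nonzeros of row r
            sum += val[j] * x[colIdx[j]];
        y[r] = sum;
    }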
TH-A-18C-04: Ultrafast Cone-Beam CT Scatter Correction with GPU-Based Monte Carlo Simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Y; Southern Medical University, Guangzhou; Bai, T
2014-06-15
Purpose: Scatter artifacts severely degrade the image quality of cone-beam CT (CBCT). We present an ultrafast scatter correction framework using GPU-based Monte Carlo (MC) simulation and a prior patient CT image, aiming to automatically finish the whole process, including both scatter correction and reconstruction, within 30 seconds. Methods: The method consists of six steps: 1) FDK reconstruction using raw projection data; 2) rigid registration of the planning CT to the FDK results; 3) MC scatter calculation at sparse view angles using the planning CT; 4) interpolation of the calculated scatter signals to other angles; 5) removal of scatter from the raw projections; 6) FDK reconstruction using the scatter-corrected projections. In addition to using the GPU to accelerate MC photon simulations, we also use a small number of photons and a down-sampled CT image in the simulation to further reduce computation time. A novel denoising algorithm is used to eliminate MC scatter noise caused by low photon numbers. The method is validated on head-and-neck cases with simulated and clinical data. Results: We have studied the impact of photon histories and volume down-sampling factors on the accuracy of scatter estimation. Fourier analysis was conducted to show that scatter images calculated at 31 angles are sufficient to restore those at all angles with <0.1% error. For the simulated case with a resolution of 512×512×100, we simulated 10M photons per angle. The total computation time is 23.77 seconds on an Nvidia GTX Titan GPU. The scatter-induced shading/cupping artifacts are substantially reduced, and the average HU error of a region-of-interest is reduced from 75.9 to 19.0 HU. Similar results were found for a real patient case. Conclusion: A practical ultrafast MC-based CBCT scatter correction scheme is developed. The whole process of scatter correction and reconstruction is accomplished within 30 seconds. This study is supported in part by NIH (1R01CA154747-01) and The Core Technology Research in Strategic Emerging Industry, Guangdong, China (2011A081402003).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Y; Mazur, T; Green, O
Purpose: The clinical commissioning of IMRT subject to a magnetic field is challenging. The purpose of this work is to develop a GPU-accelerated Monte Carlo dose calculation platform based on PENELOPE and then use the platform to validate a vendor-provided MRIdian head model toward quality assurance of clinical IMRT treatment plans subject to a 0.35 T magnetic field. Methods: We first translated PENELOPE from FORTRAN to C++ and validated that the translation produced equivalent results. Then we adapted the C++ code to CUDA in a workflow optimized for the GPU architecture. We expanded upon the original code to include voxelized transport boosted by Woodcock tracking, faster electron/positron propagation in a magnetic field, and several features that make gPENELOPE highly user-friendly. Moreover, we incorporated the vendor-provided MRIdian head model into the code. We performed a set of experimental measurements on MRIdian to examine the accuracy of both the head model and gPENELOPE, and then applied gPENELOPE toward independent validation of patient doses calculated by MRIdian's KMC. Results: We achieve an average acceleration factor of 152 compared to the original single-thread FORTRAN implementation with the original accuracy preserved. For 16 treatment plans including stomach (4), lung (2), liver (3), adrenal gland (2), pancreas (2), spleen (1), mediastinum (1) and breast (1), the MRIdian dose calculation engine agrees with gPENELOPE with a mean gamma passing rate of 99.1% ± 0.6% (2%/2 mm). Conclusions: We developed a Monte Carlo simulation platform based on a GPU-accelerated version of PENELOPE. We validated that both the vendor-provided head model and the fast Monte Carlo engine used by the MRIdian system are accurate in modeling radiation transport in a patient using 2%/2 mm gamma criteria. Future applications of this platform will include dose validation and accumulation, IMRT optimization, and dosimetry system modeling for next-generation MR-IGRT systems.
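Woodcock (delta) tracking, the voxelized-transport booster mentioned above, samples free paths with a global majorant cross-section and accepts a collision in voxel v with probability sigma(v)/sigma_max, so photons never have to stop at voxel boundaries. The device function below is a hedged sketch under our own assumptions (uniform cubic voxels, hypothetical names); it is not gPENELOPE's actual interface.

    // Woodcock (delta) tracking sketch: returns distance to a real collision,
    // or a negative value when the photon leaves the volume.
    #include <cuda_runtime.h>
    #include <curand_kernel.h>
    #include <math.h>

    __device__ float woodcockDistance(float3 pos, float3 dir,
                                      const float* sigma,  // per-voxel total xsec
                                      int nx, int ny, int nz, float vox,
                                      float sigmaMax, curandState* rng)
    {
        float s = 0.0f;
        for (;;) {
            s += -logf(curand_uniform(rng)) / sigmaMax;   // majorant free path
            float px = pos.x + s*dir.x, py = pos.y + s*dir.y, pz = pos.z + s*dir.z;
            int ix = (int)(px / vox), iy = (int)(py / vox), iz = (int)(pz / vox);
            if (ix < 0 || iy < 0 || iz < 0 || ix >= nx || iy >= ny || iz >= nz)
                return -1.0f;                             // photon left the volume
            if (curand_uniform(rng) * sigmaMax <= sigma[(iz*ny + iy)*nx + ix])
                return s;                                 // real collision accepted
            // otherwise: virtual collision, keep flying
        }
    }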
High-performance computing on GPUs for resistivity logging of oil and gas wells
NASA Astrophysics Data System (ADS)
Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.
2017-10-01
We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were made using the NVIDIA CUDA technology and computing libraries, allowing us to perform the decomposition of the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on the CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications
NASA Astrophysics Data System (ADS)
Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris
2015-07-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled ¹²⁵I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms per 10⁶ simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
Nguyen, Van-Giang; Lee, Soo-Jin
2016-07-01
Iterative reconstruction from Compton scattered data is known to be computationally more challenging than that from conventional line-projection based emission data in that the gamma rays that undergo Compton scattering are modeled as conic projections rather than line projections. In conventional tomographic reconstruction, to parallelize the projection and backprojection operations using the graphics processing unit (GPU), approximated methods that use an unmatched pair of ray-tracing forward projector and voxel-driven backprojector have been widely used. In this work, we propose a new GPU-accelerated method for Compton camera reconstruction which is more accurate by using an exactly matched pair of projector and backprojector. To calculate conic forward projection, we first sample the cone surface into conic rays and accumulate the intersecting chord lengths of the conic rays passing through voxels using a fast ray-tracing method (RTM). For conic backprojection, to obtain the true adjoint of the conic forward projection, while retaining the computational efficiency of the GPU, we use a voxel-driven RTM which is essentially the same as the standard RTM used for the conic forward projector. Our simulation results show that, while the new method is about 3 times slower than the approximated method, it is still about 16 times faster than the CPU-based method without any loss of accuracy. The net conclusion is that our proposed method is guaranteed to retain the reconstruction accuracy regardless of the number of iterations by providing a perfectly matched projector-backprojector pair, which makes iterative reconstruction methods for Compton imaging faster and more accurate.
Petascale turbulence simulation using a highly parallel fast multipole method on GPUs
NASA Astrophysics Data System (ADS)
Yokota, Rio; Barba, L. A.; Narumi, Tetsu; Yasuoka, Kenji
2013-03-01
This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096³ computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and R implementations, and outperforms the multi-core CPU implementation as the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using the GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out, and about 80 times the speed when no genes were filtered out. PMID:25602758
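The coefficient step described above parallelizes naturally over gene pairs. In the stripped-down sketch below, expression rows are assumed pre-standardized (zero mean, unit variance per gene), so Pearson's r reduces to a dot product over samples; the flattened pair indexing and all names are our own.

    // One Pearson coefficient per gene pair, pre-standardized input (sketch).
    #include <cuda_runtime.h>

    __global__ void pearsonPairs(const float* z,  // nGenes x nSamples, standardized
                                 float* r,        // one coefficient per (i,j) pair
                                 int nGenes, int nSamples)
    {
        long pair = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long nPairs = (long)nGenes * (nGenes - 1) / 2;
        if (pair >= nPairs) return;
        long i = 0, offset = pair;                   // unflatten pair -> (i, j), i < j
        while (offset >= nGenes - 1 - i) { offset -= nGenes - 1 - i; ++i; }
        long j = i + 1 + offset;
        float dot = 0.0f;
        for (int s = 0; s < nSamples; ++s)
            dot += z[i * nSamples + s] * z[j * nSamples + s];
        r[pair] = dot / nSamples;                    // Pearson r for standardized rows
    }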
2013-03-01
[Table fragment] Timing and throughput of GPU-targeted actor implementations on an NVIDIA GTX260 (peak 715 GFLOPS), reported in Giga Floating Point Operations Per Second (GFLOPS): Cascade Gaussian Filtering: 13 ms, 45.19 GFLOPS (6.3% of GPU peak); Difference of Gaussian: 0.512 ms, 152… GFLOPS (remaining values truncated in the source).
Wan Chan Tseung, H; Ma, J; Beltran, C
2015-06-01
Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on graphics processing units (GPUs). However, these MCs usually use simplified models for nonelastic proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and nonelastic proton-nucleus collisions. Using the CUDA framework, the authors implemented GPU kernels for the following tasks: (1) simulation of beam spots from our possible scanning nozzle configurations, (2) proton propagation through CT geometry, taking into account nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) modeling of the intranuclear cascade stage of nonelastic interactions when they occur, (4) simulation of nuclear evaporation, and (5) statistical error estimates on the dose. To validate our MC, the authors performed (1) secondary particle yield calculations in proton collisions with therapeutically relevant nuclei, (2) dose calculations in homogeneous phantoms, (3) recalculations of complex head and neck treatment plans from a commercially available treatment planning system, and compared with GEANT4.9.6p2/TOPAS. Yields, energy, and angular distributions of secondaries from nonelastic collisions on various nuclei are in good agreement with the GEANT4.9.6p2 Bertini and Binary cascade models. The 3D-gamma pass rate at 2%-2 mm for treatment plan simulations is typically 98%. The net computational time on an NVIDIA GTX680 card, including all CPU-GPU data transfers, is ∼20 s for 1 × 10⁷ proton histories. Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle nonelastic interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with GEANT4.9.6p2/TOPAS. Our MC is being integrated into a framework to perform fast routine clinical QA of pencil-beam based treatment plans, and is being used as the dose calculation engine in a clinically applicable MC-based IMPT treatment planning system. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.
A Graphics Processing Unit Implementation of Coulomb Interaction in Molecular Dynamics.
Jha, Prateek K; Sknepnek, Rastko; Guerrero-García, Guillermo Iván; Olvera de la Cruz, Monica
2010-10-12
We report a GPU implementation in HOOMD Blue of long-range electrostatic interactions based on the orientation-averaged Ewald sum scheme, introduced by Yakub and Ronchi (J. Chem. Phys. 2003, 119, 11556). The performance of the method is compared to an optimized CPU version of the traditional Ewald sum available in LAMMPS, in the molecular dynamics of electrolytes. Our GPU implementation is significantly faster than the CPU implementation of the Ewald method for small to a sizable number of particles (∼10⁵). Thermodynamic and structural properties of monovalent and divalent hydrated salts in the bulk are calculated for a wide range of ionic concentrations. An excellent agreement between the two methods was found at the level of electrostatic energy, heat capacity, radial distribution functions, and integrated charge of the electrolytes.
GPU accelerated manifold correction method for spinning compact binaries
NASA Astrophysics Data System (ADS)
Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying
2018-04-01
The graphics processing unit (GPU) acceleration of the manifold correction algorithm based on compute unified device architecture (CUDA) technology is designed to simulate the dynamic evolution of the post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and efficiency of parallel computation on the GPU have been confirmed by various numerical experiments. The numerical comparisons show that the results of the manifold correction method executed on the GPU agree well with those of the CPU-based execution. The acceleration achieved when the codes run on the GPU can be increased enormously through the use of shared memory and register optimization techniques without additional hardware costs; the speedup is nearly 13 times compared with the codes executed on the CPU for a phase-space scan (including 314 × 314 orbits). In addition, the GPU-accelerated manifold correction method is used to numerically study how the dynamics are affected by the spin-induced quadrupole-monopole interaction for black hole binary systems.
Maurer, S A; Kussmann, J; Ochsenfeld, C
2014-08-07
We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPUs). The scaling is reduced from O(N⁵) to O(N³) by a reformulation of the MP2 expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals, in combination with efficient sparse algebra for the 3-center integral transformation. In contrast to previous works that employ GPUs for post-Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows us to replace the rate-determining contraction step with a modified J-engine algorithm, which has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU server.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, T; Lin, H; Xu, X
Purpose: To develop a nuclear medicine dosimetry module for the GPU-based Monte Carlo code ARCHER. Methods: We have developed a nuclear medicine dosimetry module for the fast Monte Carlo code ARCHER. The coupled electron-photon Monte Carlo transport kernel included in ARCHER is built upon the Dose Planning Method code (DPM). The developed module manages the radioactive decay simulation by consecutively tracking several types of radiation on a per-disintegration basis using the statistical sampling method. Optimization techniques such as persistent threads and prefetching are studied and implemented. The developed module is verified against the VIDA code, which is based on the Geant4 toolkit and has previously been verified against OLINDA/EXM. A voxelized geometry is used in the preliminary test: a sphere made of ICRP soft tissue is surrounded by a box filled with water. A uniform activity distribution of I-131 is assumed in the sphere. Results: The self-absorption dose factors (mGy/MBqs) of the sphere with varying diameters are calculated by ARCHER and VIDA, respectively. ARCHER's results are in agreement with VIDA's, which were obtained from a previous publication. VIDA takes hours of CPU time to finish the computation, while ARCHER takes 4.31 seconds for the 12.4-cm uniform activity sphere case. For a fairer CPU-GPU comparison, more effort will be made to eliminate the algorithmic differences. Conclusion: The coupled electron-photon Monte Carlo code ARCHER has been extended to radioactive decay simulation for nuclear medicine dosimetry. The developed code exhibits good performance in our preliminary test. The GPU-based Monte Carlo code is developed with grant support from the National Institute of Biomedical Imaging and Bioengineering through an R01 grant (R01EB015478).
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report graphics processing unit (GPU) and OpenMP parallel implementations of water-specific force calculations and of bond constraints for use in molecular dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss the design of the code in detail and illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is a four-fold increase over the factor of 10 reported for our initial GPU implementation, which did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints on all bonds, runs in parallel on multiple OpenMP cores or entirely on the GPU. It is based on the conjugate gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of particle mesh Ewald (PME).
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as many-core streaming processors like the graphics processing unit (GPU). In this paper, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multigrid (AMG) preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of the GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87× speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51× versus a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-07
Monte Carlo (MC) particle transport simulation on a graphics processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry, which limits their application scope. The purpose of this paper is to develop a module to model parametric geometry and integrate it into GPU-based MC simulations. In our module, each continuous region is defined by its bounding surfaces, which are parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated into two previously developed GPU-based MC packages and was tested on two example problems: (1) low-energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry; the average dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data; the highest computational speed was achieved when the data were stored in the GPU's shared memory. Incorporation of parameterized geometry yielded a computation time approximately 3 times that of the corresponding voxelized geometry. We also developed a strategy that uses an auxiliary index array to reduce the frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged from 1.75 to 2.03 times that of the voxelized geometry for coupled photon/electron transport, depending on the voxel dimension of the auxiliary index array, and from 0.69 to 1.23 times for photon-only transport.
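The core geometry routine such a parametric-geometry module needs is the distance along a ray to a quadric surface a x^2 + b y^2 + c z^2 + d xy + e yz + f xz + g x + h y + i z + j = 0, obtained by solving a quadratic in the ray parameter t. The sketch below is illustrative only (the degenerate near-linear case A ≈ 0 is ignored), and the struct layout and names are our assumptions.

    // Ray-quadric intersection: smallest positive root of A t^2 + B t + C = 0.
    #include <cuda_runtime.h>
    #include <math.h>

    struct Quadric { float a, b, c, d, e, f, g, h, i, j; };

    __device__ float rayQuadric(const Quadric& q, float3 p, float3 u)
    {
        float A = q.a*u.x*u.x + q.b*u.y*u.y + q.c*u.z*u.z
                + q.d*u.x*u.y + q.e*u.y*u.z + q.f*u.x*u.z;
        float B = 2.f*(q.a*p.x*u.x + q.b*p.y*u.y + q.c*p.z*u.z)
                + q.d*(p.x*u.y + p.y*u.x) + q.e*(p.y*u.z + p.z*u.y)
                + q.f*(p.x*u.z + p.z*u.x) + q.g*u.x + q.h*u.y + q.i*u.z;
        float C = q.a*p.x*p.x + q.b*p.y*p.y + q.c*p.z*p.z + q.d*p.x*p.y
                + q.e*p.y*p.z + q.f*p.x*p.z + q.g*p.x + q.h*p.y + q.i*p.z + q.j;
        float disc = B*B - 4.f*A*C;
        if (disc < 0.f) return -1.f;              // ray misses the surface
        float s = sqrtf(disc);
        float t1 = (-B - s) / (2.f*A), t2 = (-B + s) / (2.f*A);
        if (t1 > 1e-6f) return t1;                // nearest hit in front of origin
        if (t2 > 1e-6f) return t2;
        return -1.f;
    }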
Neylon, J; Min, Y; Kupelian, P; Low, D A; Santhanam, A
2017-04-01
In this paper, a multi-GPU cloud-based server (MGCS) framework is presented for dose calculations, exploring the feasibility of remote computing power for the parallelization and acceleration of computationally and time intensive radiotherapy tasks in moving toward online adaptive therapies. An analytical model was developed to estimate theoretical MGCS performance acceleration and intelligently determine workload distribution. Numerical studies were performed with a computing setup of 14 GPUs distributed over 4 servers interconnected by a 1 gigabit per second (Gbps) network. Inter-process communication methods were optimized to facilitate resource distribution and minimize data transfers over the server interconnect. The analytically predicted computation times matched experimental observations within 1-5%. MGCS performance approached a theoretical limit of acceleration proportional to the number of GPUs utilized when computational tasks far outweighed memory operations. The MGCS implementation reproduced ground-truth dose computations with negligible differences by distributing the work among several processes and implementing optimization strategies. The results showed that a cloud-based computation engine is a feasible solution for enabling clinics to make use of fast dose calculations for advanced treatment planning and adaptive radiotherapy. The cloud-based system was able to exceed the performance of a local machine even for optimized calculations, and provided significant acceleration for computationally intensive tasks. Such a framework can provide access to advanced technology and computational methods for many clinics, providing an avenue for standardization across institutions without the requirements of purchasing, maintaining, and continually updating hardware.
NASA Astrophysics Data System (ADS)
Liu, Guofeng; Li, Chun
2016-08-01
In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general-purpose graphics processing unit. First, we consider the three main optimizations of the PSTM GPU code: designing a reasonable kernel execution configuration, using texture memory for velocity interpolation, and applying intrinsic functions in device code. This approach achieves a speedup of nearly 45 times on an NVIDIA GTX 680 GPU compared with the CPU code when a larger imaging space is used, where the PSTM output is a common reflection point gather stored as I[nx][ny][nh][nt] in matrix format. However, this method requires more memory, so with a limited imaging space it cannot fully exploit the GPU resources. To overcome this problem, we designed a PSTM scheme with multiple GPUs that images different seismic data on different GPUs by offset value. This achieves the peak speedup of the GPU PSTM code and greatly increases the efficiency of the calculations, without changing the imaging result.
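A toy version of the inner PSTM loop helps fix ideas: each thread migrates one trace sample onto one image point using the double-square-root travel time. The 2D geometry, single input trace, and all names below are our simplifying assumptions, not the paper's code.

    // Simplified Kirchhoff time-migration accumulation kernel (sketch).
    #include <cuda_runtime.h>
    #include <math.h>

    __global__ void pstmAccumulate(const float* trace, int nt, float dt,
                                   float xs, float xr,  // source/receiver x
                                   float* image, int nx, float dx,
                                   const float* vrms)   // rms velocity per sample
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int it = blockIdx.y * blockDim.y + threadIdx.y; // output time index
        if (ix >= nx || it >= nt) return;
        float x = ix * dx, t0 = 0.5f * it * dt, v = vrms[it]; // one-way time t0
        float ts = sqrtf(t0*t0 + (x - xs)*(x - xs) / (v*v));  // source leg
        float tr = sqrtf(t0*t0 + (x - xr)*(x - xr) / (v*v));  // receiver leg
        int k = (int)((ts + tr) / dt);
        if (k < nt)  // atomic so many traces can be migrated concurrently
            atomicAdd(&image[ix * nt + it], trace[k]);
    }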
Rapid earthquake detection through GPU-Based template matching
NASA Astrophysics Data System (ADS)
Mu, Dawei; Lee, En-Jui; Chen, Po
2017-12-01
The template-matching algorithm (TMA) has been widely adopted for improving the reliability of earthquake detection. The TMA is based on calculating the normalized cross-correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments. In realistic applications, the computational cost of the TMA is much higher than that of traditional techniques. In this study, we provide an analysis of the TMA and show how the GPU architecture provides an almost ideal environment for accelerating the TMA and NCC-based pattern recognition algorithms in general. So far, our best-performing GPU code has achieved a speedup factor of more than 800 with respect to a common sequential CPU code. We demonstrate the performance of our GPU code using seismic waveform recordings from the ML 6.6 Meinong earthquake sequence in Taiwan.
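The NCC computation at the core of the TMA maps naturally onto one GPU thread per candidate lag, as in this schematic CUDA sketch; template statistics are precomputed on the host, all names are illustrative, and the authors' optimized kernel will differ.

// One thread evaluates the normalized cross-correlation of a template
// against the continuous trace at one candidate lag.
__global__ void nccKernel(const float* trace, int nTrace,
                          const float* tmpl, int nTmpl,
                          float tmplMean, float tmplNorm, // sqrt(sum((t-mean)^2))
                          float* ncc) {
    int lag = blockIdx.x * blockDim.x + threadIdx.x;
    if (lag > nTrace - nTmpl) return;

    float sum = 0.0f, sumSq = 0.0f, dot = 0.0f;
    for (int i = 0; i < nTmpl; ++i) {
        float x = trace[lag + i];
        sum += x;
        sumSq += x * x;
        dot += x * (tmpl[i] - tmplMean);   // equals sum((x-xbar)(t-tbar)),
    }                                      // since sum(t - tmplMean) = 0
    float mean = sum / nTmpl;
    float norm = sqrtf(fmaxf(sumSq - nTmpl * mean * mean, 1e-20f));
    ncc[lag] = dot / (norm * tmplNorm);
}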
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening
NASA Astrophysics Data System (ADS)
Maureira-Fredes, Cristián; Amaro-Seoane, Pau
2018-01-01
A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. Combined with different numerical techniques and special-purpose hardware, it can be used to speed up the calculations, but such approaches tend to be cumbersome to work with. We present a new graphics processing unit (GPU), direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY is highly modular, allowing users to readily introduce new physics; it exploits available computational resources and will be maintained with regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
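The O(N^2) force evaluation that dominates such an integrator is commonly written as one GPU thread per particle, as in the following schematic CUDA kernel with Plummer softening. This is a generic sketch of the acceleration part only (a Hermite integrator also needs the jerk), not GRAVIDY's actual kernel.

// Direct-summation gravitational acceleration with softening eps.
__global__ void accelKernel(const float4* pos /* xyz = position, w = mass */,
                            float3* acc, int n, float eps2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 ai = make_float3(0.0f, 0.0f, 0.0f);
    float4 pi = pos[i];
    for (int j = 0; j < n; ++j) {          // the self-term vanishes: dx=dy=dz=0
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2; // softened distance^2
        float inv = rsqrtf(r2);
        float w = pj.w * inv * inv * inv;              // m_j / r^3
        ai.x += w * dx; ai.y += w * dy; ai.z += w * dz;
    }
    acc[i] = ai;
}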
Wilson, J Adam; Williams, Justin C
2009-01-01
The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed a 250 ms block of 1000 channels in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.
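The spatial-filter step described above is a single dense matrix-matrix product, which on a GPU reduces to one cuBLAS call. The sketch below is a generic illustration with assumed dimensions and names, not the study's code.

#include <cublas_v2.h>

// Spatial filter as Y = W * X, with W (filters x channels) applied to a
// block of samples X (channels x samples), all matrices column-major.
void spatialFilter(cublasHandle_t h, const float* dW, const float* dX,
                   float* dY, int nFilters, int nChannels, int nSamples) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                nFilters, nSamples, nChannels,
                &alpha, dW, nFilters, dX, nChannels, &beta, dY, nFilters);
}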
SU-E-J-91: FFT Based Medical Image Registration Using a Graphics Processing Unit (GPU).
Luce, J; Hoggarth, M; Lin, J; Block, A; Roeske, J
2012-06-01
To evaluate the efficiency gains obtained from using a Graphics Processing Unit (GPU) to perform a Fourier Transform (FT) based image registration. Fourier-based image registration involves obtaining the FT of the component images and analyzing them in Fourier space to determine the translations and rotations of one image set relative to another. An important property of FT registration is that by enlarging the images (adding additional pixels), one can obtain translations and rotations with sub-pixel resolution. The expense, however, is an increased computational time. GPUs may decrease the computational time associated with FT image registration by taking advantage of their parallel architecture to perform matrix computations much more efficiently than a Central Processing Unit (CPU). In order to evaluate the computational gains produced by a GPU, images with known translational shifts were utilized. A program was written in the Interactive Data Language (IDL; Exelis, Boulder, CO) to perform CPU-based calculations. Subsequently, the program was modified using GPU bindings (Tech-X, Boulder, CO) to perform GPU-based computation on the same system. Multiple image sizes were used, ranging from 256×256 to 2304×2304. The time required to complete the full algorithm on the CPU and GPU was benchmarked, and the speed increase was defined as the ratio of the CPU-to-GPU computational time. The ratio of the CPU-to-GPU time was greater than 1.0 for all images, indicating that the GPU performed the algorithm faster than the CPU. The smallest improvement, a 1.21 ratio, was found with the smallest image size of 256×256, and the largest speedup, a 4.25 ratio, was observed with the largest image size of 2304×2304. GPU programming resulted in a significant decrease in computational time associated with a FT image registration algorithm. The inclusion of the GPU may provide near real-time, sub-pixel registration capability. © 2012 American Association of Physicists in Medicine.
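For pure translations, the Fourier-space analysis described above is commonly realized as a phase-correlation (cross-power spectrum) step; a CUDA/cuFFT sketch of that step follows. The authors worked in IDL with GPU bindings, so this is an illustrative analogue, not their implementation.

#include <cufft.h>
#include <cuComplex.h>

// Cross-power-spectrum step of phase correlation. F1 and F2 are the
// forward FFTs of the two images (e.g. from cufftExecC2C); the inverse
// FFT of R has a sharp peak at the translation between them.
__global__ void crossPowerSpectrum(const cufftComplex* F1,
                                   const cufftComplex* F2,
                                   cufftComplex* R, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex c = cuCmulf(F1[i], cuConjf(F2[i])); // F1 * conj(F2)
    float mag = cuCabsf(c);
    R[i] = (mag > 1e-12f)
         ? make_cuComplex(cuCrealf(c) / mag, cuCimagf(c) / mag) // keep phase only
         : make_cuComplex(0.0f, 0.0f);
}

Zero-padding the images before the FFTs refines the peak location, which is the sub-pixel enlargement trade-off the abstract describes.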
SU-E-T-558: Monte Carlo Photon Transport Simulations On GPU with Quadric Geometry
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chi, Y; Tian, Z; Jiang, S
Purpose: Monte Carlo simulation on GPU has experienced rapid advancements over the past few years and tremendous accelerations have been achieved. Yet existing packages were developed only in voxelized geometry. In some applications, e.g. radioactive seed modeling, simulations in more complicated geometry are needed. This abstract reports our initial efforts towards developing a quadric geometry module aiming at expanding the application scope of GPU-based MC simulations. Methods: We defined the simulation geometry as consisting of a number of homogeneous bodies, each specified by its material composition and limiting surfaces characterized by quadric functions. A tree data structure was utilized to define the geometric relationship between different bodies. We modified our GPU-based photon MC transport package to incorporate this geometry. Specifically, geometry parameters were loaded into the GPU's shared memory for fast access. Geometry functions were rewritten to enable the identification of the body that contains the current particle location via a fast searching algorithm based on the tree data structure. Results: We tested our package in an example problem of HDR-brachytherapy dose calculation for a shielded cylinder. The dose under the quadric geometry and that under the voxelized geometry agreed in 94.2% of total voxels within the 20% isodose line based on a statistical t-test (95% confidence level), where the reference dose was defined to be the one at 0.5 cm away from the cylinder surface. It took 243 s to transport 100 million source photons under this quadric geometry on an NVIDIA Titan GPU card. Compared with the simulation time of 99.6 s in the voxelized geometry, including quadric geometry reduced efficiency due to the complicated geometry-related computations. Conclusion: Our GPU-based MC package has been extended to support photon transport simulation in quadric geometry. Satisfactory accuracy was observed with a reduced efficiency. Developments for charged particle transport in this geometry are currently in progress.
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processing unit (GPU)-based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of the 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses the Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of the DOT programs are developed in this study: (1) a conventional C language program augmented by GPU CUDA and CULA routines (C GPU) and (2) a MATLAB program supported by the MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on the host CPU and the GPU system is presented for the C and MATLAB implementations. The forward computation uses the finite element method (FEM), and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time achieved for one iteration of the DOT reconstruction for 14610 elements is 0.52 seconds for the C-based GPU program for 2-plane measurements; the corresponding MATLAB-based GPU program took 0.86 seconds. The maximum reconstruction rate achieved is 2 frames per second. PMID:24891848
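The Broyden approach referred to above replaces a full Jacobian recomputation with the standard rank-one secant update, stated generically here (not in the paper's notation):

\[
J_{k+1} = J_k + \frac{\left(\Delta\mathbf{f}_k - J_k\,\Delta\mathbf{x}_k\right)\,\Delta\mathbf{x}_k^{\mathsf{T}}}{\Delta\mathbf{x}_k^{\mathsf{T}}\,\Delta\mathbf{x}_k},
\qquad
\Delta\mathbf{x}_k = \mathbf{x}_{k+1}-\mathbf{x}_k,\quad
\Delta\mathbf{f}_k = \mathbf{f}(\mathbf{x}_{k+1})-\mathbf{f}(\mathbf{x}_k),
\]

so each iteration costs one matrix-vector product and a rank-one update instead of a full FEM-based Jacobian rebuild.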
NASA Astrophysics Data System (ADS)
Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.
2014-06-01
For nuclear reactor analysis such as neutron eigenvalue calculations, the time-consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA) and tested on an NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence, thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large number of global memory transactions. Possible solutions to improve the code efficiency are discussed.
NASA Astrophysics Data System (ADS)
Cai, Xiaohui; Liu, Yang; Ren, Zhiming
2018-06-01
Reverse-time migration (RTM) is a powerful tool for imaging geologically complex structures such as steep dips and subsalt. However, its implementation is quite computationally expensive. Recently, as a low-cost solution, the graphics processing unit (GPU) was introduced to improve the efficiency of RTM. In this paper, we develop three strategies to improve the implementation of RTM on GPU cards. First, given the high accuracy and efficiency of the adaptive optimal finite-difference (FD) method based on least squares (LS) on the central processing unit (CPU), we study the optimal LS-based FD method on GPU. Second, we extend the CPU-based hybrid absorbing boundary condition (ABC) to a GPU-based one by addressing two issues that arise when the former is introduced to a GPU card: its high time cost and chaotic threads. Third, for large-scale data, a combined strategy of optimal checkpointing and efficient boundary storage is introduced to trade off memory against recomputation. To save the time of communication between host and disk, a portable operating system interface (POSIX) thread is utilized to engage another CPU core at the checkpoints. Applications of the three strategies in RTM on GPU with the compute unified device architecture (CUDA) programming language demonstrate their efficiency and validity.
NASA Astrophysics Data System (ADS)
Wei, Hui; Gong, Guanghong; Li, Ni
2017-10-01
The computer-generated hologram (CGH) is a promising 3D display technology, but it is challenged by a heavy computational load and vast memory requirements. To solve these problems, a depth-compensating CGH calculation method based on the symmetry and similarity of zone plates is proposed and implemented on a graphics processing unit (GPU). An improved look-up table (LUT) method is put forward to compute the distances between object points and hologram pixels in the XY direction. The concept of a depth-compensating factor is defined and used to calculate the holograms of points at different depth positions, in place of layer-based methods. The proposed method is suitable for arbitrarily sampled objects, with lower memory usage and higher computational efficiency compared to other CGH methods. The effectiveness of the proposed method is validated by numerical and optical experiments.
GPU COMPUTING FOR PARTICLE TRACKING
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize an accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges from introducing the GPU are also discussed. General purpose computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to well optimize existing applications. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU due to its embarrassingly parallel nature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MPs) with 32 cores each, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
GPU-based ultra-fast dose calculation using a finite size pencil beam model.
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B
2009-10-21
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedups ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
Park, Hyeong-Gyu; Shin, Yeong-Gil; Lee, Ho
2015-12-01
A ray-driven backprojector is based on ray-tracing, which computes the length of the intersection between the ray paths and each voxel to be reconstructed. To reduce the computational burden caused by these exhaustive intersection tests, we propose a fully graphics processing unit (GPU)-based ray-driven backprojector in conjunction with a ray-culling scheme that enables straightforward parallelization without compromising the high computing performance of a GPU. The purpose of the ray-culling scheme is to reduce the number of ray-voxel intersection tests by excluding rays irrelevant to a specific voxel computation. This rejection step is based on an axis-aligned bounding box (AABB) enclosing a region of voxel projection, where eight vertices of each voxel are projected onto the detector plane. The range of the rectangular-shaped AABB is determined by min/max operations on the coordinates in the region. Using the indices of pixels inside the AABB, the rays passing through the voxel can be identified and the voxel is weighted as the length of intersection between the voxel and the ray. This procedure makes it possible to reflect voxel-level parallelization, allowing an independent calculation at each voxel, which is feasible for a GPU implementation. To eliminate redundant calculations during ray-culling, a shared-memory optimization is applied to exploit the GPU memory hierarchy. In experimental results using real measurement data with phantoms, the proposed GPU-based ray-culling scheme reconstructed a volume of resolution 280×280×176 in 77 seconds from 680 projections of resolution 1024×768, which is 26 times and 7.5 times faster than standard CPU-based and GPU-based ray-driven backprojectors, respectively. Qualitative and quantitative analyses showed that the ray-driven backprojector provides high-quality reconstruction images when compared with those generated by the Feldkamp-Davis-Kress algorithm using a pixel-driven backprojector, with an average of 2.5 times higher contrast-to-noise ratio, 1.04 times higher universal quality index, and 1.39 times higher normalized mutual information. © The Author(s) 2014.
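The AABB construction described above can be sketched in CUDA as follows, assuming for illustration a simple point-source projection with the beam axis along +z; the actual system geometry and names in the paper differ.

struct PixelBox { int u0, u1, v0, v1; };

// Project a voxel's eight corners onto the detector plane and take the
// min/max of the projected coordinates; only rays through pixels inside
// the returned box can intersect the voxel, so all other ray-voxel
// intersection tests are culled.
__device__ PixelBox voxelAABB(const float3 c[8], float3 src, float sdd,
                              float du, float dv) {
    float umin = 1e30f, umax = -1e30f, vmin = 1e30f, vmax = -1e30f;
    for (int k = 0; k < 8; ++k) {
        // Central projection onto a detector at distance sdd from the
        // source (assumes the voxel lies between source and detector).
        float s = sdd / (c[k].z - src.z);
        float u = (src.x + s * (c[k].x - src.x)) / du;
        float v = (src.y + s * (c[k].y - src.y)) / dv;
        umin = fminf(umin, u); umax = fmaxf(umax, u);
        vmin = fminf(vmin, v); vmax = fmaxf(vmax, v);
    }
    PixelBox b;
    b.u0 = (int)floorf(umin); b.u1 = (int)ceilf(umax);
    b.v0 = (int)floorf(vmin); b.v1 = (int)ceilf(vmax);
    return b;
}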
NASA Astrophysics Data System (ADS)
Song, Chenchen; Martínez, Todd J.
2016-05-01
We present a tensor hypercontracted (THC) scaled opposite spin second order Møller-Plesset perturbation theory (SOS-MP2) method. By using THC, we reduce the formal scaling of SOS-MP2 with respect to molecular size from quartic to cubic. We achieve further efficiency by exploiting sparsity in the atomic orbitals and using graphical processing units (GPUs) to accelerate integral construction and matrix multiplication. The practical scaling of GPU-accelerated atomic orbital-based THC-SOS-MP2 calculations is found to be N^2.6 for reference data sets of water clusters and alanine polypeptides containing up to 1600 basis functions. The errors in correlation energy with respect to density-fitting-SOS-MP2 are less than 0.5 kcal/mol for all systems tested (up to 162 atoms).
Real-time generation of infrared ocean scene based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Zhaoyi; Wang, Xun; Lin, Yun; Jin, Jianqiu
2007-12-01
Infrared (IR) image synthesis for ocean scenes has become more and more important nowadays, especially for remote sensing and military applications. Although a number of works present ready-to-use simulations, those techniques cover only a few possible ways of water interacting with the environment, and the detailed calculation of ocean temperature has rarely been considered by previous investigators. With the advance of programmable features of graphics cards, many algorithms previously limited to offline processing have become feasible for real-time usage. In this paper, we propose an efficient algorithm for real-time rendering of an infrared ocean scene using the newest features of programmable graphics processors (GPUs). It differs from previous works in three aspects: adaptive GPU-based ocean surface tessellation, a sophisticated thermal balance equation for the ocean surface, and GPU-based rendering of the infrared ocean scene. Finally, some infrared image results are shown, which are in good accordance with real images.
A high precision extrapolation method in multiphase-field model for simulating dendrite growth
NASA Astrophysics Data System (ADS)
Yang, Cong; Xu, Qingyan; Liu, Baicheng
2018-05-01
The phase-field method coupled with thermodynamic data has become a trend for predicting microstructure formation in technical alloys. Nevertheless, the frequent access to the thermodynamic database and the calculation of local equilibrium conditions can be time-intensive. Extrapolation methods, which are derived based on Taylor expansion, can provide approximate results with high computational efficiency, and have been proven successful in applications. This paper presents a high-precision second-order extrapolation method for calculating the driving force in phase transformation. To obtain the phase compositions, different methods of solving the quasi-equilibrium condition are tested, and the M-slope approach is chosen for its best accuracy. The developed second-order extrapolation method, along with the M-slope approach and the first-order extrapolation method, is applied to simulate dendrite growth in a Ni-Al-Cr ternary alloy. The results of the extrapolation methods are compared with the exact solution with respect to the composition profile and dendrite tip position, which demonstrates the high precision and efficiency of the newly developed algorithm. To accelerate the phase-field and extrapolation computation, a graphics processing unit (GPU) based parallel computing scheme is developed. The application to large-scale simulation of multi-dendrite growth in an isothermal cross-section demonstrates the ability of the developed GPU-accelerated second-order extrapolation approach for the multiphase-field model.
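Generically, such extrapolation methods truncate the Taylor expansion of the driving force about a reference state; schematically (the paper's actual expressions involve the thermodynamic database variables),

\[
\Delta G(c,T) \approx \Delta G(c_0,T_0)
+ \left.\frac{\partial \Delta G}{\partial c}\right|_0 (c-c_0)
+ \left.\frac{\partial \Delta G}{\partial T}\right|_0 (T-T_0)
+ \frac{1}{2}\left.\frac{\partial^2 \Delta G}{\partial c^2}\right|_0 (c-c_0)^2 + \cdots,
\]

with a first-order method keeping only the linear terms and a second-order method retaining the quadratic ones, so the database is queried only at the reference state rather than at every grid point and time step.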
DOE Office of Scientific and Technical Information (OSTI.GOV)
Maurer, S. A.; Kussmann, J.; Ochsenfeld, C., E-mail: Christian.Ochsenfeld@cup.uni-muenchen.de
2014-08-07
We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPU). The scaling is reduced from O(N^5) to O(N^3) by a reformulation of the MP2 expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals, in combination with efficient sparse algebra for the 3-center integral transformation. In contrast to previous works that employ GPUs for post-Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows us to replace the rate-determining contraction step with a modified J-engine algorithm that has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU server.
Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-01-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911
NASA Astrophysics Data System (ADS)
Komura, Yukihiro; Okabe, Yutaka
2014-03-01
We present sample CUDA programs for the GPU computing of the Swendsen-Wang multi-cluster spin flip algorithm. We deal with the classical spin models: the Ising model, the q-state Potts model, and the classical XY model. As for the lattice, both the 2D (square) lattice and the 3D (simple cubic) lattice are treated. We already reported the idea of the GPU implementation for 2D models (Komura and Okabe, 2012). Here we explain the details of the sample programs, and discuss the performance of the present GPU implementation for the 3D Ising and XY models. We also show the calculated results of the moment ratio for these models, and discuss phase transitions.
Catalogue identifier: AERM_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AERM_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 5632
No. of bytes in distributed program, including test data, etc.: 14688
Distribution format: tar.gz
Programming language: C, CUDA
Computer: System with an NVIDIA CUDA enabled GPU
Operating system: System with an NVIDIA CUDA enabled GPU
Classification: 23
External routines: NVIDIA CUDA Toolkit 3.0 or newer
Nature of problem: Monte Carlo simulation of classical spin systems. The Ising, q-state Potts, and classical XY models are treated on both two-dimensional and three-dimensional lattices.
Solution method: GPU-based Swendsen-Wang multi-cluster spin flip Monte Carlo method. The CUDA implementation of the cluster labeling is based on the work by Hawick et al. [1] and that by Kalentev et al. [2].
Restrictions: The system size is limited by the memory of the GPU.
Running time: For the parameters used in the sample programs, each program takes about a minute; this depends on the system size, the number of Monte Carlo steps, etc.
References: [1] K.A. Hawick, A. Leist, and D.P. Playne, Parallel Computing 36 (2010) 655-678. [2] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider, J. Parallel Distrib. Comput. 71 (2011) 615-620.
Testing and Validating Gadget2 for GPUs
NASA Astrophysics Data System (ADS)
Wibking, Benjamin; Holley-Bockelmann, K.; Berlind, A. A.
2013-01-01
We are currently upgrading a version of Gadget2 (Springel et al., 2005) that is optimized for NVIDIA's CUDA GPU architecture (Frigaard, unpublished) to work with the latest libraries and graphics cards. Preliminary tests of its performance indicate a ~40x speedup in the particle force tree approximation calculation, with an overall speedup of 5-10x for cosmological simulations run with GPUs compared to running on the same CPU cores without GPU acceleration. We believe this speedup can reasonably be increased by an additional factor of two with further optimization, including overlap of computation on CPU and GPU. Tests of single-precision GPU numerical fidelity currently indicate accuracy of the mass function and the power spectral density to within a few percent of extended-precision CPU results with the unmodified form of Gadget. Additionally, we plan to test and optimize the GPU code for Millennium-scale "grand challenge" simulations of >10^9 particles, a scale that has been previously untested with this code, with the aid of the NSF XSEDE flagship GPU-based supercomputing cluster codenamed "Keeneland." Current work involves additional validation of numerical results, extending the numerical precision of the GPU calculations to double precision, and evaluating performance/accuracy tradeoffs. We believe that this project, if successful, will yield substantial computational performance benefits to the N-body research community as the next generation of GPU supercomputing resources becomes available, both increasing the electrical power efficiency of ever-larger computations (making simulations possible a decade from now at scales and resolutions unavailable today) and accelerating the pace of research in the field.
CUDA-based high-performance computing of the S-BPF algorithm with no-waiting pipelining
NASA Astrophysics Data System (ADS)
Deng, Lin; Yan, Bin; Chang, Qingmei; Han, Yu; Zhang, Xiang; Xi, Xiaoqi; Li, Lei
2015-10-01
The backprojection-filtration (BPF) algorithm has become a good solution for local reconstruction in cone-beam computed tomography (CBCT). However, the reconstruction speed of BPF is a severe limitation for clinical applications. The selective-backprojection filtration (S-BPF) algorithm was developed to improve the parallel performance of BPF by selective backprojection. Furthermore, the general-purpose graphics processing unit (GP-GPU) is a popular tool for accelerating the reconstruction, and much work has been performed on optimizing the cone-beam backprojection. As the cone-beam backprojection becomes faster, data transport holds a much larger share of the reconstruction time than before. This paper focuses on minimizing the total reconstruction time of the S-BPF algorithm by hiding the data transport among hard disk, CPU, and GPU. Based on an analysis of the S-BPF algorithm, several strategies are implemented: (1) asynchronous calls are used to overlap the execution of the CPU and the GPU, (2) an innovative strategy is applied to obtain the DBP image so as to hide the transport time effectively, and (3) the data-transport and calculation streams are synchronized by cudaEvent in the inverse finite Hilbert transform on the GPU. Our main contribution is a smart reconstruction with the S-BPF algorithm in which the GPU computes continuously and the data transport costs no extra time: a 512^3 volume is reconstructed in less than 0.7 s on a single Tesla K20 GPU from 182 projection views with 512^2 pixels per projection.
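The copy/compute overlap with cudaEvent synchronization described in strategy (3) follows a standard CUDA pattern, sketched below with illustrative names and a stand-in kernel; a production version would also guard the double buffer against a copy overrunning the previous kernel on the same buffer.

__global__ void hilbertKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i]; // stand-in for the inverse-Hilbert step
}

void pipelined(const float* hBuf /* pinned host memory */, float* dBuf[2],
               float* dOut, int nChunks, int chunkElems) {
    cudaStream_t copyStream, calcStream;
    cudaEvent_t chunkReady;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&calcStream);
    cudaEventCreate(&chunkReady);
    dim3 block(256), grid((chunkElems + 255) / 256);

    for (int c = 0; c < nChunks; ++c) {
        // Stage chunk c to the GPU asynchronously in the copy stream.
        cudaMemcpyAsync(dBuf[c % 2], hBuf + (size_t)c * chunkElems,
                        chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(chunkReady, copyStream);
        // The compute stream waits only for this chunk, not all copies,
        // so transfers of chunk c+1 overlap the kernel on chunk c.
        cudaStreamWaitEvent(calcStream, chunkReady, 0);
        hilbertKernel<<<grid, block, 0, calcStream>>>(
            dBuf[c % 2], dOut + (size_t)c * chunkElems, chunkElems);
    }
    cudaStreamSynchronize(calcStream);
}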
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for the GPU hardware architecture is of great significance. In order to address the high computational complexity and poor real-time performance of blind image restoration, the midfrequency-based algorithm for blind image restoration is analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is used to restore the image with hardly any recursion or iteration. Combining the algorithm's data intensiveness and data-parallel nature with the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed, which is suitable for GPU stream computing. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and the midfrequency-based filtering. For better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain, after optimizing data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data so that the transmission rate works around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
NASA Astrophysics Data System (ADS)
Niwase, Hiroaki; Takada, Naoki; Araki, Hiromitsu; Maeda, Yuki; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2016-09-01
Parallel calculations of large-pixel-count computer-generated holograms (CGHs) are suitable for multiple-graphics processing unit (multi-GPU) cluster systems. However, it is not easy for a multi-GPU cluster system to accomplish fast CGH calculations when CGH transfers between PCs are required. In these cases, the CGH transfer between the PCs becomes a bottleneck. Usually, this problem occurs only in multi-GPU cluster systems with a single spatial light modulator. To overcome this problem, we propose a simple method using the InfiniBand network. The computational speed of the proposed method using 13 GPUs (NVIDIA GeForce GTX TITAN X) was more than 3000 times faster than that of a CPU (Intel Core i7 4770) when the number of three-dimensional (3-D) object points exceeded 20,480. In practice, we achieved ~40 tera floating-point operations per second (TFLOPS) when the number of 3-D object points exceeded 40,960. Our proposed method was able to reconstruct a real-time movie of a 3-D object comprising 95,949 points.
Bedez, Mathieu; Belhachmi, Zakaria; Haeberlé, Olivier; Greget, Renaud; Moussaoui, Saliha; Bouteiller, Jean-Marie; Bischoff, Serge
2016-01-15
Solving a model describing the electrical activity of neural tissue and its propagation within this tissue is highly demanding in terms of computing time and requires strong computing power to achieve good results. In this study, we present a method for solving a model describing electrical propagation in neuronal tissue using the parareal algorithm, coupled with spatial parallelization using CUDA on a graphics processing unit (GPU). We applied the method to different dimensions of the geometry of our model (1-D, 2-D and 3-D). The GPU results are compared with simulations from a multi-core processor cluster using the message-passing interface (MPI), where the spatial scale was parallelized in order to reach a calculation time comparable to that of the presented GPU method. A gain of a factor of 100 in computational time between the sequential results and those obtained using the GPU was achieved in the case of 3-D geometry. Given the structure of the GPU, this factor increases with the fineness of the geometry used in the computation. To the best of our knowledge, it is the first time such a method has been used, even in the field of neuroscience. Parallelization in time coupled with GPU parallelization in space allows for drastically reducing the computational time with a fine resolution of the model describing the propagation of the electrical signal in a neuronal tissue. Copyright © 2015 Elsevier B.V. All rights reserved.
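The parareal iteration underlying this time parallelization has the standard correction form (a generic statement, with F the fine, expensive propagator and G the coarse one):

\[
U_{n+1}^{k+1} = \mathcal{G}\left(U_n^{k+1}\right) + \mathcal{F}\left(U_n^{k}\right) - \mathcal{G}\left(U_n^{k}\right),
\]

where the expensive fine propagations F(U_n^k) over all time windows are mutually independent and can run concurrently, here on the GPU, while only the cheap coarse sweep G remains serial.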
NASA Astrophysics Data System (ADS)
Lin, Hui; Liu, Tianyu; Su, Lin; Bednarz, Bryan; Caracappa, Peter; Xu, X. George
2017-09-01
Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator, is critical to the simulation. The purpose of this paper is to perform source modelling, examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (a.k.a. Xeon Phi) and NVIDIA GPUs using ARCHER, and explore potential optimization methods. Phase-space-based source modelling has been implemented. Good agreement was found in a tomotherapy prostate patient case and a TrueBeam breast case. In terms of performance, the whole simulation for the prostate plan and the breast plan took about 173 s and 73 s, respectively, with 1% statistical error.
A non-voxel-based broad-beam (NVBB) framework for IMRT treatment planning.
Lu, Weiguo
2010-12-07
We present a novel framework that enables very large scale intensity-modulated radiation therapy (IMRT) planning in limited computation resources with improvements in cost, plan quality and planning throughput. Current IMRT optimization uses a voxel-based beamlet superposition (VBS) framework that requires pre-calculation and storage of a large amount of beamlet data, resulting in large temporal and spatial complexity. We developed a non-voxel-based broad-beam (NVBB) framework for IMRT capable of direct treatment parameter optimization (DTPO). In this framework, both the objective function and its derivative are evaluated from the continuous viewpoint, abandoning 'voxel' and 'beamlet' representations. Thus pre-calculation and storage of beamlets are no longer needed. The NVBB framework has linear complexities (O(N^3)) in both space and time. The low-memory, full-computation and data-parallel nature of the framework renders it efficient to implement on the graphics processing unit (GPU). We implemented the NVBB framework and incorporated it into the TomoTherapy treatment planning system (TPS). The new TPS runs on a single workstation with one GPU card (NVBB-GPU). Extensive verification/validation tests were performed in house and via third parties. Benchmarks on dose accuracy, plan quality and throughput were compared with the commercial TomoTherapy TPS that is based on the VBS framework and uses a computer cluster with 14 nodes (VBS-cluster). For all tests, the dose accuracy of the two TPSs is comparable (within 1%). Plan qualities were comparable, with no clinically significant difference for most cases, except that superior target uniformity was seen in the NVBB-GPU for some cases. However, the planning time using the NVBB-GPU was reduced manyfold relative to the VBS-cluster. In conclusion, we developed a novel NVBB framework for IMRT optimization. The continuous viewpoint and DTPO nature of the algorithm eliminate the need for beamlets and lead to better plan quality. The parallelization of computation on a GPU instead of a computer cluster significantly reduces hardware and service costs. Compared with the current VBS framework on a computer cluster, the planning time is significantly reduced using the NVBB framework on a single workstation with a GPU card.
gpuPOM: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration work for climate models ports only partial code for certain hot spots and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point and design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We apply several acceleration methods to further improve the performance of gpuPOM, including optimizing memory access on a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the CPU host and the GPU. Our experimental results indicate that the performance of gpuPOM on a workstation containing 4 GPUs is comparable to that of a powerful cluster with 408 CPU cores, while reducing energy consumption by a factor of 6.8.
Fast 3D elastic micro-seismic source location using new GPU features
NASA Astrophysics Data System (ADS)
Xue, Qingfeng; Wang, Yibo; Chang, Xu
2016-12-01
In this paper, we describe new GPU features and their applications in passive seismic micro-seismic location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. In contrast to traditional ray-based methods, wave equation methods, such as the one we use in this paper, have a remarkable advantage in adapting to low signal-to-noise-ratio conditions and do not require manual data selection. However, because of their conspicuous computational cost, these methods are not widely used in industrial fields. To make the method practical, we implement imaging-like wave equation micro-seismic location in a 3D elastic medium and use the GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a more than 30% performance improvement in the GPU implementation just by using these new features.
Application of a GPU-Assisted Maxwell Code to Electromagnetic Wave Propagation in ITER
NASA Astrophysics Data System (ADS)
Kubota, S.; Peebles, W. A.; Woodbury, D.; Johnson, I.; Zolfaghari, A.
2014-10-01
The Low Field Side Reflectometer (LSFR) on ITER is envisioned to provide capabilities for electron density profile and fluctuations measurements in both the plasma core and edge. The current design for the Equatorial Port Plug 11 (EPP11) employs seven monostatic antennas for use with both fixed-frequency and swept-frequency systems. The present work examines the characteristics of this layout using the 3-D version of the GPU-Assisted Maxwell Code (GAMC-3D). Previous studies in this area were performed with either 2-D full wave codes or 3-D ray- and beam-tracing. GAMC-3D is based on the FDTD method and can be run with either a fixed-frequency or modulated (e.g. FMCW) source, and with either a stationary or moving target (e.g. Doppler backscattering). The code is designed to run on a single NVIDIA Tesla GPU accelerator, and utilizes a technique based on the moving window method to overcome the size limitation of the onboard memory. Effects such as beam drift, linear mode conversion, and diffraction/scattering will be examined. Comparisons will be made with beam-tracing calculations using the complex eikonal method. Supported by U.S. DoE Grants DE-FG02-99ER54527 and DE-AC02-09CH11466, and the DoE SULI Program at PPPL.
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neylon, J., E-mail: jneylon@mednet.ucla.edu; Sheng, K.; Yu, V.
Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria, respectively. Accuracy was investigated using three distinct phantoms with varied geometries and heterogeneities and on a series of 14 segmented lung CT data sets. Performance gains were calculated using three 256 mm cube homogenous water phantoms, with isotropic voxel dimensions of 1, 2, and 4 mm. Results: The nonvoxel-based GPU algorithm was independent of the data size and provided significant computational gains over the CPU algorithm for large CT data sizes. The parameter search analysis also showed that the ray combination of 8 zenithal and 8 azimuthal angles along with 1 mm radial sampling and 2 mm parallel ray spacing maintained dose accuracy with greater than 99% of voxels passing the γ test. Combining the acceleration obtained from GPU parallelization with the sampling optimization, the authors achieved a total performance improvement factor of >175 000 when compared to our voxel-based ground truth CPU benchmark and a factor of 20 compared with a voxel-based GPU dose convolution method. Conclusions: The nonvoxel-based convolution method yielded substantial performance improvements over a generic GPU implementation, while maintaining accuracy as compared to a CPU computed ground truth dose distribution. Such an algorithm can be a key contribution toward developing tools for adaptive radiation therapy systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Yuhe; Mazur, Thomas R.; Green, Olga
Purpose: The clinical commissioning of IMRT subject to a magnetic field is challenging. The purpose of this work is to develop a GPU-accelerated Monte Carlo dose calculation platform based on PENELOPE and then use the platform to validate a vendor-provided MRIdian head model toward quality assurance of clinical IMRT treatment plans subject to a 0.35 T magnetic field. Methods: PENELOPE was first translated from FORTRAN to C++ and the result was confirmed to produce equivalent results to the original code. The C++ code was then adapted to CUDA in a workflow optimized for GPU architecture. The original code was expanded to include voxelized transport with Woodcock tracking, faster electron/positron propagation in a magnetic field, and several features that make gPENELOPE highly user-friendly. Moreover, the vendor-provided MRIdian head model was incorporated into the code in an effort to apply gPENELOPE as both an accurate and rapid dose validation system. A set of experimental measurements was performed on the MRIdian system to examine the accuracy of both the head model and gPENELOPE. Ultimately, gPENELOPE was applied toward independent validation of patient doses calculated by MRIdian's KMC. Results: An acceleration factor of 152 was achieved in comparison to the original single-thread FORTRAN implementation with the original accuracy being preserved. For 16 treatment plans including stomach (4), lung (2), liver (3), adrenal gland (2), pancreas (2), spleen (1), mediastinum (1), and breast (1), the MRIdian dose calculation engine agrees with gPENELOPE with a mean gamma passing rate of 99.1% ± 0.6% (2%/2 mm). Conclusions: A Monte Carlo simulation platform was developed based on a GPU-accelerated version of PENELOPE. This platform was used to validate that both the vendor-provided head model and the fast Monte Carlo engine used by the MRIdian system are accurate in modeling radiation transport in a patient using 2%/2 mm gamma criteria. Future applications of this platform will include dose validation and accumulation, IMRT optimization, and dosimetry system modeling for next generation MR-IGRT systems.
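The Woodcock tracking mentioned above samples free paths from a majorant cross section so that voxel boundaries never have to be ray-traced; a minimal CUDA sketch, with an illustrative cross-section lookup and cuRAND for sampling, might look like this (not gPENELOPE's actual code):

#include <curand_kernel.h>

// Delta (Woodcock) tracking: take tentative free flights drawn from the
// majorant cross section sigmaMax, then accept a real collision with
// probability sigma(local)/sigmaMax; rejected collisions are virtual.
__device__ float woodcockDistance(curandState* rng, float3 p, float3 dir,
                                  const float* sigma, int3 dim, float voxSize,
                                  float sigmaMax) {
    float s = 0.0f;
    for (;;) {
        s += -logf(curand_uniform(rng)) / sigmaMax; // tentative flight
        float3 q = make_float3(p.x + s * dir.x, p.y + s * dir.y, p.z + s * dir.z);
        int ix = (int)(q.x / voxSize), iy = (int)(q.y / voxSize),
            iz = (int)(q.z / voxSize);
        if (ix < 0 || iy < 0 || iz < 0 ||
            ix >= dim.x || iy >= dim.y || iz >= dim.z)
            return -1.0f; // particle escaped the volume
        float sigLocal = sigma[(iz * dim.y + iy) * dim.x + ix];
        if (curand_uniform(rng) * sigmaMax <= sigLocal)
            return s;     // real collision at path length s
    }
}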
GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid
NASA Astrophysics Data System (ADS)
Luo, Xisheng; Wang, Luying; Ran, Wei; Qin, Fenghua
2016-10-01
A GPU-accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, cell-based adaptive mesh refinement (AMR) is fully implemented on the GPU for the unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between GPU and CPU. Specifically, the AMR is processed with atomic operations to parallelize list operations, and null memory recycling is realized to improve the efficiency of memory utilization. The results obtained by GPUs agree very well with the exact or experimental results in the literature. An acceleration ratio of 4 is obtained between the parallel code running on the old GT9800 GPU and the serial code running on an E3-1230 V2 CPU. With the optimization of configuring a larger L1 cache and adopting shared-memory-based atomic operations on the newer C2050 GPU, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR processes achieve a 2x speedup on the GT9800 and 18x on the Tesla C2050, which demonstrates that running the cell-based AMR method in parallel on a GPU is feasible and efficient. Our results also indicate that the new development of GPU architecture benefits fluid dynamics computing significantly.
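The atomic-operation parallelization of list operations can be sketched as follows: each thread tests one cell and reserves a slot in a compact refinement list with atomicAdd. The refinement criterion and names are illustrative assumptions, not the solver's code.

// Build a compact list of cells flagged for refinement without any
// serial pass: atomicAdd hands each flagged cell a unique output slot.
__global__ void buildRefineList(const float* indicator, int nCells,
                                float threshold, int* refineList, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCells) return;
    if (indicator[i] > threshold) {
        int slot = atomicAdd(count, 1); // reserve a unique slot
        refineList[slot] = i;           // cell i is marked for refinement
    }
}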
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map, along with the Xorshift PRNG, has been employed. Implementation of the developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of the developed GPPRNG, its performance was compared to that of other available PRNGs, such as those of MATLAB and FORTRAN and the Park-Miller algorithm, using standard statistical tests. The results of this comparison show that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
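For reference, the xorshift component named above is tiny; a minimal 32-bit Marsaglia xorshift as a CUDA device function might look like the sketch below. The paper's generator additionally combines it with a middle-square step and a chaotic map, which are omitted here; whether the per-thread states live in global or shared memory is what distinguishes the paper's GLOBAL-MODE and SHARED-MODE.

```cuda
// Marsaglia's 32-bit xorshift; one nonzero state word per thread.
__device__ unsigned int xorshift32(unsigned int& s)
{
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

// Map the 32-bit integer to a uniform float in [0, 1) via 2^-32 scaling.
__device__ float uniform01(unsigned int& s)
{
    return xorshift32(s) * 2.3283064365386963e-10f;
}
```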
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs), is presented. We discuss the performance of this algorithm using the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated for the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
OCTGRAV: Sparse Octree Gravitational N-body Code on Graphics Processing Units
NASA Astrophysics Data System (ADS)
Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon
2010-10-01
Octgrav is a very fast tree-code which runs on massively parallel Graphics Processing Units (GPUs) with the NVIDIA CUDA architecture. The algorithms are based on parallel-scan and sort methods. The tree construction and calculation of multipole moments are carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of the interaction list, is carried out on the GPU. In this way, a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s are achieved. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. To test the performance and feasibility, we implemented the algorithms in CUDA in the form of a gravitational tree-code which runs completely on the GPU. The tree construction and traverse algorithms are portable to many-core devices which support the CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during tree construction and shows a performance improvement of more than a factor of 20 overall, resulting in a processing rate of more than 2.8 million particles per second. The code has a convenient user interface and is freely available for use.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, B; Southern Medical University, Guangzhou, Guangdong; Tian, Z
Purpose: While compressed sensing-based cone-beam CT (CBCT) iterative reconstruction techniques have demonstrated tremendous capability of reconstructing high-quality images from undersampled noisy data, their long computation times still hinder wide application in routine clinical use. The purpose of this study is to develop a reconstruction framework that employs modern consensus optimization techniques to achieve CBCT reconstruction on a multi-GPU platform for improved computational efficiency. Methods: Total projection data were evenly distributed to multiple GPUs. Each GPU performed reconstruction using its own projection data with a conventional total variation regularization approach to ensure image quality. In addition, the solutions from the GPUs were subject to a consistency constraint that they should be identical. We solved the optimization problem with all the constraints considered rigorously using an alternating direction method of multipliers (ADMM) algorithm. The reconstruction framework was implemented using OpenCL on a platform with two Nvidia GTX590 GPU cards, each with two GPUs. We studied the performance of our method and demonstrated its advantages through a simulation case with an NCAT phantom and an experimental case with a Catphan phantom. Results: Compared with the CBCT images reconstructed using the conventional FDK method with full projection datasets, our proposed method achieved comparable image quality with about one third of the projections. The computation time on the multi-GPU platform was ∼55 s and ∼35 s in the two cases, respectively, achieving a speedup factor of ∼3.0 compared with single-GPU reconstruction. Conclusion: We have developed a consensus ADMM-based CBCT reconstruction method which enables reconstruction on a multi-GPU platform. The achieved efficiency makes this method clinically attractive.
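For context, the consensus splitting described above has a standard scaled-form ADMM iteration (a textbook sketch after Boyd et al., not necessarily the authors' exact formulation), with f_g the data-fidelity-plus-TV objective on GPU g and G the number of GPUs (four here):

```latex
\begin{aligned}
x_g^{k+1} &= \arg\min_{x}\; f_g(x) + \tfrac{\rho}{2}\,\|x - z^{k} + u_g^{k}\|_2^2
            && \text{(per-GPU reconstruction)}\\
z^{k+1}   &= \frac{1}{G}\sum_{g=1}^{G}\bigl(x_g^{k+1} + u_g^{k}\bigr)
            && \text{(consensus average)}\\
u_g^{k+1} &= u_g^{k} + x_g^{k+1} - z^{k+1}
            && \text{(dual update)}
\end{aligned}
```

In this form the per-iteration inter-GPU communication reduces to one image-sized averaging step; each GPU's projection data never leave its card.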
Fast GPU-based computation of spatial multigrid multiframe LMEM for PET.
Nassiri, Moulay Ali; Carrier, Jean-François; Després, Philippe
2015-09-01
Significant efforts were invested during the last decade to accelerate PET list-mode reconstructions, notably with GPU devices. However, the computation time per event is still relatively long, and the list-mode efficiency on the GPU is well below the histogram-mode efficiency. Since list-mode data are not arranged in any regular pattern, costly accesses to the GPU global memory can hardly be optimized and geometrical symmetries cannot be used. To overcome the obstacles that limit the acceleration of list-mode reconstruction on the GPU, a multigrid and multiframe approach to an expectation-maximization algorithm was developed. The reconstruction process is started during data acquisition, and calculations are executed concurrently on the GPU and the CPU, while the system matrix is computed on-the-fly. A new convergence criterion was also introduced, which is computationally more efficient on the GPU. The implementation was tested on a Tesla C2050 GPU device for a Gemini GXL PET system geometry. The results show that the proposed algorithm (multigrid and multiframe list-mode expectation-maximization, MGMF-LMEM) converges to the same solution as the LMEM algorithm more than three times faster. The execution time of the MGMF-LMEM algorithm was 1.1 s per million events on the Tesla C2050 hardware used, for a reconstructed space of 188 × 188 × 57 voxels of 2 × 2 × 3.15 mm³. For 17- and 22-mm simulated hot lesions, the MGMF-LMEM algorithm led on the first iteration to contrast recovery coefficients (CRC) of more than 75% of the maximum CRC while achieving a minimum in the relative mean square error. Therefore, the MGMF-LMEM algorithm can be used as a one-pass method to perform real-time reconstructions for low-count acquisitions, as in list-mode gated studies. The computation time for one iteration and 60 million events was approximately 66 s.
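For readers unfamiliar with list-mode EM, the baseline update that MGMF-LMEM accelerates has the standard form below, with j indexing voxels, i indexing recorded events, and a_ij the system-matrix element computed on the fly:

```latex
\lambda_j^{(n+1)} \;=\; \frac{\lambda_j^{(n)}}{s_j}
\sum_{i \,\in\, \text{events}} \frac{a_{ij}}{\sum_k a_{ik}\,\lambda_k^{(n)}},
\qquad
s_j \;=\; \sum_i a_{ij},
```

where s_j is the voxel sensitivity (summed over all detectable lines of response, not just recorded events). The irregular memory access pattern of the inner sum over events is precisely the list-mode bottleneck discussed above.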
Accelerating Time Integration for the Shallow Water Equations on the Sphere Using GPUs
Archibald, R.; Evans, K. J.; Salinger, A.
2015-06-01
The push towards larger and larger computational platforms has made it possible for climate simulations to resolve climate dynamics across multiple spatial and temporal scales. This direction in climate simulation has created a strong need to develop scalable timestepping methods capable of accelerating throughput on high performance computing systems. This study details recent advances in the implementation of implicit time stepping of the spectral element dynamical core within the United States Department of Energy (DOE) Accelerated Climate Model for Energy (ACME) on graphics processing unit (GPU)-based machines. We demonstrate how solvers in the Trilinos project are interfaced with ACME and GPU kernels to increase the computational speed of the residual calculations in the implicit time stepping method for the atmosphere dynamics. We demonstrate the optimization gains and data structure reorganization that facilitate the performance improvements.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wan Chan Tseung, Hok Seum, E-mail: wanchantseung.hok@mayo.edu; Ma, Jiasen; Kreofsky, Cole R.
Purpose: Our aim is to demonstrate the feasibility of fast Monte Carlo (MC)-based inverse biological planning for the treatment of head and neck tumors in spot-scanning proton therapy. Methods and Materials: Recently, a fast and accurate graphics processing unit (GPU)-based MC simulation of proton transport was developed and used as the dose-calculation engine in a GPU-accelerated intensity modulated proton therapy (IMPT) optimizer. Besides dose, the MC can simultaneously score the dose-averaged linear energy transfer (LETd), which makes biological dose (BD) optimization possible. To convert from LETd to BD, a simple linear relation was assumed. By use of this novel optimizer, inverse biological planning was applied to 4 patients, including 2 small and 1 large thyroid tumor targets, as well as 1 glioma case. To create these plans, constraints were placed to maintain the physical dose (PD) within 1.25 times the prescription while maximizing target BD. For comparison, conventional intensity modulated radiation therapy (IMRT) and IMPT plans were also created using Eclipse (Varian Medical Systems) in each case. The same critical-structure PD constraints were used for the IMRT, IMPT, and biologically optimized plans. The BD distributions for the IMPT plans were obtained through MC recalculations. Results: Compared with standard IMPT, the biologically optimal plans for patients with small tumor targets displayed a BD escalation that was around twice the PD increase. Dose sparing to critical structures was improved compared with both IMRT and IMPT. No significant BD increase could be achieved for the large thyroid tumor case or when the presence of critical structures mitigated the contribution of additional fields. The calculation of the biologically optimized plans can be completed in a clinically viable time (<30 minutes) on a small 24-GPU system. Conclusions: By exploiting GPU acceleration, MC-based, biologically optimized plans were created for patients with small tumor targets. This optimizer will be used in an upcoming feasibility trial on LETd painting for radioresistant tumors.
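The "simple linear relation" between LETd and BD mentioned above is commonly written as follows (a sketch of the standard linear model; the proportionality constant c is model dependent, and the paper's exact value is not restated here):

```latex
\mathrm{BD} \;=\; \mathrm{PD}\,\bigl(1 + c\cdot\mathrm{LET_d}\bigr),
```

so that maximizing BD at a fixed PD ceiling amounts to redistributing high-LETd spots into the target, which is what the optimizer exploits.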
Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie
2010-09-13
In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective modality for small animal imaging. Finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, even though multi-threaded CPU and multi-CPU technology have already been applied. In this paper, we introduce for the first time a new kind of acceleration technology for the AFE framework in BLT: the graphics processing unit (GPU). Beyond raw processing speed, GPU technology strikes a balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming NVIDIA GPUs. With their help, it is easy to write code for NVIDIA GPUs without worrying about the hardware details of a specific GPU. Numerical experiments were designed to show the necessity, effect, and application of the proposed CUBLAS- and CULA-based GPU acceleration. The results of the experiments show that the proposed method improves the processing speed of the AFE framework considerably while maintaining a balance between cost and performance.
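To make the CUBLAS workflow above concrete, here is a minimal, self-contained example of the kind of call such a framework builds on: a dense matrix-vector product y = Ax on the GPU. Sizes and data are toy values, and this is not code from the paper; CULA plays the analogous role for LAPACK-style factorizations and solves.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 4;
    std::vector<float> hA(n * n, 1.0f), hx(n, 1.0f), hy(n, 0.0f);

    float *dA, *dx, *dy;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // y = alpha*A*x + beta*y  (column-major A, leading dimension n)
    cublasSgemv(h, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // 4.0 for the all-ones toy data

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```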
The approximation of anomalous magnetic field by array of magnetized rods
NASA Astrophysics Data System (ADS)
Byzov, Denis; Muravyev, Lev; Fedorova, Natalia
2017-07-01
A method for calculating the vertical component of an anomalous magnetic field from its absolute value is presented. The conversion is based on approximating the anomalies of the magnetic induction modulus by a set of singular sources and then calculating the vertical component of the field for the chosen distribution. Rods uniformly magnetized along their axes were used as the singular sources. An analysis of the applicability of different nonlinear optimization methods for solving the given task was carried out. The algorithm is implemented using parallel computing technology on NVIDIA GPUs. The approximation and the calculation of the vertical component are demonstrated for the regional magnetic field of northern Eurasia.
GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.
Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin
2017-07-01
Volume reconstruction plays an important role in improving the quality of reconstructed volumetric images in freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of the programmable graphics processing unit (GPU), we can achieve real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on the GPU to fill the remaining empty voxels. However, traditional pixel-nearest-neighbor hole-filling fails to reconstruct volumes with high image quality. By contrast, kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging, but at the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume reconstruction after the incremental reconstruction of freehand ultrasound. The experimental results show improved image quality, with speckle reduction and detail preservation, for a kernel window size of [Formula: see text] and a kernel bandwidth of 1.0. The proposed GPU-based method can be over 200 times faster than the central processing unit (CPU), and a volume with 50 million voxels in our experiment can be reconstructed within 10 seconds.
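As background for the method above, the zeroth-order (Nadaraya-Watson) form of kernel regression estimates each voxel value as a kernel-weighted average of nearby scattered samples (a generic sketch; the paper's exact kernel order and window may differ):

```latex
\hat{v}(\mathbf{x}) \;=\;
\frac{\sum_i K_h(\mathbf{x}-\mathbf{x}_i)\,v_i}
     {\sum_i K_h(\mathbf{x}-\mathbf{x}_i)},
\qquad
K_h(\mathbf{d}) \;=\; \exp\!\left(-\frac{\|\mathbf{d}\|^{2}}{2h^{2}}\right),
```

with h the kernel bandwidth (1.0 above) and the sum running over the B-scan samples inside the kernel window. Each output voxel is independent, so one GPU thread per voxel is the natural mapping.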
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
NASA Astrophysics Data System (ADS)
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on the mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design was optimized and the load was shared between the vertex and fragment shaders. The beamformed data were captured from a tissue-mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance was evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing the PSNR against a MATLAB gold standard with the same signal path. The CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method showed no significant difference from the MATLAB processing (i.e., PSNR<52.51 dB). Comparable CNR results were obtained from both processing methods (i.e., 11.31). With the mobile GPU implementation, frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
NASA Astrophysics Data System (ADS)
Rybakin, B.; Bogatencov, P.; Secrieru, G.; Iliuha, N.
2013-10-01
The paper deals with a parallel algorithm for calculations on multiprocessor computers and GPU accelerators. Results are presented for the interaction of shock waves with a low-density bubble and for the problem of gas flow under gravity. The algorithm combines the ability to capture shock waves at high resolution, second-order accuracy for TVD schemes, and low diffusion in the advection scheme. Many complex problems of continuum mechanics are solved numerically on structured or unstructured grids. Improving the accuracy of the calculations requires choosing a sufficiently fine grid (with a small cell size), at the cost of a substantial increase in computation time. Therefore, for complex problems it is reasonable to use adaptive mesh refinement (AMR): the grid is refined only in the regions of interest, where, e.g., shock waves are generated, or where complex geometry or other such features exist. Thus the computing time is greatly reduced. In addition, execution on the resulting sequence of nested, successively finer grids can be parallelized. The proposed algorithm is based on the AMR method. AMR can significantly improve the resolution of the difference grid in areas of high interest and, at the same time, accelerate the calculation of multi-dimensional problems. Parallel algorithms of the analyzed difference models were realized for calculations on graphics processors using the CUDA technology [1].
Cucheb: A GPU implementation of the filtered Lanczos procedure
NASA Astrophysics Data System (ADS)
Aurentz, Jared L.; Kalantzis, Vassilis; Saad, Yousef
2017-11-01
This paper describes the software package Cucheb, a GPU implementation of the filtered Lanczos procedure for the solution of large sparse symmetric eigenvalue problems. The filtered Lanczos procedure uses a carefully chosen polynomial spectral transformation to accelerate convergence of the Lanczos method when computing eigenvalues within a desired interval. This method has proven particularly effective for eigenvalue problems that arise in electronic structure calculations and density functional theory. We compare our implementation against an equivalent CPU implementation and show that using the GPU can reduce the computation time by more than a factor of 10. Program summary: Program title: Cucheb. Program files doi: http://dx.doi.org/10.17632/rjr9tzchmh.1. Licensing provisions: MIT. Programming language: CUDA C/C++. Nature of problem: Electronic structure calculations require the computation of all eigenvalue-eigenvector pairs of a symmetric matrix that lie inside a user-defined real interval. Solution method: To compute all the eigenvalues within a given interval, a polynomial spectral transformation is constructed that maps the desired eigenvalues of the original matrix to the exterior of the spectrum of the transformed matrix. The Lanczos method is then used to compute the desired eigenvectors of the transformed matrix, which are then used to recover the desired eigenvalues of the original matrix. The bulk of the operations are executed in parallel using a graphics processing unit (GPU). Runtime: Variable, depending on the number of eigenvalues sought and the size and sparsity of the matrix. Additional comments: Cucheb is compatible with CUDA Toolkit v7.0 or greater.
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures.
Neylon, J; Sheng, K; Yu, V; Chen, Q; Low, D A; Kupelian, P; Santhanam, A
2014-10-01
Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters, making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off between computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma (γ) analysis method was used to compare the dose distributions using 2% dose difference and 2 mm distance-to-agreement criteria. Accuracy was investigated using three distinct phantoms with varied geometries and heterogeneities and on a series of 14 segmented lung CT data sets. Performance gains were calculated using three 256 mm cube homogeneous water phantoms, with isotropic voxel dimensions of 1, 2, and 4 mm. The nonvoxel-based GPU algorithm was independent of the data size and provided significant computational gains over the CPU algorithm for large CT data sizes. The parameter search analysis also showed that the ray combination of 8 zenithal and 8 azimuthal angles along with 1 mm radial sampling and 2 mm parallel ray spacing maintained dose accuracy with greater than 99% of voxels passing the γ test. Combining the acceleration obtained from GPU parallelization with the sampling optimization, the authors achieved a total performance improvement factor of >175 000 when compared to their voxel-based ground truth CPU benchmark and a factor of 20 compared with a voxel-based GPU dose convolution method. The nonvoxel-based convolution method yielded substantial performance improvements over a generic GPU implementation, while maintaining accuracy as compared to a CPU computed ground truth dose distribution. Such an algorithm can be a key contribution toward developing tools for adaptive radiation therapy systems.
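The γ analysis used above (and in several other studies collected here) has the standard definition of Low et al.: a voxel at evaluated point r_e passes when γ(r_e) ≤ 1, where

```latex
\gamma(\mathbf{r}_e) \;=\; \min_{\mathbf{r}_c}
\sqrt{\;\frac{\|\mathbf{r}_c-\mathbf{r}_e\|^{2}}{\Delta d^{2}}
      \;+\;\frac{\bigl(D_c(\mathbf{r}_c)-D_e(\mathbf{r}_e)\bigr)^{2}}{\Delta D^{2}}\;},
```

with D_c and D_e the compared dose distributions, and Δd = 2 mm and ΔD = 2% of the reference dose for the criteria quoted above.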
GPU accelerated implementation of NCI calculations using promolecular density.
Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric
2017-05-30
The NCI approach is a modern tool to reveal noncovalent interactions in chemistry. It is particularly attractive for describing ligand-protein binding. A custom implementation of NCI using the promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three versions of the code is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version yields a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
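The quantity at the heart of the NCI method is the reduced density gradient, here evaluated from a promolecular density (a sum of unrelaxed atomic densities); regions of low density and low s reveal noncovalent contacts:

```latex
s(\mathbf{r}) \;=\; \frac{|\nabla\rho(\mathbf{r})|}
{2\,(3\pi^{2})^{1/3}\,\rho(\mathbf{r})^{4/3}},
\qquad
\rho^{\text{pro}}(\mathbf{r}) \;=\; \sum_{A}\rho_{A}^{\text{at}}\bigl(|\mathbf{r}-\mathbf{R}_{A}|\bigr).
```

Every grid point is evaluated independently, which is why the calculation maps so naturally onto thousands of GPU threads.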
A practical implementation of 3D TTI reverse time migration with multi-GPUs
NASA Astrophysics Data System (ADS)
Li, Chun; Liu, Guofeng; Li, Yihang
2017-05-01
Tilted transversely isotropic (TTI) media are a typical form of earth anisotropy encountered in practical observational studies. Accurate anisotropic imaging is recognized as a breakthrough in areas with complex anisotropic structures, and TTI reverse time migration (RTM) is an important method for these areas. However, P and SV waves are coupled together in the pseudo-acoustic wave equation, and the SV wave is regarded as an artifact in RTM of the P wave. We adopt matching of the anisotropy parameters to suppress the SV artifacts. Another problem in the implementation of TTI RTM is instability of the numerical solution for a variably oriented axis of symmetry. We adopt Fletcher's equation, setting a small SV velocity without an acoustic approximation, to stabilize the wavefield propagation. To improve calculation efficiency, we use an NVIDIA graphics processing unit (GPU) with the Compute Unified Device Architecture (CUDA) instead of a traditional CPU architecture. To accomplish this, we introduced a random velocity boundary and an extended homogeneous anisotropic boundary for the remaining four anisotropic parameters in the source propagation. This avoids large memory and IO requirements, which is important when using a GPU with the limited bandwidth of PCI-E. Furthermore, we extend the single-GPU code to multiple GPUs and present a corresponding highly concurrent strategy with multiple asynchronous streams, which achieves nearly the ideal 2:1 speedup over a single GPU. Synthetic tests validate the correctness and effectiveness of our multi-GPU TTI RTM method.
Understanding GPU Power. A Survey of Profiling, Modeling, and Simulation Methods
Bridges, Robert A.; Imam, Neena; Mintz, Tiffany M.
2016-09-01
Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high-throughput applications. Though GPUs consume large amounts of power, their use for high-throughput applications facilitates state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. Our work is a survey of GPU power modeling and profiling methods, with increased detail on noteworthy efforts. Moreover, as direct measurement of GPU power is necessary for model evaluation and parameter initialization, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, correlate strongly with power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling are discussed. Often building on the counter-based models, research efforts in GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulation for GPUs are included, along with their performance or functional simulator counterparts where appropriate. Lastly, possible directions for future research are discussed.
Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
2012-09-11
In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
Large-scale ground motion simulation using GPGPU
NASA Astrophysics Data System (ADS)
Aoi, S.; Maeda, T.; Nishizawa, N.; Aoki, T.
2012-12-01
Huge computational resources are required to perform large-scale ground motion simulations using the 3-D finite difference method (FDM) for realistic and complex models with high accuracy. Furthermore, thousands of simulations are necessary to evaluate the variability of the assessment caused by uncertainty in the assumed source models of future earthquakes. To overcome the problem of restricted computational resources, we introduced GPGPU (general-purpose computing on graphics processing units), the technique of using a GPU as an accelerator for computation traditionally conducted on the CPU. We employed the CPU version of GMS (Ground motion Simulator; Aoi et al., 2004) as the original code and implemented the GPU calculation using CUDA (Compute Unified Device Architecture). GMS is a total system for seismic wave propagation simulation based on a 3-D FDM scheme using discontinuous grids (Aoi & Fujiwara, 1999), which includes the solver as well as preprocessor tools (parameter generation) and postprocessor tools (filtering, visualization, and so on). The computational model is decomposed in the two horizontal directions and each decomposed part is allocated to a different GPU. We evaluated the performance of our newly developed GPU version of GMS on TSUBAME 2.0, one of Japan's fastest supercomputers, operated by the Tokyo Institute of Technology. First, we performed a strong scaling test using a model with about 22 million grid points and achieved speedups of 3.2 and 7.3 times using 4 and 16 GPUs, respectively. Next, we examined weak scaling, where the model size (number of grid points) increases in proportion to the degree of parallelism (number of GPUs). The result showed almost perfect linearity up to a simulation with 22 billion grid points using 1024 GPUs, where the calculation speed reached 79.7 TFlops, about 34 times faster than the CPU calculation using the same number of cores. Finally, we applied the GPU calculation to a simulation of the 2011 Tohoku-oki earthquake. The model was constructed using a slip model from inversion of strong motion data (Suzuki et al., 2012) and a geological and geophysical velocity structure model comprising the entire Tohoku and Kanto regions as well as the large source area, consisting of about 1.9 billion grid points. The overall characteristics of observed velocity seismograms for periods longer than 8 s were successfully reproduced (Maeda et al., 2012 AGU meeting). The turnaround time for a 50 thousand-step calculation (corresponding to 416 s of seismograms) using 100 GPUs was 52 minutes, which is fairly short, especially considering that this is the performance for a realistic and complex model.
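The inner loop offloaded in codes like GMS is a stencil update; the sketch below shows a generic second-order 3-D acoustic FDM step in CUDA. It is illustrative only: the discontinuous grids, higher-order stencils, and absorbing boundaries used by GMS are omitted, and all names are placeholders.

```cuda
// One time step of u_tt = v^2 * laplacian(u) with a 7-point Laplacian.
// un: field at time n, um: time n-1, up: output at time n+1, v: velocity.
__global__ void fdmStep(const float* un, const float* um, float* up,
                        const float* v, int nx, int ny, int nz,
                        float dt, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;                                   // skip boundary layer
    int id = (k * ny + j) * nx + i;
    float lap = (un[id - 1] + un[id + 1]
               + un[id - nx] + un[id + nx]
               + un[id - nx * ny] + un[id + nx * ny]
               - 6.0f * un[id]) / (h * h);
    float c = v[id] * dt;
    up[id] = 2.0f * un[id] - um[id] + c * c * lap;  // leapfrog update
}
```

Decomposing the model in the two horizontal directions, as described above, assigns each GPU a slab of this grid plus halo layers exchanged every step.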
NASA Astrophysics Data System (ADS)
Bhosale, Parag; Staring, Marius; Al-Ars, Zaid; Berendsen, Floris F.
2018-03-01
Currently, non-rigid image registration algorithms are too computationally intensive to use in time-critical applications. Existing implementations that focus on speed typically address this either by parallelization on GPU hardware or by introducing methodically novel techniques into CPU-oriented algorithms. Stochastic gradient descent (SGD) optimization and variations thereof have proven to drastically reduce the computational burden for CPU-based image registration, but have not been successfully applied on GPU hardware due to their stochastic nature. This paper proposes 1) NiftyRegSGD, an SGD optimization for the GPU-based image registration tool NiftyReg, and 2) the random chunk sampler, a new random sampling strategy that better utilizes the memory bandwidth of GPU hardware. Experiments were performed on 3D lung CT data of 19 patients, comparing NiftyRegSGD (with and without the random chunk sampler) with the CPU-based elastix Fast Adaptive SGD (FASGD) and NiftyReg. The registration runtime was 21.5 s, 4.4 s, and 2.8 s for elastix-FASGD, NiftyRegSGD without, and NiftyRegSGD with random chunk sampling, respectively, while similar accuracy was obtained. Our method is publicly available at https://github.com/SuperElastix/NiftyRegSGD.
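For context, the SGD iteration underlying both FASGD and NiftyRegSGD updates the transform parameters μ from a gradient estimated on a small random voxel subset S_k (a generic sketch of the scheme, not the exact elastix/NiftyReg notation):

```latex
\mu_{k+1} \;=\; \mu_k \;-\; \gamma_k\,\tilde{g}_k,
\qquad
\tilde{g}_k \;=\; \frac{1}{|S_k|}\sum_{\mathbf{x}_i \in S_k}
\frac{\partial\, C(\mu_k;\mathbf{x}_i)}{\partial \mu},
```

where C is the dissimilarity metric and γ_k a decaying step size. The random chunk sampler draws S_k as contiguous chunks of voxels rather than scattered points, so the GPU memory reads coalesce; that is the memory-bandwidth point made above.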
Ink Wash Painting Style Rendering With Physically-based Ink Dispersion Model
NASA Astrophysics Data System (ADS)
Wang, Yifan; Li, Weiran; Zhu, Qing
2018-04-01
This paper presents a real-time rendering method, based on the GPU programmable pipeline, for rendering 3D scenes in ink wash painting style. The method has three main parts: first, the ink properties of the 3D model are rendered by calculating its vertex curvature; then, the ink properties are cached to a paper structure, and an ink dispersion model defined with reference to the theory of porous media is used to simulate the dispersion of ink; finally, the ink properties are converted to pixel color information and rendered to the screen. This method achieves better visual quality than previous methods.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of MATLAB with GPU-enabled functions is used to develop programs that run on NVIDIA GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU, with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU-enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, R; Fallone, B; Cross Cancer Institute, Edmonton, AB
Purpose: To develop a graphics processing unit (GPU) accelerated deterministic solution to the linear Boltzmann transport equation (LBTE) for accurate dose calculations in radiotherapy (RT). A deterministic solution yields the potential for major speed improvements due to the sparse matrix-vector and vector-vector multiplications and would thus be of benefit to RT. Methods: In order to leverage the massively parallel architecture of GPUs, the first-order LBTE was reformulated as a second-order self-adjoint equation using the least squares finite element method (LSFEM). This produces a symmetric positive-definite matrix which is efficiently solved using a parallelized conjugate gradient (CG) solver. The LSFEM formalism is applied in space, discrete ordinates in angle, and the multigroup method in energy. The final linear system of equations produced is tightly coupled in space and angle. Our code, written in CUDA-C, was benchmarked on an Nvidia GeForce TITAN-X GPU against an Intel i7-6700K CPU. A spatial mesh of 30,950 tetrahedral elements was used with an S4 angular approximation. Results: To avoid repeating a full, computationally intensive finite element matrix assembly at each multigroup energy, a novel mapping algorithm was developed which minimizes the operations required at each energy. Additionally, a parallelized memory mapping for the Kronecker product between the sparse spatial and angular matrices, including Dirichlet boundary conditions, was created. Atomicity is preserved by graph-coloring overlapping nodes into separate kernel launches. The one-time mapping calculations for matrix assembly, Kronecker product, and boundary condition application took 452±1 ms on the GPU. Matrix assembly for 16 energy groups took 556±3 s on the CPU and 358±2 ms on the GPU using the mappings developed. The CG solver took 93±1 s on the CPU and 468±2 ms on the GPU. Conclusion: Three computationally intensive subroutines in deterministically solving the LBTE have been formulated on the GPU, resulting in a two-orders-of-magnitude speedup. Funding support from the Natural Sciences and Engineering Research Council and Alberta Innovates Health Solutions. Dr. Fallone is a co-founder and CEO of MagnetTx Oncology Solutions (under discussions to license the Alberta bi-planar linac MR for commercialization).
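The LSFEM reformulation above can be summarized in one line (a textbook sketch, not the authors' exact discretization): the first-order transport problem is replaced by its least-squares normal equations,

```latex
L\psi = q
\;\;\longrightarrow\;\;
\min_{\psi}\,\|L\psi - q\|^{2}
\;\;\Longleftrightarrow\;\;
L^{\dagger}L\,\psi \;=\; L^{\dagger}q,
```

where L is the streaming-plus-collision transport operator. Because L†L is symmetric positive-definite, the discretized system admits the parallel conjugate gradient solver described above.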
NASA Astrophysics Data System (ADS)
Laracuente, Nicholas; Grossman, Carl
2013-03-01
We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphics processing units (GPUs). Recent developments in hardware and software have allowed for general-purpose computing with inexpensive GPU hardware. These devices are better suited to emulating hardware autocorrelators than traditional CPU-based software applications, because they emphasize parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory accesses. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics PCI card in a computer with a 3.2 GHz Intel i5 CPU running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College.
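The core computation being emulated is simple; the CUDA sketch below assigns one thread per lag to accumulate an unnormalized correlation over binned photon counts. It is illustrative only: the multi-tau rebinning between lag groups and the normalization are omitted, and all names are placeholders.

```cuda
// g(tau) = sum_t n(t) * n(t+tau) over binned counts, one thread per lag.
__global__ void autocorr(const float* counts, int nBins, int maxLag, float* g)
{
    int tau = blockIdx.x * blockDim.x + threadIdx.x;
    if (tau >= maxLag) return;
    float acc = 0.0f;
    for (int t = 0; t + tau < nBins; ++t)
        acc += counts[t] * counts[t + tau];
    g[tau] = acc;
}
```

In a full multi-tau correlator, the signal is rebinned (e.g., pairwise-summed) between lag groups so that large lags are computed on coarse bins, keeping the total work logarithmic in the maximum lag.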
NASA Technical Reports Server (NTRS)
Gorospe, George E., Jr.; Daigle, Matthew J.; Sankararaman, Shankar; Kulkarni, Chetan S.; Ng, Eley
2017-01-01
Prognostic methods enable operators and maintainers to predict the future performance of critical systems. However, these methods can be computationally expensive and may need to be rerun each time new information about the system becomes available. In light of these computational requirements, we have investigated the application of graphics processing units (GPUs) as a computational platform for real-time prognostics. Recent advances in GPU technology have reduced cost and increased the computational capability of these highly parallel processing units, making them more attractive for the deployment of prognostic software. We present a survey of model-based prognostic algorithms with considerations for leveraging the parallel architecture of the GPU, and a case study of GPU-accelerated battery prognostics with computational performance results.
Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George
2014-01-01
Purpose: Using graphical processing unit (GPU) hardware technology, an extremely fast Monte Carlo (MC) code, ARCHERRT, is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For the clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. The gamma index test was performed for voxels whose dose is greater than 10% of the maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to the specific architecture of the GPU, the modified Woodcock tracking algorithm performed worse than the original one. ARCHERRT achieves fast speed for PSF-based dose calculations. With a single M2090 card, the simulations took about 60, 50, and 80 s for the three cases, respectively, with 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7-1.8 times faster. More impressively, six M2090 cards could finish the simulations in 8.9-13.4 s. For comparison, the same simulations on an Intel E5-2620 (12 hyperthreads) took about 500-800 s. Conclusions: ARCHERRT was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms. PMID:24989378
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoogcarspel, S J; Kontaxis, C; Velden, J M van der
2014-06-01
Purpose: To develop an MR accelerator-enabled online planning-to-delivery technique for stereotactic palliative radiotherapy treatment of spinal metastases. The technical challenges include automated stereotactic treatment planning, online MR-based dose calculation, and MR guidance during treatment. Methods: Using the CT data of 20 patients previously treated at our institution, a class solution for automated treatment planning for spinal bone metastases was created. For accurate dose simulation right before treatment, we fused geometrically correct online MR data with pretreatment CT data of the target volume (TV). For target tracking during treatment, a dynamic T2-weighted TSE MR sequence was developed. An in-house developed GPU-based IMRT optimization and dose calculation algorithm was used for fast treatment planning and simulation. An automatically generated treatment plan developed with this treatment planning system was irradiated on a clinical 6 MV linear accelerator and evaluated using a Delta4 dosimeter. Results: The automated treatment planning method yielded clinically viable plans for all patients. The MR-CT fusion based dose calculation accuracy was within 2% compared to calculations performed with the original CT data. The dynamic T2-weighted TSE MR sequence was able to provide an update of the anatomical location of the TV every 10 seconds. Dose calculation and optimization of the automatically generated treatment plans using only one GPU took on average 8 minutes. The Delta4 measurement of the irradiated plan agreed with the dose calculation with a 3%/3mm gamma pass rate of 86.4%. Conclusions: The development of an MR accelerator-enabled planning-to-delivery technique for stereotactic palliative radiotherapy treatment of spinal metastases was presented. Future work will involve developing an intrafraction motion adaptation strategy, MR-only dose calculation, radiotherapy quality assurance in a magnetic field, and streamlining the entire treatment process on an MR accelerator.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chi, Y; Li, Y; Tian, Z
2015-06-15
Purpose: Pencil-beam or superposition-convolution type dose calculation algorithms are routinely used in inverse plan optimization for intensity modulated radiation therapy (IMRT). However, due to their limited accuracy in some challenging cases, e.g. lung, the resulting dose may lose its optimality after being recomputed using an accurate algorithm, e.g. Monte Carlo (MC). The objective of this study is to evaluate the feasibility and advantages of a new method to include MC in the treatment planning process. Methods: We developed a scheme to iteratively perform MC-based beamlet dose calculations and plan optimization. In the MC stage, a GPU-based dose engine was used, and the number of particles sampled from a beamlet was proportional to its optimized fluence from the previous step. We tested this scheme in four lung cancer IMRT cases. For each case, the original plan dose, the plan dose recomputed by MC, and the dose optimized by our scheme were obtained. Clinically relevant dosimetric quantities in these three plans were compared. Results: Although the original plan achieved a satisfactory PTV dose coverage, after recomputing doses using the MC method, it was found that the PTV D95% was reduced by 4.60%-6.67%. After re-optimizing these cases with our scheme, the PTV coverage was improved to the same level as in the original plan, while the critical OAR coverages were maintained at clinically acceptable levels. Regarding the computation time, it took on average 144 s per case using only one GPU card, including both the MC-based beamlet dose calculation and the treatment plan optimization. Conclusion: The achieved dosimetric gains and high computational efficiency indicate the feasibility and advantages of the proposed MC-based IMRT optimization method. Comprehensive validations in more patient cases are in progress.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Modiri, A; Hagan, A; Gu, X
Purpose: 4D-IMRT planning, combined with dynamic MLC tracking delivery, utilizes the temporal dimension as an additional degree of freedom to achieve improved OAR sparing. The computational complexity of such optimization increases exponentially with dimensionality. In order to accomplish this task in a clinically feasible time frame, we present an initial implementation of GPU-based 4D-IMRT planning based on particle swarm optimization (PSO). Methods: The target and normal structures were manually contoured on ten phases of a 4DCT scan of an NSCLC patient with a 54 cm³ right-lower-lobe tumor (1.5 cm motion). Corresponding ten 3D-IMRT plans were created in the Eclipse treatment planning system (Ver-13.6). A vendor-provided scripting interface was used to export 3D dose matrices corresponding to each control point (10 phases × 9 beams × 166 control points = 14,940), which served as input to the PSO. The optimization task was to iteratively adjust the weights of each control point and scale the corresponding dose matrices. In order to handle the large amount of data in GPU memory, dose matrices were sparsified and placed in contiguous memory blocks with the 14,940 weight variables. PSO was implemented on CPU (dual-Xeon, 3.1 GHz) and GPU (dual-K20 Tesla, 2496 cores, 3.52 Tflops each) platforms. NiftyReg, an open-source deformable image registration package, was used to calculate the summed dose. Results: The 4D-PSO plan yielded PTV coverage comparable to the clinical ITV-based plan and significantly higher OAR sparing, as follows: lung Dmean=33%; lung V20=27%; spinal cord Dmax=26%; esophagus Dmax=42%; heart Dmax=0%; heart Dmean=47%. The GPU-PSO processing time for 14,940 variables and 7 PSO particles was 41% that of CPU-PSO (199 vs. 488 minutes). Conclusion: Truly 4D-IMRT planning can yield significant OAR dose sparing while preserving PTV coverage. The corresponding optimization problem is large-scale, non-convex and computationally rigorous. Our initial results indicate that GPU-based PSO with further software optimization can make such planning clinically feasible. This work was supported through funding from the National Institutes of Health and Varian Medical Systems.
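The PSO core parallelizes naturally over the 14,940 weight variables; the CUDA sketch below shows the canonical velocity/position update. It is a generic sketch, not the authors' implementation: w, c1, c2, the cuRAND state layout, and the non-negativity clamp on the weights are illustrative choices.

```cuda
#include <curand_kernel.h>

// One PSO iteration over the control-point weights x for one particle.
// pbest: this particle's best position; gbest: swarm-best position.
__global__ void psoUpdate(float* x, float* vel, const float* pbest,
                          const float* gbest, int nVars,
                          float w, float c1, float c2,
                          curandState* rng)   // one state per variable here;
                                              // a real code would use one per
                                              // particle-variable pair
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nVars) return;
    float r1 = curand_uniform(&rng[i]);
    float r2 = curand_uniform(&rng[i]);
    vel[i] = w * vel[i]
           + c1 * r1 * (pbest[i] - x[i])   // pull toward particle best
           + c2 * r2 * (gbest[i] - x[i]);  // pull toward swarm best
    x[i] = fmaxf(x[i] + vel[i], 0.0f);     // weights stay non-negative
}
```

The objective evaluation (summing the sparsified dose matrices under the current weights) dominates the runtime, which is why keeping those matrices resident in contiguous GPU memory, as described above, matters.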
Xia, Yidong; Lou, Jialin; Luo, Hong; ...
2015-02-09
Here, an OpenACC directive-based graphics processing unit (GPU) parallel scheme is presented for solving the compressible Navier-Stokes equations on 3D hybrid unstructured grids with a third-order reconstructed discontinuous Galerkin method. The developed scheme requires minimum code intrusion and algorithm alteration for upgrading a legacy solver with GPU computing capability at very little extra programming effort, which leads to a unified and portable code development strategy. A face coloring algorithm is adopted to eliminate the memory contention caused by the threading of internal and boundary face integrals. A number of flow problems are presented to verify the implementation of the developed scheme. Timing measurements were obtained by running the resulting GPU code on one Nvidia Tesla K20c GPU card (Nvidia Corporation, Santa Clara, CA, USA) and compared with those obtained by running the equivalent Message Passing Interface (MPI) parallel CPU code on a compute node (consisting of two AMD Opteron 6128 eight-core CPUs (Advanced Micro Devices, Inc., Sunnyvale, CA, USA)). Speedup factors of up to 24× and 1.6× for the GPU code were achieved with respect to one and 16 CPU cores, respectively. The numerical results indicate that this OpenACC-based parallel scheme is an effective and extensible approach for porting unstructured high-order CFD solvers to GPU computing.
Thread scheduling for GPU-based OPC simulation on multi-thread
NASA Astrophysics Data System (ADS)
Lee, Heejun; Kim, Sangwook; Hong, Jisuk; Lee, Sooryong; Han, Hwansoo
2018-03-01
As semiconductor product development based on shrinkage continues, the accuracy and difficulty required for model-based optical proximity correction (MBOPC) are increasing. OPC simulation time, which is the most time-consuming part of MBOPC, is rapidly increasing due to high pattern density in a layout and complex OPC models. To reduce OPC simulation time, we apply graphics processing units (GPUs) to MBOPC, because the OPC process parallelizes well. We address some issues that typically arise during GPU-based OPC simulation in a multi-threaded system, such as "out of memory" and "GPU idle time". To overcome these problems, we propose a thread scheduling method, which manages OPC jobs in multiple threads in such a way that simulation jobs from multiple threads are executed alternately on the GPU while correction jobs are executed at the same time on the CPU cores. It was observed that GPU peak memory usage decreases by up to 35%, and MBOPC runtime also decreases by 4%. In cases where out-of-memory issues occur in a multi-threaded environment, the thread scheduler improved MBOPC runtime by up to 23%.
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with graphics processing units (GPUs) and the associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on protein database search using the intertask parallelization technique, employing the GPU only to perform the SW computations one by one. Hence, in this paper, we propose an efficient SW alignment method, called CUDA-SWfr, for protein database search using the intratask parallelization technique based on a CPU-GPU collaborative system. Before the SW computations are done on the GPU, a procedure is applied on the CPU using the frequency distance filtration scheme (FDFS) to eliminate unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
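The recurrence that the intratask approach parallelizes (anti-diagonal by anti-diagonal) is the standard SW local-alignment score, shown here with a linear gap penalty g for brevity (the affine-gap form adds two auxiliary matrices):

```latex
H_{i,j} \;=\; \max\Bigl\{\,0,\;
H_{i-1,j-1} + s(a_i,b_j),\;
H_{i-1,j} - g,\;
H_{i,j-1} - g \Bigr\},
```

where s(a_i, b_j) is the substitution score. Cells on the same anti-diagonal depend only on earlier anti-diagonals, so all of them can be computed by GPU threads in parallel within a single alignment.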
Strong scaling of general-purpose molecular dynamics simulations on GPUs
NASA Astrophysics Data System (ADS)
Glaser, Jens; Nguyen, Trung Dac; Anderson, Joshua A.; Lui, Pak; Spiga, Filippo; Millan, Jaime A.; Morse, David C.; Glotzer, Sharon C.
2015-07-01
We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, 2013). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We are able to demonstrate equivalent or superior scaling on up to 3375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5 ×.
NASA Astrophysics Data System (ADS)
Nakatsuji, Noriaki; Matsushima, Kyoji
2017-03-01
Full-parallax high-definition CGHs composed of more than a billion pixels have so far been created only with the polygon-based method, because of its high performance. However, GPUs now allow us to generate CGHs much faster with the point-cloud method. In this paper, we measure the computation time of object fields for full-parallax high-definition CGHs, composed of 4 billion pixels and reconstructing the same scene, using the point-cloud method on a GPU and the polygon-based method on a CPU. In addition, we compare the optical and simulated reconstructions of the CGHs created by these techniques to verify the image quality.
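For reference, the point-cloud object field is an O(pixels × points) sum of spherical waves, which is why it maps so directly onto a GPU: one thread per hologram pixel accumulates a·exp(ikr)/r over all object points. A hedged CUDA sketch under assumed sampling parameters (pixel pitch, wavenumber k); all names are illustrative.

    #include <cuda_runtime.h>

    struct Point { float x, y, z, amp; };  // object point and its amplitude

    __global__ void objectField(const Point* pts, int nPts, float2* field,
                                int width, int height, float pitch, float k)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;
        float x = (px - width / 2) * pitch;
        float y = (py - height / 2) * pitch;
        float re = 0.f, im = 0.f;
        for (int p = 0; p < nPts; ++p) {
            float dx = x - pts[p].x, dy = y - pts[p].y, dz = pts[p].z;
            float r = sqrtf(dx * dx + dy * dy + dz * dz);
            float sn, cs;
            sincosf(k * r, &sn, &cs);          // exp(ikr) = cos + i sin
            re += pts[p].amp * cs / r;         // spherical wave amp/r * exp(ikr)
            im += pts[p].amp * sn / r;
        }
        field[py * width + px] = make_float2(re, im);
    }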
A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions
NASA Astrophysics Data System (ADS)
Exl, Lukas
2017-12-01
An efficient solver for the three-dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show its efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
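To see why an FFT-dominated solver is such a good fit for the GPU, consider the periodic Poisson core (forward FFT of the density, division of each mode by |k|^2, inverse FFT), sketched below with cuFFT. The free-space solver of the paper additionally requires the error-controlled finite-Fourier-series treatment, which is not shown; grid size, unit box, and normalization are assumptions.

    #include <cuda_runtime.h>
    #include <cufft.h>

    __global__ void divideByK2(cufftComplex* rhoHat, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n * n * n) return;
        int i = idx % n, j = (idx / n) % n, k = idx / (n * n);
        const float twoPi = 6.2831853f;
        float kx = twoPi * (i <= n / 2 ? i : i - n);  // signed FFT wavenumbers
        float ky = twoPi * (j <= n / 2 ? j : j - n);
        float kz = twoPi * (k <= n / 2 ? k : k - n);
        float k2 = kx * kx + ky * ky + kz * kz;
        // drop the zero mode; the 1/n^3 also undoes cuFFT's unnormalized FFTs
        float scale = (k2 > 0.f) ? 1.f / (k2 * n * n * n) : 0.f;
        rhoHat[idx].x *= scale;
        rhoHat[idx].y *= scale;
    }

    void solvePeriodicPoisson(cufftComplex* d_rho, int n)  // in place, on device
    {
        cufftHandle plan;
        cufftPlan3d(&plan, n, n, n, CUFFT_C2C);
        cufftExecC2C(plan, d_rho, d_rho, CUFFT_FORWARD);
        int total = n * n * n;
        divideByK2<<<(total + 255) / 256, 256>>>(d_rho, n);
        cufftExecC2C(plan, d_rho, d_rho, CUFFT_INVERSE);
        cufftDestroy(plan);
    }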
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.
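A hedged sketch of the core DRR operation: one thread per detector pixel casts a ray from a point source through the CT volume and accumulates attenuation, here with cheap nearest-neighbour sampling (the paper studies such approximations against registration accuracy). The geometry, a source on the z-axis and a volume centred at the origin, and all names are illustrative assumptions.

    __global__ void drr(const float* vol, int nx, int ny, int nz, float voxel,
                        float srcZ, float detZ, float pitch,
                        float* image, int w, int h, int nSteps)
    {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        int v = blockIdx.y * blockDim.y + threadIdx.y;
        if (u >= w || v >= h) return;
        // ray from the source (0, 0, srcZ) to detector pixel (u, v) at z = detZ
        float dx = (u - w / 2) * pitch, dy = (v - h / 2) * pitch, dz = detZ - srcZ;
        float sum = 0.f, dt = 1.f / nSteps;
        for (int s = 0; s < nSteps; ++s) {
            float t = s * dt;
            int i = (int)((dx * t) / voxel) + nx / 2;       // nearest voxel
            int j = (int)((dy * t) / voxel) + ny / 2;
            int k = (int)((srcZ + dz * t) / voxel) + nz / 2;
            if (i >= 0 && i < nx && j >= 0 && j < ny && k >= 0 && k < nz)
                sum += vol[(k * ny + j) * nx + i];
        }
        // Beer-Lambert line integral, up to a constant path-length factor
        image[v * w + u] = expf(-sum * dt);
    }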
Cho, Nathan; Tsiamas, Panagiotis; Velarde, Esteban; Tryggestad, Erik; Jacques, Robert; Berbeco, Ross; McNutt, Todd; Kazanzides, Peter; Wong, John
2018-05-01
The Small Animal Radiation Research Platform (SARRP) has been developed for conformal microirradiation with on-board cone beam CT (CBCT) guidance. The graphics processing unit (GPU)-accelerated Superposition-Convolution (SC) method for dose computation has been integrated into the treatment planning system (TPS) for SARRP. This paper describes the validation of the SC method for the kilovoltage energy by comparing with EBT2 film measurements and Monte Carlo (MC) simulations. MC data were simulated by EGSnrc code with 3 × 10^8–1.5 × 10^9 histories, while 21 photon energy bins were used to model the 220 kVp x-rays in the SC method. Various types of phantoms including plastic water, cork, graphite, and aluminum were used to encompass the range of densities of mouse organs. For the comparison, percentage depth dose (PDD) of SC, MC, and film measurements were analyzed. Cross beam (x,y) dosimetric profiles of SC and film measurements are also presented. Correction factors (CFz) to convert SC to MC dose-to-medium are derived from the SC and MC simulations in homogeneous phantoms of aluminum and graphite to improve the estimation. The SC method produces dose values that are within 5% of film measurements and MC simulations in the flat regions of the profile. The dose is less accurate at the edges, due to factors such as geometric uncertainties of film placement and difference in dose calculation grids. The GPU-accelerated Superposition-Convolution dose computation method was successfully validated with EBT2 film measurements and MC calculations. The SC method offers much faster computation speed than MC and provides calculations of both dose-to-water in medium and dose-to-medium in medium. © 2018 American Association of Physicists in Medicine.
Kim, Byungyeon; Park, Byungjun; Lee, Seungrag; Won, Youngjae
2016-01-01
We demonstrated GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method. Our algorithm was verified for various fluorescence lifetimes and photon numbers. The GPU processing time was faster than the physical scanning time for images up to 800 × 800, and more than 149 times faster than a single core CPU. The frame rate of our system was demonstrated to be 13 fps for a 200 × 200 pixel image when observing maize vascular tissue. This system can be utilized for observing dynamic biological reactions, medical diagnosis, and real-time industrial inspection. PMID:28018724
Multi-GPU implementation of a VMAT treatment plan optimization algorithm.
Tian, Zhen; Peng, Fei; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
2015-06-01
Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, the GPU's relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of the PP and MP problems are implemented on a CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai-Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is then used to validate the authors' method. The authors also compare their multi-GPU implementation with three different single-GPU implementation strategies, i.e., truncating the DDC matrix (S1), repeatedly transferring the DDC matrix between CPU and GPU (S2), and porting computations involving the DDC matrix to the CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases are used to demonstrate the advantages of the authors' method. The authors' multi-GPU implementation can finish the optimization process within ∼1 min for the H&N patient case. S1 leads to an inferior plan quality although its total time was 10 s shorter than the multi-GPU implementation due to the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ∼4 and ∼6 min, respectively. High computational efficiency was consistently achieved for the other five patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23-46 s. Conversely, to obtain clinically comparable or acceptable plans for all six of these VMAT cases that the authors have tested in this paper, the optimization time needed in a commercial TPS system on CPU was found to be on the order of several minutes. The results demonstrate that the multi-GPU implementation of the authors' column-generation-based VMAT optimization can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality.
The authors' study may serve as an example to shed some light on other large-scale medical physics problems that require multi-GPU techniques.
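A hedged host-side sketch of the multi-GPU layout described above: peer-to-peer access is enabled among the devices, each device holds one angular slice of the DDC matrix, and the partial beamlet prices are gathered onto device 0 with cudaMemcpyPeer. The CUDA API calls are the real ones; the four-way split and the surrounding names are simplified from the abstract.

    #include <cuda_runtime.h>

    void enablePeerAccess(int nGpus)
    {
        for (int d = 0; d < nGpus; ++d) {
            cudaSetDevice(d);
            for (int peer = 0; peer < nGpus; ++peer) {
                int ok = 0;
                if (peer != d &&
                    cudaDeviceCanAccessPeer(&ok, d, peer) == cudaSuccess && ok)
                    cudaDeviceEnablePeerAccess(peer, 0);  // direct GPU-GPU copies
            }
        }
    }

    // After each device computes prices for its slice of beam angles,
    // device 0 gathers the partial results without a CPU round trip.
    void gatherPrices(float* dst0, float* src[], const int count[],
                      const int offset[], int nGpus)
    {
        for (int d = 1; d < nGpus; ++d)
            cudaMemcpyPeer(dst0 + offset[d], 0, src[d], d,
                           count[d] * sizeof(float));
    }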
Mládek, Arnošt; Sponer, Judit E; Kulhánek, Petr; Lu, Xiang-Jun; Olson, Wilma K; Sponer, Jiří
2012-01-10
Folded RNA molecules are shaped by an astonishing variety of highly conserved noncanonical molecular interactions and backbone topologies. The dinucleotide platform is a widespread recurrent RNA modular building submotif formed by the side-by-side pairing of bases from two consecutive nucleotides within a single strand, with highly specific sequence preferences. This unique arrangement of bases is cemented by an intricate network of noncanonical hydrogen bonds and facilitated by a distinctive backbone topology. The present study investigates the gas-phase intrinsic stabilities of the three most common RNA dinucleotide platforms - 5'-GpU-3', ApA, and UpC - via state-of-the-art quantum-chemical (QM) techniques. The mean stability of base-base interactions decreases with sequence in the order GpU > ApA > UpC. Bader's atoms-in-molecules analysis reveals that the N2(G)…O4(U) hydrogen bond of the GpU platform is stronger than the corresponding hydrogen bonds in the other two platforms. The mixed-pucker sugar-phosphate backbone conformation found in most GpU platforms, in which the 5'-ribose sugar (G) is in the C2'-endo form and the 3'-sugar (U) in the C3'-endo form, is intrinsically more stable than the standard A-RNA backbone arrangement, partially as a result of a favorable O2'…O2P intra-platform interaction. Our results thus validate the hypothesis of Lu et al. (Lu Xiang-Jun, et al. Nucleic Acids Res. 2010, 38, 4868-4876), that the superior stability of GpU platforms is partially mediated by the strong O2'…O2P hydrogen bond. In contrast, ApA and especially UpC platform-compatible backbone conformations are rather diverse and do not display any characteristic structural features. The average stabilities of ApA and UpC derived backbone conformers are also lower than those of GpU platforms. Thus, the observed structural and evolutionary patterns of the dinucleotide platforms can be accounted for, to a large extent, by their intrinsic properties as described by modern QM calculations. In contrast, we show that the dinucleotide platform is not properly described in the course of atomistic explicit-solvent simulations. Our work also gives methodological insights into QM calculations of experimental RNA backbone geometries. Such calculations are inherently complicated by rather large data and refinement uncertainties in the available RNA experimental structures, which often preclude reliable energy computations.
High-Throughput Characterization of Porous Materials Using Graphics Processing Units
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Jihan; Martin, Richard L.; Rübel, Oliver
We have developed a high-throughput graphics processing units (GPU) code that can characterize a large database of crystalline porous materials. In our algorithm, the GPU is utilized to accelerate energy grid calculations where the grid values represent interactions (i.e., Lennard-Jones + Coulomb potentials) between gas molecules (i.e., CH4 and CO2) and material's framework atoms. Using a parallel flood fill CPU algorithm, inaccessible regions inside the framework structures are identified and blocked based on their energy profiles. Finally, we compute the Henry coefficients and heats of adsorption through statistical Widom insertion Monte Carlo moves in the domain restricted to the accessible space. The code offers significant speedup over a single core CPU code and allows us to characterize a set of porous materials at least an order of magnitude larger than ones considered in earlier studies. For structures selected from such a prescreening algorithm, full adsorption isotherms can be calculated by conducting multiple grand canonical Monte Carlo simulations concurrently within the GPU.
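The energy-grid step is an embarrassingly parallel loop: one thread per grid point sums Lennard-Jones and Coulomb terms over all framework atoms. A hedged CUDA sketch, omitting cutoffs and periodic images for brevity; the Coulomb prefactor assumes kcal/mol, Angstrom, and elementary-charge units, and all names are illustrative.

    struct Atom { float x, y, z, eps, sig, q; };

    __global__ void energyGrid(const Atom* atoms, int nAtoms, float* grid,
                               int n, float h, float qProbe)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n * n * n) return;
        float gx = (idx % n) * h;
        float gy = ((idx / n) % n) * h;
        float gz = (idx / (n * n)) * h;
        float e = 0.f;
        for (int a = 0; a < nAtoms; ++a) {
            float dx = gx - atoms[a].x, dy = gy - atoms[a].y, dz = gz - atoms[a].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            float s6 = powf(atoms[a].sig * atoms[a].sig / r2, 3.f);
            e += 4.f * atoms[a].eps * (s6 * s6 - s6)            // Lennard-Jones
               + 332.0637f * qProbe * atoms[a].q * rsqrtf(r2);  // Coulomb
        }
        grid[idx] = e;  // later thresholded to block inaccessible regions
    }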
A review of GPU-based medical image reconstruction.
Després, Philippe; Jia, Xun
2017-10-01
Tomographic image reconstruction is a computationally demanding task, even more so when advanced models are used to describe a more complete and accurate picture of the image formation process. Such advanced modeling and reconstruction algorithms can lead to better images, often with less dose, but at the price of long calculation times that are hardly compatible with clinical workflows. Fortunately, reconstruction tasks can often be executed advantageously on Graphics Processing Units (GPUs), which are exploited as massively parallel computational engines. This review paper focuses on recent developments made in GPU-based medical image reconstruction, from a CT, PET, SPECT, MRI and US perspective. Strategies and approaches to get the most out of GPUs in image reconstruction are presented as well as innovative applications arising from an increased computing capacity. The future of GPU-based image reconstruction is also envisioned, based on current trends in high-performance computing. Copyright © 2017 Associazione Italiana di Fisica Medica. Published by Elsevier Ltd. All rights reserved.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results show that the proposed method can improve the performance of BLASTP over a single GPU, and that it can also achieve high availability and fault tolerance. PMID:24955362
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
Raimondo, Federico; Kamienkowski, Juan E.; Sigman, Mariano; Fernandez Slezak, Diego
2012-01-01
In recent years, Independent Component Analysis (ICA) has become a standard to identify relevant dimensions of data in neuroscience. ICA is a very reliable method to analyze data, but it is computationally very costly, which makes its use for online analysis, as in brain-computer interfaces, almost prohibitive. We show a speedup of ICA by about 25-fold at almost no cost (a fast video card). EEG data, which consist of many repetitions of independent signals in multiple channels, are very suitable for processing with the vector processors included in graphics units. We profiled the implementation of this algorithm and detected the two main types of operations responsible for the processing bottleneck, taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by the GPU manufacturer does not increase performance, due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase of performance for the ICA calculation. PMID:22811699
GPU computing in medical physics: a review.
Pratx, Guillem; Xing, Lei
2011-05-01
The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing.
Clinical applications of advanced rotational radiation therapy
NASA Astrophysics Data System (ADS)
Nalichowski, Adrian
Purpose: With the fast adoption of emerging technologies, it is critical to fully test and understand their limits and capabilities. In this work we investigate a new graphics processing unit (GPU) based treatment planning algorithm and its applications in helical tomotherapy dose delivery. We explore the limits of the system by applying it to the challenging clinical cases of total marrow irradiation (TMI) and stereotactic radiosurgery (SRS). We also analyze the feasibility of alternative fractionation schemes for total body irradiation (TBI) and TMI based on reported historical data on lung dose and interstitial pneumonitis (IP) incidence rates. Methods and Materials: An anthropomorphic phantom was used to create TMI plans using the new GPU based treatment planning system and the existing CPU cluster based system. Optimization parameters were selected based on clinically used values for field width, modulation factor and pitch. Treatment plans were also created on the Eclipse treatment planning system (Varian Medical Systems Inc, Palo Alto, CA) using volumetric modulated arc therapy (VMAT) for dose delivery on an iX treatment unit. A retrospective review was performed of 42 publications that reported IP rates along with lung dose, fractionation regimen, dose rate and chemotherapy. The analysis comprised nearly thirty-two hundred patients and 34 unique radiation regimens. Multivariate logistic regression was performed to determine parameters associated with IP and establish a dose response function. Results: The results showed very good dosimetric agreement between the GPU and CPU calculated plans. The results from the SBRT study show that the GPU planning system can maintain 90% target coverage while meeting all the constraints of the RTOG 0631 protocol. Beam-on time for Tomotherapy and flattening filter free RapidArc was much shorter than for Vero or Cyberknife. The retrospective data analysis showed that lung dose and Cyclophosphamide (Cy) are both predictors of IP in TBI/TMI treatments. The dose rate was not found to be an independent risk factor for IP. The model failed to establish an accurate dose response function, but the discrete data indicated a radiation dose threshold of 7.6 Gy (EQD2_repair) and 120 mg/kg of Cy below which no IP cases were reported. Conclusion: The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation. The new system was able to achieve more uniform dose distribution throughout the target volume and steeper dose fall-off, resulting in superior OAR sparing when compared to the Eclipse treatment planning system for VMAT delivery. The machine optimization parameters tested for TMI cases provide a comprehensive overview of the capabilities of the treatment planning station and associated helical delivery system. The new system also proved to be dosimetrically compatible with other leading modalities for treatments of small and complicated target volumes and was even superior when treatment delivery times were compared. These findings demonstrate that the advanced treatment planning and delivery system from TomoTherapy is well suited for treatments of complicated cases such as TMI and SRS and is often dosimetrically and/or logistically superior to other modalities. The new planning system can easily meet the threshold lung dose constraint established in this study.
The results presented here on the capabilities of Tomotherapy and on the identified lung dose threshold provide an opportunity to explore alternative fractionation schemes without sacrificing target coverage or increasing lung toxicity. (Abstract shortened by ProQuest.)
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
Fales, B Scott; Levine, Benjamin G
2015-10-13
Methods based on a full configuration interaction (FCI) expansion in an active space of orbitals are widely used for modeling chemical phenomena such as bond breaking, multiply excited states, and conical intersections in small-to-medium-sized molecules, but these phenomena occur in systems of all sizes. To scale such calculations up to the nanoscale, we have developed an implementation of FCI in which electron repulsion integral transformation and several of the more expensive steps in σ vector formation are performed on graphical processing unit (GPU) hardware. When applied to a 1.7 × 1.4 × 1.4 nm silicon nanoparticle (Si72H64) described with the polarized, all-electron 6-31G** basis set, our implementation can solve for the ground state of the 16-active-electron/16-active-orbital CASCI Hamiltonian (more than 100,000,000 configurations) in 39 min on a single NVidia K40 GPU.
Eslami, Taban; Saeed, Fahad
2018-04-20
Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying the brain's functional activities in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient. Pearson's correlation is widely used for constructing functional networks and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming, especially when a large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Based on the symmetric property of Pearson's correlation, this approach returns the N(N − 1)/2 correlation coefficients located in the strictly upper triangular part of the correlation matrix. Storing the correlations in a one-dimensional array in the order proposed in this paper is useful for further processing. Our experiments on real and synthetic fMRI data for different numbers of voxels and varying lengths of time series show that the proposed approach outperformed state-of-the-art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
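The key observation is that, for standardized time series (zero mean, unit norm), each Pearson coefficient reduces to a dot product, and the strictly upper triangle can be flattened so that thread k computes pair k directly. A sketch under those assumptions follows; the closed-form index inversion is a standard identity, and this is not the Fast-GPU-PCC code itself.

    #include <math.h>

    // Map linear pair index k to (i, j), i < j, in row-major upper-triangle order.
    __device__ void unrank(long long k, int n, int* i, int* j)
    {
        int row = (int)(n - 2 - floor(sqrt(-8.0 * k + 4.0 * (double)n * (n - 1) - 7)
                                      / 2.0 - 0.5));
        *i = row;
        *j = (int)(k + row + 1 - (long long)row * n
                   + (long long)row * (row + 1) / 2);
    }

    __global__ void pcc(const float* ts /* n x T, standardized rows */, int n,
                        int T, float* out, long long nPairs)
    {
        long long k = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (k >= nPairs) return;
        int i, j;
        unrank(k, n, &i, &j);
        float dot = 0.f;
        for (int t = 0; t < T; ++t)
            dot += ts[i * T + t] * ts[j * T + t];
        out[k] = dot;  // equals Pearson r for standardized series
    }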
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tian, Z; Folkerts, M; Jiang, S
Purpose: We have previously developed a GPU-OpenCL-based MC dose engine named goMC with a built-in analytical linac beam model. To move goMC towards routine clinical use, we have developed an automatic beam-commissioning method and an efficient source sampling strategy to facilitate dose calculations for real treatment plans. Methods: Our commissioning method automatically adjusts the relative weights among the sub-sources, through an optimization process minimizing the discrepancies between calculated dose and measurements. Six models built for Varian Truebeam linac photon beams (6MV, 10MV, 15MV, 18MV, 6MVFFF, 10MVFFF) were commissioned using measurement data acquired at our institution. To facilitate dose calculations for real treatment plans, we employed an inverse sampling method to efficiently incorporate MLC leaf-sequencing into source sampling. Specifically, instead of sampling source particles control-point by control-point and rejecting the particles blocked by the MLC, we assigned a control-point index to each sampled source particle, according to the MLC leaf-open duration of each control-point at the pixel where the particle intersects the iso-center plane. Results: Our auto-commissioning method decreased the distance-to-agreement (DTA) of depth dose at build-up regions by 36.2% on average, making it within 1mm. Lateral profiles were better matched for all beams, with the biggest improvement found at 15MV, for which the root-mean-square difference was reduced from 1.44% to 0.50%. Maximum differences of output factors were reduced to less than 0.7% for all beams, with the largest decrease being from 1.70% to 0.37%, found at 10FFF. Our new sampling strategy was tested on a Head&Neck VMAT patient case. Achieving clinically acceptable accuracy, the new strategy could reduce the required history number by a factor of ∼2.8 at a given statistical uncertainty level and hence achieve a similar speed-up factor. Conclusion: Our studies have demonstrated the feasibility and effectiveness of our auto-commissioning approach and new efficient source sampling strategy, implying the potential of our GPU-based MC dose engine goMC for routine clinical use.
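A hedged sketch of that inverse-sampling step: for the fluence-map pixel a source particle crosses, its control-point index is drawn from the cumulative distribution of MLC leaf-open durations, so no particles are wasted on rejection. The per-pixel cdf layout (normalized so the last entry is 1) and the names are assumptions.

    // Binary-search the per-pixel cumulative table for the first entry >= u,
    // with u uniform in (0, 1]; control point c is returned with probability
    // proportional to its leaf-open duration at this pixel.
    __device__ int sampleControlPoint(const float* cdf, int nCtrl, float u)
    {
        int lo = 0, hi = nCtrl - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;
    }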
ODYSSEY: A PUBLIC GPU-BASED CODE FOR GENERAL RELATIVISTIC RADIATIVE TRANSFER IN KERR SPACETIME
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pu, Hung-Yi; Yun, Kiyun; Yoon, Suk-Jin
General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge–Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey-Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Su, Lin; Du, Xining; Liu, Tianyu
Purpose: Using graphics processing unit (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER-RT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER-RT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER-RT and the general-purpose code GEANT4. A gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER-RT agree well with DOSXYZnrc. For clinical cases, results from ARCHER-RT are compared with those from GEANT4 and good agreement is observed. The gamma index test is performed for voxels whose dose is greater than 10% of the maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to the specific architecture of the GPU, the modified Woodcock tracking algorithm performed inferior to the original one. ARCHER-RT achieves a fast speed for PSF-based dose calculations. With a single M2090 card, the simulations cost about 60, 50, and 80 s for the three cases, respectively, with 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7-1.8 times faster. More impressively, six M2090 cards could finish the simulations in 8.9-13.4 s. For comparison, the same simulations on an Intel E5-2620 (12 hyper-threads) cost about 500-800 s. Conclusions: ARCHER-RT was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms.
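For context, Woodcock (delta) tracking, the algorithm whose two variants the abstract compares, samples free paths with a single majorant cross-section valid everywhere in the phantom and accepts a collision with probability sigma(local)/sigmaMax, treating rejections as virtual collisions. A hedged device-side sketch; the voxel lookup and units are illustrative, while the curand calls are the real API.

    #include <curand_kernel.h>

    __device__ bool woodcockTrack(curandState* rng, float3* pos, float3 dir,
                                  float sigmaMax, const float* sigma,
                                  int nx, int ny, int nz, float voxel)
    {
        while (true) {
            float step = -logf(curand_uniform(rng)) / sigmaMax;  // majorant path
            pos->x += dir.x * step;
            pos->y += dir.y * step;
            pos->z += dir.z * step;
            int i = (int)(pos->x / voxel);
            int j = (int)(pos->y / voxel);
            int k = (int)(pos->z / voxel);
            if (i < 0 || i >= nx || j < 0 || j >= ny || k < 0 || k >= nz)
                return false;                        // escaped the phantom
            if (curand_uniform(rng) * sigmaMax <= sigma[(k * ny + j) * nx + i])
                return true;                         // real collision at *pos
            // virtual collision: resample another free path and continue
        }
    }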
High performance GPU processing for inversion using uniform grid searches
NASA Astrophysics Data System (ADS)
Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios
2017-04-01
Many geophysical problems are described by redundant, highly non-linear systems of equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems is based on numerical optimization methods, using Monte Carlo sampling or exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy the observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time consuming on common CPU-based computers. An alternative is to use a computing platform based on a GPU, which nowadays is affordable to the research community and provides much higher computing performance. Using the CUDA programming language to implement TOPINV allows the investigation of the attained speedup in execution time on such a high-performance platform. Based on synthetic data we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms. The same problems for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables were solved on both platforms, and execution time as a function of the grid dimension for each problem was recorded. Results indicate an average speedup in calculations by a factor of 100 on the GPU platform; for example, problems with 10^12 grid-points require less than two hours instead of several days on conventional desktop computers. Such a speedup encourages the application of TOPINV on high-performance platforms, such as GPUs, in cases where nearly real-time decisions are necessary, for example finite fault modeling to identify possible tsunami sources.
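The inequality scan maps naturally onto one GPU thread per gridpoint: each thread evaluates all observation inequalities |f_i(m) - d_i| <= k*s_i for its candidate model m and flags the point if every one holds. A hedged CUDA sketch; forwardModel is a hypothetical stand-in for the problem-specific magma-source or fault model.

    // Stand-in forward computation: replace with the real geophysical model.
    __device__ float forwardModel(int obs, const float* m, int nVars)
    {
        float v = 0.f;
        for (int p = 0; p < nVars; ++p) v += (obs + 1) * m[p];
        return v;
    }

    __global__ void topinvScan(const float* grid /* nPts x nVars */,
                               long long nPts, int nVars,
                               const float* d, const float* s, int nObs,
                               float k, unsigned char* inside)
    {
        long long p = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (p >= nPts) return;
        const float* m = grid + p * nVars;
        bool ok = true;
        for (int i = 0; i < nObs && ok; ++i)
            ok = fabsf(forwardModel(i, m, nVars) - d[i]) <= k * s[i];
        inside[p] = ok ? 1 : 0;   // candidate cluster member for this k
    }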
Protein alignment algorithms with an efficient backtracking routine on multiple GPUs.
Blazewicz, Jacek; Frohmberg, Wojciech; Kierzynka, Michal; Pesch, Erwin; Wojciechowski, Pawel
2011-05-20
Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the nearest future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and computing only the alignment score whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch, and Smith-Waterman algorithms with a backtracking procedure which is needed to construct the alignment. In this paper we present the solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU and GPU-based solutions. Moreover, multiple GPUs support with load balancing makes the application very scalable. The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be almost linearly increased when using more than one graphics card.
NASA Astrophysics Data System (ADS)
Ammazzalorso, F.; Bednarz, T.; Jelen, U.
2014-03-01
We demonstrate acceleration on graphics processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement GPU calculation of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation, and the numerical results of the two implementations agreed. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust when recomputed in the presence of various simulated positioning errors. From the point of view of performance, the average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g., constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
Kasahara, Kota; Ma, Benson; Goto, Kota; Dasgupta, Bhaskar; Higo, Junichi; Fukuda, Ikuo; Mashimo, Tadaaki; Akiyama, Yutaka; Nakamura, Haruki
2016-01-01
Molecular dynamics (MD) is a promising computational approach to investigate the dynamical behavior of molecular systems at the atomic level. Here, we present a new MD simulation engine named "myPresto/omegagene" that is tailored for enhanced conformational sampling methods with a non-Ewald electrostatic potential scheme. Our enhanced conformational sampling methods, e.g., the virtual-system-coupled multi-canonical MD (V-McMD) method, replace a multi-process parallelized run with multiple independent runs to avoid inter-node communication overhead. In addition, adopting the non-Ewald-based zero-multipole summation method (ZMM) makes it possible to eliminate the Fourier-space calculations altogether. The combination of these state-of-the-art techniques realizes efficient and accurate calculations of the conformational ensemble at an equilibrium state. Taking advantage of these features, myPresto/omegagene is specialized for single-process execution on a graphics processing unit (GPU). We performed benchmark simulations for the 20-mer peptide, Trp-cage, with explicit solvent. One of the most thermodynamically stable conformations generated by the V-McMD simulation is very similar to an experimentally solved native conformation. Furthermore, the computation speed is four times faster than that of our previous simulation engine, myPresto/psygene-G. The new simulator, myPresto/omegagene, is freely available at the following URLs: http://www.protein.osaka-u.ac.jp/rcsfp/pi/omegagene/ and http://presto.protein.osaka-u.ac.jp/myPresto4/.
Accelerating NBODY6 with graphics processing units
NASA Astrophysics Data System (ADS)
Nitadori, Keigo; Aarseth, Sverre J.
2012-07-01
We describe the use of graphics processing units (GPUs) for speeding up the code NBODY6 which is widely used for direct N-body simulations. Over the years, the N^2 nature of the direct force calculation has proved a barrier for extending the particle number. Following an early introduction of force polynomials and individual time steps, the calculation cost was first reduced by the introduction of a neighbour scheme. After a decade of GRAPE computers which speeded up the force calculation further, we are now in the era of GPUs where relatively small hardware systems are highly cost effective. A significant gain in efficiency is achieved by employing the GPU to obtain the so-called regular force which typically involves some 99 per cent of the particles, while the remaining local forces are evaluated on the host. However, the latter operation is performed up to 20 times more frequently and may still account for a significant cost. This effort is reduced by parallel SSE/AVX procedures where each interaction term is calculated using mainly single precision. We also discuss further strategies connected with coordinate and velocity prediction required by the integration scheme. This leaves hard binaries and multiple close encounters which are treated by several regularization methods. The present NBODY6-GPU code is well balanced for simulations in the particle range 10^4-2 × 10^5 for a dual-GPU system attached to a standard PC.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g., Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables compression and decompression of data in parallel. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU-to-GPU data transfers are reduced without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field-programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved threefold, as compared with transferring the original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement to transfer data from the hardware front end to the software system but also reduces the transmission time for CPU-to-GPU data transfer. © The Author(s) 2014.
SU-C-BRC-07: Parametrized GPU Accelerated Electron Monte Carlo Second Check
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haywood, J
Purpose: I am presenting a parameterized 3D GPU accelerated electron Monte Carlo second check program. Method: I wrote the 3D grid dose calculation algorithm in CUDA and utilized an NVIDIA GeForce GTX 780 Ti to run all of the calculations. The electron path beyond the distal end of the cone is governed by four parameters: the amplitude of scattering (AMP), the mean and width of a Gaussian energy distribution (E and α), and the percentage of photons. In my code, I adjusted all parameters until the calculated PDD and profile fit the measured 10×10 open beam data within 1%/1mm. I then wrote a user interface for reading the DICOM treatment plan and images in Python. In order to verify the algorithm, I calculated 3D dose distributions on a variety of phantoms and geometries, and compared them with the Eclipse eMC calculations. I also calculated several patient specific dose distributions, including a nose and an ear. Finally, I compared my algorithm's computation times to Eclipse's. Results: The calculated MU for all of the investigated geometries agree with the TPS within the TG-114 action level of 5%. The MU for the nose was < 0.5% different while the MU for the ear at 105 SSD was ∼2%. Calculation times for a 12MeV 10×10 open beam ranged from 1 second for a 2.5 mm grid resolution with ∼15 million particles to 33 seconds on a 1 mm grid with ∼460 million particles. Eclipse calculation runtimes distributed over 10 FAS workers were 9 seconds to 15 minutes respectively. Conclusion: The GPU accelerated second check allows quick MU verification while accounting for patient specific geometry and heterogeneity.
Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan
2017-08-04
This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) Programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamic coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms significantly increased by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell on average with TLPOM and the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Multi-scale calculation based on dual domain material point method combined with molecular dynamics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dhakal, Tilak Raj
This dissertation combines the dual domain material point method (DDMP) with molecular dynamics (MD) in an attempt to create a multi-scale numerical method to simulate materials undergoing large deformations with high strain rates. In these types of problems, the material is often in a thermodynamically non-equilibrium state, and conventional constitutive relations are often not available. In this method, the closure quantities, such as stress, at each material point are calculated from a MD simulation of a group of atoms surrounding the material point. Rather than restricting the multi-scale simulation to a small spatial region, such as phase interfaces or crack tips, this multi-scale method can be used to consider non-equilibrium thermodynamic effects in a macroscopic domain. This method takes advantage of the fact that the material points only communicate with mesh nodes, not among themselves; therefore MD simulations for material points can be performed independently in parallel. First, using a one-dimensional shock problem as an example, the numerical properties of the original material point method (MPM), the generalized interpolation material point (GIMP) method, the convected particle domain interpolation (CPDI) method, and the DDMP method are investigated. Among these methods, only the DDMP method converges as the number of particles increases, but the large number of particles needed for convergence makes the method very expensive, especially in our multi-scale method where we calculate stress in each material point using MD simulation. To improve DDMP, the sub-point method is introduced in this dissertation, which provides high quality numerical solutions with a very small number of particles. The multi-scale method based on DDMP with sub-points is successfully implemented for a one-dimensional problem of shock wave propagation in a cerium crystal. The MD simulation to calculate stress in each material point is performed on GPU using CUDA to accelerate the computation. The numerical properties of the multi-scale method are investigated, and the results from this multi-scale calculation are compared with direct MD simulation results to demonstrate the feasibility of the method. Also, the multi-scale method is applied to a two-dimensional problem of jet formation around a copper notch under a strong impact.
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit
NASA Astrophysics Data System (ADS)
Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.
2013-12-01
Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI, in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a single function for calculating the Voigt in-band transmittance, and subsequently for integration into the re-architected MODTRAN6 code. Our overall objective is that by combining the GPU processing with more efficient Plume Tracker retrieval algorithms, a 100-fold increase in computational speed will be realized. Since Plume Tracker runs on Windows-based platforms, the GPU-enhanced MODTRAN6 will be packaged as a DLL. We do, however, anticipate that the accelerated option will be made available to the general MODTRAN community through an application programming interface (API).
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate these analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. In evaluation tests with metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
GPU-Based Real-Time Volumetric Ultrasound Image Reconstruction for a Ring Array
Choe, Jung Woo; Nikoozadeh, Amin; Oralkan, Ömer; Khuri-Yakub, Butrus T.
2014-01-01
Synthetic phased array (SPA) beamforming with Hadamard coding and aperture weighting is an optimal option for real-time volumetric imaging with a ring array, a particularly attractive geometry in intracardiac and intravascular applications. However, the imaging frame rate of this method is limited by the immense computational load required in synthetic beamforming. For fast imaging with a ring array, we developed graphics processing unit (GPU)-based, real-time image reconstruction software that exploits massive data-level parallelism in beamforming operations. The GPU-based software reconstructs and displays three cross-sectional images at 45 frames per second (fps). This frame rate is 4.5 times higher than that for our previously-developed multi-core CPU-based software. In an alternative imaging mode, it shows one B-mode image rotating about the axis and its maximum intensity projection (MIP), processed at a rate of 104 fps. This paper describes the image reconstruction procedure on the GPU platform and presents the experimental images obtained using this software. PMID:23529080
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions
NASA Astrophysics Data System (ADS)
Liang, Yihao; Xing, Xiangjun; Li, Yaohang
2017-06-01
In this work we present an efficient implementation of canonical Monte Carlo simulation for Coulomb many-body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architecture, and adopts the sequential updating scheme of the Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup compared with the serial implementation on CPU. We further use this method to simulate primitive-model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
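The abstract gives no implementation detail; one plausible SIMD-friendly structure for the sequential-updating scheme, sketched here under assumptions (names and layout are illustrative, not the authors' code), is to propose a move for a single ion and let all threads accumulate its Coulomb energy change against every other ion in parallel.

// Minimal sketch: energy change for moving ion k to a trial position,
// evaluated in parallel over all other ions and reduced within each block.
// Launch with shared memory = blockDim.x * sizeof(float); host applies the
// Metropolis acceptance test exp(-dE/kT) after copying dE back.
__global__ void delta_energy(const float4* __restrict__ ions, // xyz = pos, w = charge
                             int n, int k, float3 trial, float* dE)
{
    extern __shared__ float partial[];
    float  qk  = ions[k].w;
    float3 old = make_float3(ions[k].x, ions[k].y, ions[k].z);
    float  acc = 0.0f;
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n;
         j += gridDim.x * blockDim.x) {
        if (j == k) continue;
        float4 p = ions[j];
        float dxn = trial.x - p.x, dyn = trial.y - p.y, dzn = trial.z - p.z;
        float dxo = old.x   - p.x, dyo = old.y   - p.y, dzo = old.z   - p.z;
        acc += qk * p.w * (rsqrtf(dxn*dxn + dyn*dyn + dzn*dzn)
                         - rsqrtf(dxo*dxo + dyo*dyo + dzo*dzo));
    }
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(dE, partial[0]); // combine block sums
}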
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
2012-01-01
In this paper we describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block-matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and full grid search (FS) for finding the optimal block displacement. In this evaluation we compared the execution time of GPU and CPU implementations for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer search grid and 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data-splitting method across the cards; with appropriate splitting, an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full-grid-search CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
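As a rough sketch of the SAD full-grid-search idea (not the authors' implementation; the block size, search radius, and the key-packing trick for the argmin are all assumptions):

// One thread block per source block; each thread evaluates one candidate
// displacement of the full search grid. Requires compute capability >= 3.5
// for atomicMin on unsigned long long.
#define BLK   16   // matching block size in pixels (assumption)
#define RANGE 8    // search radius in pixels (assumption)

__global__ void sad_full_search(const unsigned char* __restrict__ cur,
                                const unsigned char* __restrict__ ref,
                                int width, int height,
                                unsigned long long* best) // one slot per block
{
    int bx = blockIdx.x * BLK, by = blockIdx.y * BLK;
    int dx = (int)threadIdx.x - RANGE, dy = (int)threadIdx.y - RANGE;
    unsigned long long sad = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x) {
            int rx = min(max(bx + x + dx, 0), width  - 1); // clamp at borders
            int ry = min(max(by + y + dy, 0), height - 1);
            sad += (unsigned long long)abs((int)cur[(by + y) * width + bx + x]
                                         - (int)ref[ry * width + rx]);
        }
    // Pack SAD in the high bits and the displacement in the low bits, so a
    // single atomicMin on the packed key selects the minimal-SAD displacement.
    unsigned long long key = (sad << 16)
        | ((unsigned long long)(dy + RANGE) << 8)
        | (unsigned long long)(dx + RANGE);
    atomicMin(&best[blockIdx.y * gridDim.x + blockIdx.x], key);
}
// Host initializes `best` to 0xFFFFFFFFFFFFFFFF and launches with
// dim3 grid(width / BLK, height / BLK), block(2*RANGE + 1, 2*RANGE + 1).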
NASA Astrophysics Data System (ADS)
Yang, Sheng-Chun; Lu, Zhong-Yuan; Qian, Hu-Jun; Wang, Yong-Lei; Han, Jie-Ping
2017-11-01
In this work, we upgraded the electrostatic interaction method of CU-ENUF (Yang et al., 2016), which first applied CUNFFT (nonequispaced Fourier transforms based on CUDA) to the reciprocal-space electrostatic computation, so that the entire electrostatic calculation is carried out on the GPU. The upgraded edition of CU-ENUF runs in a hybrid parallel way: the computation is first parallelized across multiple computer nodes and then further across the GPU installed in each node. With this parallel strategy, the size of the simulation system is no longer restricted by the throughput of a single CPU or GPU. The most critical technical problem is how to parallelize a CUNFFT within this strategy, which is solved effectively through a careful study of the underlying principles and several algorithmic techniques. Furthermore, the upgraded method is capable of computing electrostatic interactions for both atomistic molecular dynamics (MD) and dissipative particle dynamics (DPD). Finally, the benchmarks conducted for validation and performance indicate that the upgraded method not only attains good precision with suitable parameters but also offers an efficient way to compute electrostatic interactions for huge simulation systems. Program Files doi: http://dx.doi.org/10.17632/zncf24fhpv.1 Licensing provisions: GNU General Public License 3 (GPL) Programming language: C, C++, and CUDA C Supplementary material: The program is designed for effective electrostatic interactions of large-scale simulation systems, and runs on computers equipped with NVIDIA GPUs. It has been tested on (a) a single computer node with an Intel(R) Core(TM) i7-3770 @ 3.40 GHz (CPU) and a GTX 980 Ti (GPU), and (b) MPI-parallel computer nodes with the same configuration. Nature of problem: For molecular dynamics simulation, the electrostatic interaction is the most time-consuming computation because of its long-range character and slow convergence in simulation space, and it takes up most of the total simulation time. Although the parallel method CU-ENUF (Yang et al., 2016), based on the GPU, achieved a qualitative leap over previous methods for computing electrostatic interactions, its capability is limited by the throughput of a single GPU for super-scale simulation systems. Therefore, an effective method is needed to handle the calculation of electrostatic interactions efficiently for simulation systems of super-scale size. Solution method: We constructed a hybrid parallel architecture in which CPUs and GPUs are combined to accelerate the electrostatic computation effectively. Firstly, the simulation system is divided into many subtasks via a domain-decomposition method. Then MPI (Message Passing Interface) is used to implement CPU-parallel computation, with each computer node handling a particular subtask, and each subtask is in turn executed efficiently in parallel on the node's GPU. In this hybrid parallel method, the most critical technical problem is how to parallelize a CUNFFT (nonequispaced fast Fourier transform based on CUDA), which is solved effectively through a careful study of the underlying principles and several algorithmic techniques. Restrictions: HP-ENUF is mainly oriented to super-scale system simulations, where its performance advantage is fully displayed.
However, for a small simulation system containing fewer than 10^6 particles, the multi-node mode has no apparent efficiency advantage over the single-node mode, and may even be less efficient because of network delay among the computer nodes. References: (1) S.-C. Yang, H.-J. Qian, Z.-Y. Lu, Appl. Comput. Harmon. Anal. 2016, http://dx.doi.org/10.1016/j.acha.2016.04.009. (2) S.-C. Yang, Y.-L. Wang, G.-S. Jiao, H.-J. Qian, Z.-Y. Lu, J. Comput. Chem. 37 (2016) 378. (3) S.-C. Yang, Y.-L. Zhu, H.-J. Qian, Z.-Y. Lu, Appl. Chem. Res. Chin. Univ., 2017, http://dx.doi.org/10.1007/s40242-016-6354-5. (4) Y.-L. Zhu, H. Liu, Z.-W. Li, H.-J. Qian, G. Milano, Z.-Y. Lu, J. Comput. Chem. 34 (2013) 2197.
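A minimal skeleton of the two-level pattern described above (MPI across nodes, CUDA within each node), under stated assumptions: the kernel body, particle count, and the boundary-exchange step are placeholders, not HP-ENUF code.

#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder per-particle kernel for this rank's sub-domain.
__global__ void local_forces(const float4* pos, float4* frc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frc[i] = make_float4(0.f, 0.f, 0.f, 0.f); // real force law omitted
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(0);                 // one sub-domain per rank, one GPU per node
    int n_local = 100000;             // particles in this rank's sub-domain (placeholder)
    float4 *d_pos, *d_frc;
    cudaMalloc(&d_pos, n_local * sizeof(float4));
    cudaMalloc(&d_frc, n_local * sizeof(float4));
    local_forces<<<(n_local + 255) / 256, 256>>>(d_pos, d_frc, n_local);
    cudaDeviceSynchronize();
    // Here each rank would exchange boundary data with its neighbors, e.g.
    // via MPI_Sendrecv, before the reciprocal-space (CUNFFT) stage.
    cudaFree(d_pos); cudaFree(d_frc);
    MPI_Finalize();
    return 0;
}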
NASA Astrophysics Data System (ADS)
Hou, Zhenlong; Huang, Danian
2017-09-01
In this paper, we first study the inversion of probability tomography (IPT) with gravity gradiometry data. The spatial resolution of the results is improved by multi-tensor joint inversion, a depth-weighting matrix, and other methods. To address the problems brought by big data in exploration, we present a parallel algorithm and its performance analysis, combining Compute Unified Device Architecture (CUDA) with Open Multi-Processing (OpenMP) for Graphics Processing Unit (GPU) acceleration. In tests on a synthetic model and on real data from the Vinton Dome, we obtain improved results, demonstrating that the improved inversion algorithm is effective and feasible. The performance of the parallel algorithm we designed is better than that of other CUDA-based implementations, with a maximum speedup of more than 200. In the performance analysis, multi-GPU speedup and multi-GPU efficiency are used to analyze the scalability of the multi-GPU programs. The designed parallel algorithm is demonstrated to be able to process larger-scale data, and the new analysis method is practical.
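A minimal sketch of the CUDA+OpenMP combination described above, assuming one OpenMP thread per GPU (the forward-operator kernel is a stub, not the authors' code):

#include <omp.h>
#include <cuda_runtime.h>

// Stub: each GPU computes the forward response of its share of the model cells.
__global__ void forward_gravity(const float* model, float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = model[i]; // placeholder for the real forward operator
}

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    #pragma omp parallel num_threads(ngpu)
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);             // bind this CPU thread to GPU g
        int n = 1 << 20;              // this GPU's share of the cells (assumption)
        float *d_m, *d_d;
        cudaMalloc(&d_m, n * sizeof(float));
        cudaMalloc(&d_d, n * sizeof(float));
        forward_gravity<<<(n + 255) / 256, 256>>>(d_m, d_d, n);
        cudaDeviceSynchronize();
        cudaFree(d_m); cudaFree(d_d);
    }
    return 0;
}

Binding one host thread per device is a common way to drive several GPUs concurrently; multi-GPU efficiency then measures how close the combined throughput comes to ngpu times the single-GPU rate.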
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
NASA Astrophysics Data System (ADS)
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement through GPU technology. This large number of independent cases will allow us to take full advantage of the computational power of the latest GPUs, ensuring that all thread cores in the GPU remain active, a key criterion for obtaining significant speedup. The CUDA (Compute Unified Device Architecture) Fortran compiler developed by PGI and Nvidia will allow us to construct this parallel implementation on the GPU while remaining in the Fortran language. This implementation will scale very well across various CUDA-supported GPUs such as the recently released Fermi Nvidia cards. We will present the computational speed improvements of the GPU-compatible code relative to the standard CPU-based RRTMG with respect to a very large and diverse suite of atmospheric profiles. This suite will also be utilized to demonstrate the minimal impact of the code restructuring on the accuracy of radiation calculations. The GPU-compatible version of RRTMG will be directly applicable to future versions of GEOS-5, but it is also likely to provide significant associated benefits for other GCMs that employ RRTMG.
GPURFSCREEN: a GPU based virtual screening tool using random forest classifier.
Jayaraj, P B; Ajay, Mathias K; Nufail, M; Gopakumar, G; Jaleel, U C A
2016-01-01
In-silico methods are an integral part of the modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet-lab experiments need to be performed. Virtual screening of a ligand data model requires large-scale computations, making it a highly time-consuming task. This process can be sped up by implementing parallelized algorithms on a Graphical Processing Unit (GPU). Random Forest is a robust classification algorithm that can be employed in virtual screening. A ligand-based virtual screening tool (GPURFSCREEN) that uses random forests on GPU systems is proposed and evaluated in this paper. This tool produces optimized results at a lower execution time for large bioassay data sets. The quality of results produced by our tool on the GPU is the same as that in a regular serial environment. Considering the magnitude of data to be screened, the parallelized virtual screening has a significantly lower running time at high throughput. The proposed parallel tool outperforms its serial counterpart by successfully screening billions of molecules in training and prediction phases.
TH-A-18C-09: Ultra-Fast Monte Carlo Simulation for Cone Beam CT Imaging of Brain Trauma
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sisniega, A; Zbijewski, W; Stayman, J
Purpose: Application of cone-beam CT (CBCT) to low-contrast soft tissue imaging, such as in detection of traumatic brain injury, is challenged by high levels of scatter. A fast, accurate scatter correction method based on Monte Carlo (MC) estimation is developed for application in high-quality CBCT imaging of acute brain injury. Methods: The correction involves MC scatter estimation executed on an NVIDIA GTX 780 GPU (MC-GPU), with a baseline simulation speed of ~1e7 photons/sec. MC-GPU is accelerated by a novel, GPU-optimized implementation of variance reduction (VR) techniques (forced detection and photon splitting). The number of simulated tracks and projections is reduced for additional speed-up. Residual noise is removed and the missing scatter projections are estimated via kernel smoothing (KS) in the projection plane and across gantry angles. The method is assessed using CBCT images of a head phantom presenting a realistic simulation of fresh intracranial hemorrhage (100 kVp, 180 mAs, 720 projections, source-detector distance 700 mm, source-axis distance 480 mm). Results: For a fixed run-time of ~1 sec/projection, GPU-optimized VR reduces the noise in MC-GPU scatter estimates by a factor of 4. For scatter correction, MC-GPU with VR is executed with 4-fold angular downsampling and 1e5 photons/projection, yielding a 3.5-minute run-time per scan, and de-noised with optimized KS. Corrected CBCT images demonstrate a uniformity improvement of 18 HU and a contrast improvement of 26 HU compared to no correction, and a 52% increase in contrast-to-noise ratio in simulated hemorrhage compared to an “oracle” constant-fraction correction. Conclusion: Acceleration of MC-GPU achieved through GPU-optimized variance reduction and kernel smoothing yields an efficient (<5 min/scan) and accurate scatter correction that does not rely on additional hardware or simplifying assumptions about the scatter distribution. The method is undergoing implementation in a novel CBCT system dedicated to brain trauma imaging at the point of care in sports and military applications. Research grant from Carestream Health. JY is an employee of Carestream Health.
A GPU-based incompressible Navier-Stokes solver on moving overset grids
NASA Astrophysics Data System (ADS)
Chandar, Dominic D. J.; Sitaraman, Jayanarayanan; Mavriplis, Dimitri J.
2013-07-01
In pursuit of obtaining high-fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs), which were originally intended for gaming applications, are currently being used to accelerate computational fluid dynamics (CFD) codes. With a high peak throughput of about 1 TFLOPS on a PC, GPUs seem to be favourable for many high-resolution computations. One such computation that involves a lot of number crunching is computing time-accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver built from several GPU kernels such as gradient, Laplacian and reduction. Some of the simple arithmetic vector calculations are implemented using the CU++ approach (CU++: An Object Oriented Framework for Computational Fluid Dynamics Applications using Graphics Processing Units, Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9), in which GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.
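To illustrate the matrix-free kernel idea, here is a structured-grid stand-in (the paper itself works on unstructured/overset grids, so this conveys only the flavor of a Laplacian kernel, not the solver's code): a Krylov method needs only the action y = A*x, so the operator is applied on the fly and no matrix is ever stored.

// 7-point Laplacian applied matrix-free on a regular nx*ny*nz grid.
__global__ void laplacian_3d(const double* __restrict__ x, double* __restrict__ y,
                             int nx, int ny, int nz, double inv_h2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;                                   // skip boundary points
    int id = (k * ny + j) * nx + i;
    y[id] = inv_h2 * (x[id - 1]       + x[id + 1]
                    + x[id - nx]      + x[id + nx]
                    + x[id - nx * ny] + x[id + nx * ny]
                    - 6.0 * x[id]);
}

Each Krylov iteration then chains this operator kernel with dot-product reductions, exactly the gradient/Laplacian/reduction kernel mix the abstract mentions.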
Extending the BEAGLE library to a multi-FPGA platform.
Jin, Zheming; Bakos, Jason D
2013-01-19
Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1. The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE's CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE's GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt. The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.
GPU-accelerated computing for Lagrangian coherent structures of multi-body gravitational regimes
NASA Astrophysics Data System (ADS)
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-04-01
Based on a well-established theoretical foundation, Lagrangian Coherent Structures (LCSs) have elicited widespread research on the intrinsic structures of dynamical systems in many fields, including the field of astrodynamics. Although the application of LCSs in dynamical problems seems straightforward theoretically, its associated computational cost is prohibitive. We propose a block decomposition algorithm developed on Compute Unified Device Architecture (CUDA) platform for the computation of the LCSs of multi-body gravitational regimes. In order to take advantage of GPU's outstanding computing properties, such as Shared Memory, Constant Memory, and Zero-Copy, the algorithm utilizes a block decomposition strategy to facilitate computation of finite-time Lyapunov exponent (FTLE) fields of arbitrary size and timespan. Simulation results demonstrate that this GPU-based algorithm can satisfy double-precision accuracy requirements and greatly decrease the time needed to calculate final results, increasing speed by approximately 13 times. Additionally, this algorithm can be generalized to various large-scale computing problems, such as particle filters, constellation design, and Monte-Carlo simulation.
Katouda, Michio; Naruse, Akira; Hirano, Yukihiko; Nakajima, Takahito
2016-11-15
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Some improvements over the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been made: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.
Parallel Computer System for 3D Visualization Stereo on GPU
NASA Astrophysics Data System (ADS)
Al-Oraiqat, Anas M.; Zori, Sergii A.
2018-03-01
This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of ray intersections with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. A generalized procedure for 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration of the multi-threaded implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. A study of the influence of the size and configuration of the computational Compute Unified Device Architecture (CUDA) grid on the computational speed shows the importance of their correct selection. The obtained experimental estimates can be significantly improved by new GPUs with a larger number of processing cores and multiprocessors, as well as an optimized configuration of the computing CUDA grid.
Anandakrishnan, Ramu; Scogland, Tom R. W.; Fenley, Andrew T.; Gordon, John C.; Feng, Wu-chun; Onufriev, Alexey V.
2010-01-01
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multiscale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone. PMID:20452792
Andrade, Xavier; Aspuru-Guzik, Alán
2013-10-08
We discuss the application of graphical processing units (GPUs) to accelerate real-space density functional theory (DFT) calculations. To make our implementation efficient, we have developed a scheme to expose the data parallelism available in the DFT approach; this is applied to the different procedures required for a real-space DFT calculation. We present results for current-generation GPUs from AMD and Nvidia, which show that our scheme, implemented in the free code Octopus, can reach a sustained performance of up to 90 GFlops for a single GPU, representing a significant speed-up when compared to the CPU version of the code. Moreover, for some systems, our implementation can outperform a GPU Gaussian basis set code, showing that the real-space approach is a competitive alternative for DFT simulations on GPUs.
3D fast adaptive correlation imaging for large-scale gravity data based on GPU computation
NASA Astrophysics Data System (ADS)
Chen, Z.; Meng, X.; Guo, L.; Liu, G.
2011-12-01
In recent years, large-scale gravity data sets have been collected and employed to enhance the gravity problem-solving abilities of tectonics studies in China. Aiming at the large-scale data and the requirement of rapid interpretation, previous authors have carried out a lot of work, including fast gradient-module inversion and Euler deconvolution depth inversion, 3-D physical property inversion using stochastic subspaces and equivalent storage, and fast inversion using wavelet transforms and a logarithmic barrier method. So it can be said that 3-D gravity inversion has been greatly improved in the last decade. Many authors have added different kinds of a priori information and constraints to deal with nonuniqueness, using models composed of a large number of contiguous cells of unknown property, and have obtained good results. However, due to long computation time, instability, and other shortcomings, 3-D physical property inversion has not yet been widely applied to large-scale data. In order to achieve 3-D interpretation with high efficiency and precision for geological and ore bodies and to obtain their subsurface distribution, there is an urgent need for a fast and efficient inversion method for large-scale gravity data. As a relatively new geophysical inversion method, 3D correlation imaging has developed rapidly thanks to the advantages of requiring no a priori information and demanding a small amount of computer memory. This method was proposed to image the distribution of equivalent excess masses of anomalous geological bodies with high resolution both longitudinally and transversely. In order to transform the equivalent excess masses into real density contrasts, we adopt adaptive correlation imaging for gravity data. After each 3D correlation imaging step, we convert the equivalent masses into density contrasts according to the linear relationship, and then carry out a forward gravity calculation for each rectangular cell. Next, we compare the forward gravity data with the real data, and continue to perform 3D correlation imaging on the residual gravity data. After several iterations, we obtain a satisfactory result. Newly developed general-purpose computing technology based on the Nvidia GPU (Graphics Processing Unit) has been put into practice and has received widespread attention in many areas. Based on the GPU programming model and its two parallel levels, the five CPU loops of the main computation of 3D correlation imaging are converted into three loops in GPU kernel functions, thus achieving GPU/CPU collaborative computing. The two inner loops are defined as the dimensions of the blocks and the three outer loops are defined as the dimensions of the threads, realizing the double-loop block calculation. Theoretical and real gravity data tests show that the results are reliable and the computing time is greatly reduced. Acknowledgments: We acknowledge the financial support of the Sinoprobe project (201011039 and 201011049-03), the Fundamental Research Funds for the Central Universities (2010ZY26 and 2011PY0183), the National Natural Science Foundation of China (41074095) and the Open Project of the State Key Laboratory of Geological Processes and Mineral Resources (GPMR0945).
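The loop-to-grid mapping just described can be hard to picture; a schematic CUDA version under assumed loop bounds follows (the geometric sensitivity term is omitted and all names are assumptions, not the authors' code). The two inner (data) loops become the block-index dimensions and the three outer (cell) loops become the thread-index dimensions, so all five nested CPU loops collapse into one kernel launch.

__global__ void correlation_imaging(const float* __restrict__ grav, // observed data grid
                                    float* __restrict__ corr,       // image cells
                                    int ndx, int ndy,               // data dimensions
                                    int ncx, int ncy, int ncz)      // cell dimensions
{
    int cx = threadIdx.x, cy = threadIdx.y, cz = threadIdx.z;  // cell indices
    int dx = blockIdx.x,  dy = blockIdx.y;                     // data indices
    if (cx >= ncx || cy >= ncy || cz >= ncz || dx >= ndx || dy >= ndy) return;
    int cell = (cz * ncy + cy) * ncx + cx;
    float g = grav[dy * ndx + dx];
    // Accumulate this datum's contribution to this cell's correlation value;
    // the geometric kernel weight is omitted in this sketch.
    atomicAdd(&corr[cell], g);
}
// Launch: dim3 grid(ndx, ndy), block(ncx, ncy, ncz);
// note ncx*ncy*ncz must stay within the 1024-threads-per-block limit.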
GPU-based multi-volume ray casting within VTK for medical applications.
Bozorgi, Mohammadmehdi; Lindseth, Frank
2015-03-01
Multi-volume visualization is important for displaying relevant information in multimodal or multitemporal medical imaging studies. The main objective of the current study was to develop an efficient GPU-based multi-volume ray caster (MVRC) and validate the proposed visualization system in the context of image-guided surgical navigation. Ray casting can produce high-quality 2D images from 3D volume data, but the method is computationally demanding, especially when multiple volumes are involved, so a parallel GPU version has been implemented. In the proposed MVRC, imaginary rays are sent through the volumes (one ray for each pixel in the view), and at equal and short intervals along the rays, samples are collected from each volume. Samples from all the volumes are composited using front-to-back α-blending. Since all the rays can be processed simultaneously, the MVRC was implemented in parallel on the GPU to achieve acceptable interactive frame rates. The method is fully integrated within the Visualization Toolkit (VTK) pipeline, with the ability to apply different operations (e.g., transformations, clipping, and cropping) to each volume separately. The implemented method is cross-platform (Windows, Linux and Mac OS X) and runs on different graphics cards (NVidia and AMD). The speed of the MVRC was tested with one to five volumes of varying sizes: 128³, 256³, and 512³. A Tesla C2070 GPU was used, and the output image size was 600 × 600 pixels. The original VTK single-volume ray caster and the MVRC were compared when rendering only one volume. The multi-volume rendering system achieved an interactive frame rate (>15 fps) when rendering five small volumes (128³ voxels), four medium-sized volumes (256³ voxels), and two large volumes (512³ voxels). When rendering single volumes, the frame rate of the MVRC was comparable to the original VTK ray caster for small and medium-sized datasets but was approximately 3 frames per second slower for large datasets. The MVRC was successfully integrated in an existing surgical navigation system and was shown to be clinically useful during an ultrasound-guided neurosurgical tumor resection. The proposed multi-volume GPU-based ray caster for VTK provides high-quality images at reasonable frame rates and is a useful tool in medical visualization.
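A minimal sketch of the per-ray front-to-back α-blending the MVRC performs, with stand-in sampling and transfer functions (illustrative only, not the VTK implementation; a render kernel would call composite_ray once per pixel):

struct Sample { float r, g, b, a; };

// Stand-in for trilinear sampling plus a per-volume transfer function.
__device__ Sample fetch_sample(int vol, float3 p) {
    Sample s = {0.5f, 0.5f, 0.5f, 0.02f};
    (void)vol; (void)p;
    return s;
}

__device__ float4 composite_ray(int num_vols, float3 origin, float3 dir,
                                float t_end, float dt)
{
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (float t = 0.0f; t < t_end && acc.w < 0.99f; t += dt) { // early ray termination
        float3 p = make_float3(origin.x + t * dir.x,
                               origin.y + t * dir.y,
                               origin.z + t * dir.z);
        for (int v = 0; v < num_vols; ++v) {                    // sample each volume at p
            Sample s = fetch_sample(v, p);
            float w = s.a * (1.0f - acc.w);                     // front-to-back α-blending
            acc.x += w * s.r; acc.y += w * s.g;
            acc.z += w * s.b; acc.w += w;
        }
    }
    return acc;
}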
Porting a Hall MHD Code to a Graphic Processing Unit
NASA Technical Reports Server (NTRS)
Dorelli, John C.
2011-01-01
We present our experience porting a Hall MHD code to a Graphics Processing Unit (GPU). The code is a 2nd order accurate MUSCL-Hancock scheme which makes use of an HLL Riemann solver to compute numerical fluxes and second-order finite differences to compute the Hall contribution to the electric field. The divergence of the magnetic field is controlled with Dedner's hyperbolic divergence cleaning method. Preliminary benchmark tests indicate a speedup (relative to a single Nehalem core) of 58x for a double precision calculation. We discuss scaling issues which arise when distributing work across multiple GPUs in a CPU-GPU cluster.
NASA Astrophysics Data System (ADS)
Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.
2016-10-01
We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high-power lasers in mind, which require large-size gain media, we have developed the open-source code HASEonGPU that is capable of utilizing multiple graphics processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium-size GPU cluster of 64 NVIDIA Tesla K20m GPUs, and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Yb3+:YAG ceramics shows perfect agreement.
GPU-accelerated phase-field simulation of dendritic solidification in a binary alloy
NASA Astrophysics Data System (ADS)
Yamanaka, Akinori; Aoki, Takayuki; Ogawa, Satoi; Takaki, Tomohiro
2011-03-01
The phase-field simulation for dendritic solidification of a binary alloy has been accelerated by using a graphics processing unit (GPU). To perform the phase-field simulation of alloy solidification on the GPU, a program code was developed with the compute unified device architecture (CUDA). In this paper, the implementation technique of the phase-field model on the GPU is presented. Also, we evaluated the acceleration performance of the three-dimensional solidification simulation by using a single NVIDIA TESLA C1060 GPU and the developed program code. The results showed that the GPU calculation for 576³ computational grid points achieved a performance of 170 GFLOPS by utilizing shared memory as a software-managed cache. Furthermore, the computation with the GPU was demonstrated to be 100 times faster than that with a single CPU core. From the obtained results, we confirmed the feasibility of realizing a real-time full three-dimensional phase-field simulation of microstructure evolution on a personal desktop computer.
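A 2D cut-down sketch of using shared memory as a software-managed cache for a stencil update (illustrative; the actual phase-field model is 3D and its update rule is more involved). Each block stages its tile plus a one-cell halo in shared memory so interior stencil reads come from on-chip storage.

#define TILE 16  // block tile width (assumption); nx and ny assumed multiples of TILE

__global__ void diffuse_step(const float* __restrict__ phi, float* __restrict__ out,
                             int nx, int ny, float coef)
{
    __shared__ float tile[TILE + 2][TILE + 2];
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    tile[ly][lx] = phi[gy * nx + gx];                       // center value
    if (threadIdx.x == 0)        tile[ly][0]        = phi[gy * nx + max(gx - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[ly][TILE + 1] = phi[gy * nx + min(gx + 1, nx - 1)];
    if (threadIdx.y == 0)        tile[0][lx]        = phi[max(gy - 1, 0) * nx + gx];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][lx] = phi[min(gy + 1, ny - 1) * nx + gx];
    __syncthreads();                                        // tile + halo now resident
    // 5-point stencil served entirely from shared memory.
    out[gy * nx + gx] = tile[ly][lx]
        + coef * (tile[ly][lx - 1] + tile[ly][lx + 1]
                + tile[ly - 1][lx] + tile[ly + 1][lx]
                - 4.0f * tile[ly][lx]);
}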
Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
2014-04-01
The number of microbial community samples is increasing at an exponential rate. Data mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for comparison among microbial communities are limited in their ability to process large numbers of samples, each with a complex community structure. We have developed an optimized GPU-based software package, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among a massive number of microbial community samples. Our results have shown that GPU-Meta-Storms can compute the pair-wise similarity scores for 10 240 samples within 20 min, a speed-up of >17 000 times compared with a single-core CPU, and >2600 times compared with a 16-core CPU. Therefore, the high performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make real-time analysis and monitoring of temporal or conditional changes in microbial communities possible. GPU-Meta-Storms is implemented in CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
NASA Astrophysics Data System (ADS)
Guo, X.; Li, Y.; Suo, T.; Liu, H.; Zhang, C.
2017-11-01
This paper proposes a method for de-blurring images captured during the dynamic deformation of materials. De-blurring is achieved with a dynamics-based approach, which estimates the Point Spread Function (PSF) during the camera exposure window. The deconvolution process, involving iterative matrix calculations over pixels, is then performed on the GPU to decrease the time cost. Compared to the Gauss method and the Lucy-Richardson method, it achieves the best image restoration. The proposed method has been evaluated using a Hopkinson bar loading system. In comparison to the blurry image, the proposed method successfully restores the image. Image processing applications also demonstrate that the de-blurring method can improve the accuracy and stability of digital image correlation measurements.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer images are decompressed with a traditional CPU-based method. However, this is too slow when the images become large, for example 2K×2K×16 bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA Graphics Processing Units (GPUs) supporting the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part comprising inverse quantization, the inverse discrete wavelet transform (IDWT), and image post-processing. To reduce execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part gains its efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced global memory access, and texture memory optimization. In particular, the IDWT is significantly sped up by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16 bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
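To illustrate how a 2D serial IDWT factors into data-parallel 1D passes, here is a cut-down inverse step for one level along the rows, using the orthonormal Haar synthesis as a stand-in (the paper's actual wavelet filters are not given, so Haar is an assumption; one thread reconstructs one output pair).

__global__ void idwt_rows_haar(const float* __restrict__ approx,
                               const float* __restrict__ detail,
                               float* __restrict__ out, int half, int rows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // coefficient-pair index
    int r = blockIdx.y;                              // row index
    if (i >= half || r >= rows) return;
    float a = approx[r * half + i], d = detail[r * half + i];
    out[r * 2 * half + 2 * i]     = (a + d) * 0.70710678f;  // even sample: (a+d)/sqrt(2)
    out[r * 2 * half + 2 * i + 1] = (a - d) * 0.70710678f;  // odd sample:  (a-d)/sqrt(2)
}
// A second kernel applies the same step down the columns to complete one 2D level;
// every pair and every row is independent, which is what makes the pass data-parallel.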
Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs
NASA Astrophysics Data System (ADS)
Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi
This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
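For the inter-task parallelization such tools commonly use (one thread aligns the query against one database sequence), a compact Smith-Waterman scoring sketch with a linear gap penalty follows; it is illustrative only, and the real kernels with their shared-memory data-reuse optimizations are considerably more elaborate. The sequence layout, QMAX bound (qlen must not exceed it), and scoring parameters (e.g., match=2, mismatch=-1, gap=1) are assumptions.

#define QMAX 64  // maximum query length held in the per-thread DP row

__global__ void sw_scores(const char* __restrict__ query, int qlen,
                          const char* __restrict__ db,     // concatenated sequences
                          const int*  __restrict__ offset, // start of each sequence
                          const int*  __restrict__ length, // length of each sequence
                          int nseq, int* __restrict__ best,
                          int match, int mismatch, int gap)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nseq) return;
    int H[QMAX + 1], E[QMAX + 1];          // one DP row per thread
    for (int i = 0; i <= qlen; ++i) { H[i] = 0; E[i] = 0; }
    int best_score = 0;
    for (int j = 0; j < length[s]; ++j) {  // walk this database sequence
        char c = db[offset[s] + j];
        int diag = 0, F = 0;
        for (int i = 1; i <= qlen; ++i) {
            int sc = diag + ((query[i - 1] == c) ? match : mismatch);
            E[i] = max(E[i] - gap, H[i] - gap);       // gap along the database
            F    = max(F - gap, H[i - 1] - gap);      // gap along the query
            diag = H[i];                              // save H(i, j-1) for next i
            H[i] = max(max(sc, 0), max(E[i], F));     // local alignment: floor at 0
            best_score = max(best_score, H[i]);
        }
    }
    best[s] = best_score;
}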
AMITIS: A 3D GPU-Based Hybrid-PIC Model for Space and Plasma Physics
NASA Astrophysics Data System (ADS)
Fatemi, Shahab; Poppe, Andrew R.; Delory, Gregory T.; Farrell, William M.
2017-05-01
We have developed, for the first time, an advanced modeling infrastructure in space simulations (AMITIS) with an embedded three-dimensional self-consistent grid-based hybrid model of plasma (kinetic ions and fluid electrons) that runs entirely on graphics processing units (GPUs). The model uses NVIDIA GPUs and their associated parallel computing platform, CUDA, developed for general purpose processing on GPUs. The model uses a single CPU-GPU pair, where the CPU transfers data between the system and GPU memory, executes CUDA kernels, and writes simulation outputs on the disk. All computations, including moving particles, calculating macroscopic properties of particles on a grid, and solving hybrid model equations are processed on a single GPU. We explain various computing kernels within AMITIS and compare their performance with an already existing well-tested hybrid model of plasma that runs in parallel using multi-CPU platforms. We show that AMITIS runs ∼10 times faster than the parallel CPU-based hybrid model. We also introduce an implicit solver for computation of Faraday’s Equation, resulting in an explicit-implicit scheme for the hybrid model equation. We show that the proposed scheme is stable and accurate. We examine the AMITIS energy conservation and show that the energy is conserved with an error < 0.2% after 500,000 timesteps, even when a very low number of particles per cell is used.
NASA Astrophysics Data System (ADS)
Ha, Sanghyun; Park, Junshin; You, Donghyun
2018-01-01
Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programming model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method, are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grid points. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.
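The batched tridiagonal inversions arising from the ADI sweeps map naturally onto cuSPARSE's strided-batch solver; a hedged sketch of such a call follows (the helper name and memory layout are assumptions, not the authors' code; dl/d/du hold the sub-, main- and super-diagonals of all systems back to back).

#include <cusparse.h>
#include <cuda_runtime.h>

// Solve `batch` independent tridiagonal systems of size m, stored contiguously
// with stride m, in place in x (device pointers throughout).
void solve_adi_batch(cusparseHandle_t h, int m, int batch,
                     const float* dl, const float* d, const float* du, float* x)
{
    size_t bytes = 0;
    cusparseSgtsv2StridedBatch_bufferSizeExt(h, m, dl, d, du, x,
                                             batch, m, &bytes);
    void* buf = nullptr;
    cudaMalloc(&buf, bytes);                      // scratch workspace
    cusparseSgtsv2StridedBatch(h, m, dl, d, du, x, batch, m, buf);
    cudaFree(buf);
}

// Usage: create a handle once with cusparseCreate(&h); each ADI sweep then
// becomes one batched call, one per spatial direction.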
Extending the length and time scales of Gram-Schmidt Lyapunov vector computations
NASA Astrophysics Data System (ADS)
Costa, Anthony B.; Green, Jason R.
2013-08-01
Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram-Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N² (with the particle count). This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram-Schmidt vectors. The first is a distributed-memory message-passing method using ScaLAPACK. The second uses the newly released MAGMA library for GPUs. We compare the performance of both codes for Lennard-Jones fluids from N=100 to 1300 between Intel Nehalem/InfiniBand DDR and NVIDIA C2050 architectures. To the best of our knowledge, these are the largest systems for which the Gram-Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.
Wang, Yuhe; Mazur, Thomas R; Green, Olga; Hu, Yanle; Li, Hua; Rodriguez, Vivian; Wooten, H Omar; Yang, Deshan; Zhao, Tianyu; Mutic, Sasa; Li, H Harold
2016-07-01
The clinical commissioning of IMRT subject to a magnetic field is challenging. The purpose of this work is to develop a GPU-accelerated Monte Carlo dose calculation platform based on penelope and then use the platform to validate a vendor-provided MRIdian head model toward quality assurance of clinical IMRT treatment plans subject to a 0.35 T magnetic field. penelope was first translated from Fortran to C++, and the result was confirmed to produce results equivalent to the original code. The C++ code was then adapted to CUDA in a workflow optimized for the GPU architecture. The original code was expanded to include voxelized transport with Woodcock tracking, faster electron/positron propagation in a magnetic field, and several features that make gpenelope highly user-friendly. Moreover, the vendor-provided MRIdian head model was incorporated into the code in an effort to apply gpenelope as both an accurate and rapid dose validation system. A set of experimental measurements was performed on the MRIdian system to examine the accuracy of both the head model and gpenelope. Ultimately, gpenelope was applied toward independent validation of patient doses calculated by MRIdian's kmc. An acceleration factor of 152 was achieved in comparison to the original single-thread Fortran implementation, with the original accuracy preserved. For 16 treatment plans including stomach (4), lung (2), liver (3), adrenal gland (2), pancreas (2), spleen (1), mediastinum (1), and breast (1), the MRIdian dose calculation engine agrees with gpenelope with a mean gamma passing rate of 99.1% ± 0.6% (2%/2 mm). A Monte Carlo simulation platform was thus developed based on a GPU-accelerated version of penelope. This platform was used to validate that both the vendor-provided head model and the fast Monte Carlo engine used by the MRIdian system are accurate in modeling radiation transport in a patient using 2%/2 mm gamma criteria. Future applications of this platform will include dose validation and accumulation, IMRT optimization, and dosimetry system modeling for next-generation MR-IGRT systems.
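Woodcock (delta) tracking, mentioned above, has a compact core that suits GPUs because it avoids per-voxel boundary crossings; the conceptual device-side sketch below uses a stand-in cross-section lookup and a toy RNG (these are illustrative assumptions, not gpenelope internals).

// Uniform stand-in medium; a real code would look up the voxel's material here.
__device__ float sigma_at(float3 p) { (void)p; return 0.05f; }

// Toy xorshift RNG for the sketch; seed must be nonzero.
__device__ float rng_uniform(unsigned int* s)
{
    *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
    return (*s & 0x00FFFFFF) / 16777216.0f;      // in [0, 1)
}

// Advance a photon to its next real interaction site. sigma_max is the
// majorant (maximum) cross-section over the whole geometry.
__device__ float3 woodcock_step(float3 p, float3 dir, float sigma_max,
                                unsigned int* seed)
{
    for (;;) {
        float s = -logf(1.0f - rng_uniform(seed)) / sigma_max; // majorant free flight
        p.x += s * dir.x; p.y += s * dir.y; p.z += s * dir.z;
        // Accept as a real interaction with probability sigma(p)/sigma_max;
        // otherwise it is a virtual collision and the flight continues.
        if (rng_uniform(seed) < sigma_at(p) / sigma_max) return p;
    }
}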
DOE Office of Scientific and Technical Information (OSTI.GOV)
Su, L; Du, X; Liu, T
Purpose: As a module of ARCHER -- Accelerated Radiation-transport Computations in Heterogeneous EnviRonments -- ARCHER-RT is designed for RadioTherapy (RT) dose calculation. This paper describes the application of ARCHER-RT to patient-dependent TomoTherapy and patient-independent IMRT. It also conducts a 'fair' comparison of different GPUs and a multicore CPU. Methods: The source input used for patient-dependent TomoTherapy is a phase space file (PSF) generated from the optimized plan. For patient-independent IMRT, the open-field PSF is used for different cases. The intensity modulation is simulated by a fluence map. The GEANT4 code is used as the benchmark. DVH and gamma index tests are employed to evaluate the accuracy of the ARCHER-RT code. Some previous studies reported misleading speedups by comparing GPU code with serial CPU code. To perform a fairer comparison, we wrote multi-threaded code with OpenMP to fully exploit the computing potential of the CPU. The hardware involved in this study is a 6-core Intel E5-2620 CPU and 6 NVIDIA M2090 GPUs, a K20 GPU and a K40 GPU. Results: Dosimetric results from ARCHER-RT and GEANT4 show good agreement. The 2%/2mm gamma test pass rates for different clinical cases are 97.2% to 99.7%. A single M2090 GPU needs 50~79 seconds for the simulation to achieve a statistical error of 1% in the PTV. The K40 card is about 1.7∼1.8 times faster than the M2090 card. Using 6 M2090 cards, the simulation can be finished in about 10 seconds. For comparison, the Intel E5-2620 needs 507∼879 seconds for the same simulation. Conclusion: We successfully applied ARCHER-RT to TomoTherapy and patient-independent IMRT, and conducted a fair comparison between GPU and CPU performance. The ARCHER-RT code is both accurate and efficient and may be used towards clinical applications.
A simple method of calculating Stirling engines for engine design optimization
NASA Technical Reports Server (NTRS)
Martini, W. R.
1978-01-01
A calculation method is presented for a rhombic drive Stirling engine with a tubular heater and cooler and a screen type regenerator. Generally the equations presented describe power generation and consumption and heat losses. It is the simplest type of analysis that takes into account the conflicting requirements inherent in Stirling engine design. The method itemizes the power and heat losses for intelligent engine optimization. The results of engine analysis of the GPU-3 Stirling engine are compared with more complicated engine analysis and with engine measurements.
Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units
Hardy, David J.; Stone, John E.; Schulten, Klaus
2009-01-01
Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. PMID:20161132
Sachetto Oliveira, Rafael; Martins Rocha, Bernardo; Burgarelli, Denise; Meira, Wagner; Constantinides, Christakis; Weber Dos Santos, Rodrigo
2018-02-01
The use of computer models as a tool for the study and understanding of the complex phenomena of cardiac electrophysiology has attained increased importance. At the same time, the increased complexity of the biophysical processes translates into complex computational and mathematical models. To speed up cardiac simulations and to allow more precise and realistic uses, two different techniques have been traditionally exploited: parallel computing and sophisticated numerical methods. In this work, we combine a modern parallel computing technique based on multicore processors and graphics processing units (GPUs) with a sophisticated numerical method based on a new space-time adaptive algorithm. We evaluate each technique alone and in different combinations: multicore and GPU; multicore, GPU, and space adaptivity; multicore, GPU, space adaptivity, and time adaptivity. All the techniques and combinations were evaluated under different scenarios: 3D simulations on slabs and 3D simulations on a ventricular mouse mesh (i.e., a complex geometry), under sinus-rhythm and arrhythmic conditions. Our results suggest that multicore and GPU accelerate the simulations by an approximate factor of 33×, whereas the speedups attained by the space-time adaptive algorithms were approximately 48×. Nevertheless, by combining all the techniques, we obtained speedups that ranged between 165× and 498×. The tested methods were able to reduce the execution time of a simulation by more than 498× for a complex cellular model in a slab geometry and by 165× in a realistic heart geometry simulating spiral waves. The proposed methods will allow faster and more realistic simulations in a feasible time with no significant loss of accuracy. Copyright © 2017 John Wiley & Sons, Ltd.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neylon, J; Qi, S; Sheng, K
2014-06-15
Purpose: To develop a GPU-based framework that can generate high-resolution, patient-specific biomechanical models from a given simulation CT and contoured structures, optimized to run at interactive speeds, for addressing adaptive radiotherapy objectives. Methods: A mass-spring-damping (MSD) model was generated from a given simulation CT. The model's mass elements were generated for every voxel of anatomy and positioned in a deformation space in GPU memory. MSD connections were established between neighboring mass elements in a dense distribution. Contoured internal structures allowed control over the elastic material properties of different tissues. Once the model was initialized in GPU memory, skeletal anatomy was actuated using rigid-body transformations, while soft tissues were governed by elastic corrective forces and constraints, which included tensile forces, shear forces, and spring damping forces. The model was validated by applying a known load to a soft tissue block and comparing the observed deformation to ground-truth calculations from established elastic mechanics. Results: Our analyses showed that both local and global load experiments yielded results with a correlation coefficient R^2 > 0.98 compared to ground truth. Models were generated for several anatomical regions. Head and neck models accurately simulated posture changes by rotating the skeletal anatomy in three dimensions. Pelvic models were developed for realistic deformations for changes in bladder volume. Thoracic models demonstrated breast deformation due to gravity when changing treatment position from supine to prone. The GPU framework performed at greater than 30 iterations per second for over 1 million mass elements with up to 26 MSD connections each. Conclusions: Realistic simulations of site-specific, complex posture and physiological changes were simulated at interactive speeds using patient data. Incorporating such a model with live patient tracking would facilitate real-time assessment of variations in the actual anatomy and delivered dose for adaptive intervention and re-planning.
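A minimal CUDA sketch of the per-element force accumulation such an MSD model implies is given below, assuming a flat connection list padded to maxConn entries per element and a linear spring-plus-damping force law; the kernel name, data layout, and force law are assumptions for illustration, not the authors' implementation.

#include <cuda_runtime.h>

// One thread per mass element: sum elastic corrective forces from its
// MSD connections (linear spring toward rest length plus a damping term
// on the relative velocity). A negative index marks an unused slot.
__global__ void msdForces(const float3* pos, const float3* vel,
                          const int* connIdx, const float* restLen,
                          const float* stiffness, float damping,
                          float3* force, int nElems, int maxConn)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nElems) return;

    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int c = 0; c < maxConn; ++c) {
        int j = connIdx[i * maxConn + c];
        if (j < 0) continue;                       // unused slot
        float3 d = make_float3(pos[j].x - pos[i].x,
                               pos[j].y - pos[i].y,
                               pos[j].z - pos[i].z);
        float len = sqrtf(d.x*d.x + d.y*d.y + d.z*d.z) + 1e-8f;
        float k   = stiffness[i * maxConn + c];    // per-tissue elasticity
        float s   = k * (len - restLen[i * maxConn + c]) / len;
        f.x += s * d.x;  f.y += s * d.y;  f.z += s * d.z;
        // simple damping on the relative velocity
        f.x += damping * (vel[j].x - vel[i].x);
        f.y += damping * (vel[j].y - vel[i].y);
        f.z += damping * (vel[j].z - vel[i].z);
    }
    force[i] = f;
}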
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics that uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic optimization of body shapes (e.g. planes, cars, ships, sport helmets, skis), analysis of natural phenomena, numerical simulation for weather forecasting, and realistic visualizations. CFD problems are very complex and need a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free-surface approximation (the MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment due to its simplicity of programming and good computational performance. We used a finite-difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of the local, surrounding cells, and it is therefore well suited to GPU-based architectures. In this paper we demonstrate how to use the computing power of GPUs for CFD efficiently. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges around multi-GPU implementation using the example of matrix multiplication.
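Because each cell depends only on its immediate neighbors, the update maps naturally onto one GPU thread per cell. The following minimal CUDA sketch shows this locality for a generic diffusion-like update of a scalar field; the actual MAC solver also advects velocities and enforces incompressibility, and all names here are illustrative assumptions.

#include <cuda_runtime.h>

// One thread per grid cell: update the cell from its four neighbors only,
// the locality that makes MAC-type solvers map well to GPUs. Shown for a
// diffusion-like update of a scalar field u with time step dt, viscosity
// nu, and grid spacing h.
__global__ void localUpdate(const float* u, float* uNew,
                            int nx, int ny, float dt, float nu, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // skip boundary

    int id = j * nx + i;
    float lap = (u[id - 1] + u[id + 1] + u[id - nx] + u[id + nx]
                 - 4.0f * u[id]) / (h * h);
    uNew[id] = u[id] + dt * nu * lap;
}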
Accelerating separable footprint (SF) forward and back projection on GPU
NASA Astrophysics Data System (ADS)
Xie, Xiaobin; McGaffin, Madison G.; Long, Yong; Fessler, Jeffrey A.; Wen, Minhua; Lin, James
2017-03-01
Statistical image reconstruction (SIR) methods for X-ray CT can improve image quality and reduce radiation dose compared with conventional reconstruction methods, such as filtered back projection (FBP). However, SIR methods require much longer computation time. The separable footprint (SF) forward and back projection technique simplifies the calculation of intersecting volumes of image voxels and finite-size beams in a way that is both accurate and efficient for parallel implementation. We propose a new method to accelerate the SF forward and back projection on GPU with NVIDIA's CUDA environment. For the forward projection, we parallelize over all detector cells. For the back projection, we parallelize over all 3D image voxels. The simulation results show that the proposed method is faster than the acceleration method of the SF projectors proposed by Wu and Fessler [13]. We further accelerate the proposed method using multiple GPUs. The results show that the computation time is reduced approximately in proportion to the number of GPUs.
NASA Astrophysics Data System (ADS)
Chun, Poo-Reum; Lee, Se-Ah; Yook, Yeong-Geun; Choi, Kwang-Sung; Cho, Deog-Geun; Yu, Dong-Hun; Chang, Won-Seok; Kwon, Deuk-Chul; Im, Yeon-Ho
2013-09-01
Although plasma etch profile simulation has attracted much interest for developing reliable plasma etching, there still exist big gaps between the current research status and predictive modeling due to the inherent complexity of the plasma process. As an effort to address this issue, we present a 3D feature profile simulation coupled with a well-defined plasma-surface kinetic model for the silicon dioxide etching process under fluorocarbon plasmas. To capture realistic plasma surface reaction behaviors, a polymer-layer-based surface kinetic model was proposed to consider simultaneous polymer deposition and oxide etching. The realistic plasma surface model was then used to calculate the speed function for the 3D topology simulation, which consists of a multiple-level-set-based moving algorithm and a ballistic transport module. In addition, the time-consuming computations in the ballistic transport calculation were improved drastically by GPU-based numerical computation, enabling real-time computation. Finally, we demonstrate that the surface kinetic model can be coupled successfully to 3D etch profile simulations of high-aspect-ratio contact hole plasma etching.
Replica Exchange Molecular Dynamics in the Age of Heterogeneous Architectures
NASA Astrophysics Data System (ADS)
Roitberg, Adrian
2014-03-01
The rise of GPU-based codes has allowed MD to reach timescales only dreamed of five years ago. Even within this new paradigm there is still a need for advanced sampling techniques. Modern supercomputers (e.g. Blue Waters, Titan, Keeneland) have made available to users a significant number of GPUs and CPUs, which in turn translates into amazing opportunities for dream calculations. Replica-exchange-based methods can optimally use this combination of codes and architectures to explore conformational variability in large systems. I will show our recent work in porting the program Amber to GPUs, and the support for replica exchange methods, where the replicated dimension can be temperature, pH, Hamiltonian, umbrella windows, or combinations of those schemes.
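The exchange step that such replica-exchange schemes rely on is compact enough to show in full. Below is a hedged host-side sketch of the standard parallel-tempering acceptance rule for a temperature-replica swap, min(1, exp[(beta_i - beta_j)(E_i - E_j)]); in practice each replica would run on its own GPU between swap attempts, and all names here are illustrative.

#include <cmath>
#include <cstdlib>
#include <utility>
#include <vector>

// Metropolis acceptance for a temperature-replica swap: two replicas at
// inverse temperatures betaI, betaJ with potential energies eI, eJ swap
// with probability min(1, exp((betaI - betaJ) * (eI - eJ))).
bool attemptSwap(double betaI, double betaJ, double eI, double eJ)
{
    double delta = (betaI - betaJ) * (eI - eJ);
    if (delta >= 0.0) return true;                       // always accept
    return std::exp(delta) > (double)rand() / RAND_MAX;  // Metropolis
}

void exchangeSweep(std::vector<double>& beta, std::vector<double>& energy)
{
    // Attempt swaps between neighboring temperatures; on acceptance,
    // exchanging temperatures is equivalent to exchanging configurations.
    for (size_t i = 0; i + 1 < beta.size(); i += 2)
        if (attemptSwap(beta[i], beta[i + 1], energy[i], energy[i + 1]))
            std::swap(beta[i], beta[i + 1]);
}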
Free energy simulations with the AMOEBA polarizable force field and metadynamics on GPU platform.
Peng, Xiangda; Zhang, Yuebin; Chu, Huiying; Li, Guohui
2016-03-05
The free energy calculation library PLUMED has been incorporated into the OpenMM simulation toolkit, with the purpose of performing enhanced-sampling MD simulations using the AMOEBA polarizable force field on the GPU platform. Two examples, (I) the free energy profile of water pair separation and (II) the alanine dipeptide dihedral angle free energy surface in explicit solvent, are provided here to demonstrate the accuracy and efficiency of our implementation. Converged free energy profiles could be obtained within an affordable MD simulation time when the AMOEBA polarizable force field is employed. Moreover, the free energy surfaces estimated using the AMOEBA polarizable force field are in agreement with those calculated from experimental data and ab initio methods. Hence, the implementation in this work is reliable and can be utilized to study more complicated biological phenomena in both an accurate and efficient way. © 2015 Wiley Periodicals, Inc.
Real-time digital holographic microscopy using the graphic processing unit.
Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi
2008-08-04
Digital holographic microscopy (DHM) is a well-known powerful method that allows both the amplitude and phase of a specimen to be observed simultaneously. In order to obtain a reconstructed image from a hologram, numerous calculations of the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphics processing unit (GPU) with many stream processors, which allows it to be used as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms of 512 x 512 grid size at 24 frames per second.
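The FFT-based Fresnel reconstruction mentioned above boils down to a forward FFT, a per-frequency phase multiplication, and an inverse FFT. A minimal CUDA sketch of the middle step follows, applying the Fresnel transfer function H(fx, fy) = exp(-i*pi*lambda*z*(fx^2 + fy^2)) to a cuFFT spectrum; names and the dropped constant phase factor are illustrative assumptions.

#include <cuComplex.h>
#include <cufft.h>

// Between a forward and an inverse cuFFT, multiply the hologram spectrum
// by the Fresnel transfer function (constant phase omitted). One thread
// per frequency sample; df is the frequency-bin spacing.
__global__ void fresnelTransfer(cufftComplex* spec, int n,
                                float lambda, float z, float df)
{
    const float PI = 3.14159265358979f;
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u >= n || v >= n) return;

    // map index to signed spatial frequency
    float fx = ((u < n / 2) ? u : u - n) * df;
    float fy = ((v < n / 2) ? v : v - n) * df;
    float phase = -PI * lambda * z * (fx * fx + fy * fy);

    cufftComplex h = make_cuComplex(cosf(phase), sinf(phase));
    int id = v * n + u;
    spec[id] = cuCmulf(spec[id], h);
}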
Multigroup Monte Carlo on GPUs: Comparison of history- and event-based algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hamilton, Steven P.; Slattery, Stuart R.; Evans, Thomas M.
2017-12-22
This article presents an investigation of the performance of different multigroup Monte Carlo transport algorithms on GPUs, with a discussion of both history-based and event-based approaches. Several algorithmic improvements are introduced for both approaches. By modifying the history-based algorithm that is traditionally favored in CPU-based MC codes to occasionally filter out dead particles to reduce thread divergence, performance exceeds that of either the pure history-based or event-based approach. The impacts of several algorithmic choices are discussed, including performance studies on Kepler and Pascal generation NVIDIA GPUs for fixed source and eigenvalue calculations. Single-device performance equivalent to 20–40 CPU cores on the K40 GPU and 60–80 CPU cores on the P100 GPU is achieved. Finally, nearly perfect multi-device parallel weak scaling is demonstrated on more than 16,000 nodes of the Titan supercomputer.
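The dead-particle filtering described above is essentially stream compaction. A minimal sketch using Thrust follows, assuming a simple particle struct with an 'alive' flag; the actual code's data layout and filtering interval are not shown here.

#include <thrust/device_vector.h>
#include <thrust/remove.h>

// Occasional dead-particle filtering for a history-based MC loop:
// every K steps, compact the particle bank so warps stop carrying
// terminated histories. Struct layout is an illustrative assumption.
struct Particle { float x, y, z, u, v, w, E, wgt; int alive; };

struct IsDead {
    __host__ __device__ bool operator()(const Particle& p) const {
        return p.alive == 0;
    }
};

size_t compactParticles(thrust::device_vector<Particle>& bank)
{
    // remove_if shifts live particles to the front and returns the new end
    auto newEnd = thrust::remove_if(bank.begin(), bank.end(), IsDead());
    size_t nLive = newEnd - bank.begin();
    bank.resize(nLive);
    return nLive;
}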
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to the study of the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive FDTD computations are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional FDTD scheme using multi-CPU and multi-GPU, respectively. In the first implementation, an open source Message Passing Interface implementation (Open MPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been included, with shared memory approaches that take advantage of multi-core platforms. The second implementation, the multi-GPU code version, is based on peer-to-peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement obtained by distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it is demonstrated that adding shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case GPU computing is roughly twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version with auto-vectorisation and a shared memory approach. In this scenario GPU computing is the best option, since it provides homogeneous behaviour. More specifically, the speedup of GPU computing reaches an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for one GPU and two GPUs, respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these types of applications.
GPU-based simulation of optical propagation through turbulence for active and passive imaging
NASA Astrophysics Data System (ADS)
Monnier, Goulven; Duval, François-Régis; Amram, Solène
2014-10-01
IMOTEP is a GPU-based (Graphics Processing Unit) software package relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to the parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of the IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing the higher moments of the incident wavefront distortion, provides realistic PSFs. However, we initially used a coarse Gaussian model to fit the numerical PSFs and characterise their spatial statistics through only three parameters (the two-dimensional displacement of the centroid and the width). This approach was unable to reproduce effects related to the details of the PSF structure, especially the "speckles" leading to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Component Analysis (PCA), intended to capture most of the PSF complexity. The GPU implementation allows the numerous (up to several hundred) principal components typically required under the strong turbulence regime to be estimated and handled efficiently. A first, demanding computational step involves the PCA, phase screen propagation and covariance estimates. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are generated quickly. Preliminary results are presented.
XaNSoNS: GPU-accelerated simulator of diffraction patterns of nanoparticles
NASA Astrophysics Data System (ADS)
Neverov, V. S.
XaNSoNS is open source software with GPU support, which simulates X-ray and neutron 1D (or 2D) diffraction patterns and pair-distribution functions (PDF) for amorphous or crystalline nanoparticles (up to ∼10^7 atoms) of heterogeneous structural content. Among the multiple parameters of the structure, the user may specify atomic displacements, site occupancies, molecular displacements and molecular rotations. The software uses general equations, nonspecific to crystalline structures, to calculate the scattering intensity. It supports four major standards of parallel computing: MPI, OpenMP, Nvidia CUDA and OpenCL, enabling it to run on various architectures, from CPU-based HPCs to consumer-level GPUs.
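For scattering from arbitrary (non-crystalline) structures, a Debye-type double sum over atom pairs is the textbook form, I(q) = Σ_ij f_i f_j sin(q r_ij)/(q r_ij). The CUDA sketch below computes per-atom partial sums with one thread per atom, assuming a single constant form factor f; XaNSoNS itself is far more general, so this illustrates only the parallelization pattern.

#include <cuda_runtime.h>

// One thread per atom i: loop over all atoms j and accumulate the
// Debye sum contribution. Partial sums are reduced afterwards (e.g.,
// on the host or with a device reduction).
__global__ void debyeIntensity(const float3* r, int nAtoms,
                               float q, float f, float* partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float s = 0.0f;
    for (int j = 0; j < nAtoms; ++j) {
        float dx = r[i].x - r[j].x;
        float dy = r[i].y - r[j].y;
        float dz = r[i].z - r[j].z;
        float d  = sqrtf(dx * dx + dy * dy + dz * dz);
        float qr = q * d;
        s += (qr < 1e-6f) ? 1.0f : sinf(qr) / qr;   // sinc; i==j gives 1
    }
    partial[i] = f * f * s;
}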
Shear Elasticity and Shear Viscosity Imaging in Soft Tissue
NASA Astrophysics Data System (ADS)
Yang, Yiqun
In this thesis, a new approach is introduced that provides estimates of shear elasticity and shear viscosity using time-domain measurements of shear waves in viscoelastic media. Simulations of shear wave particle displacements induced by an acoustic radiation force are accelerated significantly by a GPU. The acoustic radiation force is first calculated using the fast near field method (FNM) and the angular spectrum approach (ASA). The shear waves induced by the acoustic radiation force are then simulated in elastic and viscoelastic media using Green's functions. A parallel algorithm is developed to perform these calculations on a GPU, where the shear wave particle displacements at different observation points are calculated in parallel. The resulting speed increase enables rapid evaluation of shear waves at discrete points, in 2D planes, and for push beams with different spatial samplings and for different values of the f-number (f/#). The results of these simulations show that push beams with smaller f/# require a higher spatial sampling rate. The significant amount of acceleration achieved by this approach suggests that shear wave simulations with the Green's function approach are ideally suited for high-performance GPUs. Shear wave elasticity imaging determines the mechanical parameters of soft tissue by analyzing measured shear waves induced by an acoustic radiation force. To estimate the shear elasticity value, the widely used time-of-flight method calculates the correlation between shear wave particle velocities at adjacent lateral observation points. Although this method provides accurate estimates of the shear elasticity in purely elastic media, our experience suggests that the time-of-flight (TOF) method consistently overestimates the shear elasticity values in viscoelastic media because the combined effects of diffraction, attenuation, and dispersion are not considered. To address this problem, we have developed an approach that directly accounts for all of these effects when estimating the shear elasticity. This new approach simulates shear wave particle velocities using a Green's function-based approach for the Voigt model, where the shear elasticity and viscosity values are estimated using an optimization-based approach that compares measured shear wave particle velocities with simulated shear wave particle velocities in the time-domain. The results are evaluated on a point-by-point basis to generate images. There is good agreement between the simulated and measured shear wave particle velocities, where the new approach yields much better images of the shear elasticity and shear viscosity than the TOF method. The new estimation approach is accelerated with an approximate viscoelastic Green's function model that is evaluated with shear wave data obtained from in vivo human livers. Instead of calculating shear waves with combinations of different shear elasticities and shear viscosities, shear waves are calculated with different shear elasticities on the GPU and then convolved with a viscous loss model, which accelerates the calculation dramatically. The shear elasticity and shear viscosity values are then estimated using an optimization-based approach by minimizing the difference between measured and simulated shear wave particle velocities. Shear elasticity and shear viscosity images are generated at every spatial point in a two-dimensional (2D) field-of-view (FOV). 
The new approach is applied to measured shear wave data obtained from in vivo human livers, and the results show that this new approach successfully generates shear elasticity and shear viscosity images from this data. The results also indicate that the shear elasticity values estimated with this approach are significantly smaller than the values estimated with the conventional TOF method and that the new approach demonstrates more consistent values for these estimates compared with the TOF method. This experience suggests that the new method is an effective approach for estimating the shear elasticity and the shear viscosity in liver and in other soft tissue.
Accelerating numerical solution of stochastic differential equations with CUDA
NASA Astrophysics Data System (ADS)
Januszewski, M.; Kostur, M.
2010-01-01
Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them with two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675× compared to a typical CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively. Program summary: Program title: SDE. Catalogue identifier: AEFG_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFG_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GNU GPL v3. No. of lines in distributed program, including test data, etc.: 978. No. of bytes in distributed program, including test data, etc.: 5905. Distribution format: tar.gz. Programming language: CUDA C. Computer: any system with a CUDA-compatible GPU. Operating system: Linux. RAM: 64 MB of GPU memory. Classification: 4.3. External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the GNU Scientific Library v1.0 or newer. Optionally gnuplot is recommended for quick visualization of the results. Nature of problem: Direct numerical integration of stochastic differential equations is a computationally intensive problem, due to the necessity of calculating multiple independent realizations of the system. We exploit the inherent parallelism of this problem and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute hundreds of threads simultaneously makes it possible to speed up the computation by over two orders of magnitude, compared to a typical modern CPU. Solution method: The stochastic Runge-Kutta method of the second order is applied to integrate the equation of motion. Ensemble-averaged quantities of interest are obtained through averaging over multiple independent realizations of the system. Unusual features: The numerical solution of the stochastic differential equations in question is performed on a GPU using the CUDA environment. Running time: < 1 minute.
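The parallelization pattern is one thread per independent realization, each with its own RNG state. The following minimal CUDA sketch integrates the noisy Josephson phase dynamics dphi = (i0 - sin(phi)) dt + sqrt(2*D*dt) dW with Euler-Maruyama for brevity (the SDE package itself uses a second-order stochastic Runge-Kutta step); RNG states are assumed to have been initialized elsewhere with curand_init, and all names are illustrative.

#include <curand_kernel.h>

// Each thread integrates one independent realization of the overdamped
// Josephson phase dynamics for nSteps steps of size dt.
__global__ void sdeEnsemble(float* phi, curandState* states,
                            int nPaths, int nSteps,
                            float i0, float D, float dt)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPaths) return;

    curandState st = states[p];          // per-thread RNG state
    float x = phi[p];
    float sig = sqrtf(2.0f * D * dt);
    for (int s = 0; s < nSteps; ++s) {
        float xi = curand_normal(&st);   // N(0,1) Wiener increment
        x += (i0 - sinf(x)) * dt + sig * xi;
    }
    phi[p] = x;
    states[p] = st;
}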
Next-generation acceleration and code optimization for light transport in turbid media using GPUs
Alerstam, Erik; Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Andersson-Engels, Stefan; Lilge, Lothar
2010-01-01
A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml). PMID:21258498
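The shared-memory scheme the paper proposes can be illustrated with a block-local tally that is flushed to global memory once per block, replacing one global atomic per interaction with one per bin per block. The sketch below assumes the tally fits in shared memory (launch with nBins*sizeof(float) dynamic shared memory); GPU-MCML's actual scheme is more refined, and all names are assumptions.

#include <cuda_runtime.h>

// Threads in a block score absorbed photon weight into a shared-memory
// copy of the tally, then cooperatively flush the block's totals to the
// global tally, cutting slow global atomics dramatically.
__global__ void scoreAbsorption(const int* binOfEvent, const float* wgt,
                                int nEvents, float* globalTally, int nBins)
{
    extern __shared__ float sTally[];
    for (int b = threadIdx.x; b < nBins; b += blockDim.x) sTally[b] = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < nEvents; i += gridDim.x * blockDim.x)
        atomicAdd(&sTally[binOfEvent[i]], wgt[i]);   // fast shared atomics
    __syncthreads();

    for (int b = threadIdx.x; b < nBins; b += blockDim.x)
        atomicAdd(&globalTally[b], sTally[b]);       // one flush per block
}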
Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George
2014-07-01
Using graphics processing unit (GPU) hardware technology, an extremely fast Monte Carlo (MC) code, ARCHER-RT, was developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER-RT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER-RT and the general-purpose code GEANT4. A gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHER-RT agree well with DOSXYZnrc. For the clinical cases, results from ARCHER-RT are compared with those from GEANT4 and good agreement is observed. The gamma index test is performed for voxels whose dose is greater than 10% of the maximum dose. For the 2%/2mm criteria, the passing rates for the prostate, lung, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to the specific architecture of the GPU, the modified Woodcock tracking algorithm performed worse than the original one. ARCHER-RT achieves a fast speed for PSF-based dose calculations. With a single M2090 card, the simulations cost about 60, 50, and 80 s for the three cases, respectively, with a 1% statistical error in the PTV. Using the latest K40 card, the simulations are 1.7-1.8 times faster. More impressively, six M2090 cards can finish the simulations in 8.9-13.4 s. For comparison, the same simulations on an Intel E5-2620 (12 hyperthreads) cost about 500-800 s. ARCHER-RT was developed successfully to perform fast and accurate MC dose calculation for radiotherapy using PSFs and patient CT phantoms.
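Woodcock (delta) tracking, the sampling scheme compared above, lets particles traverse a voxelized phantom without explicit boundary crossings: free flights are sampled from a majorant cross section and collisions are accepted with probability sigma(x)/sigMax. A hedged CUDA device-code sketch follows; the voxel lookup, data layout, and names are illustrative assumptions, not ARCHER-RT's implementation.

#include <cuda_runtime.h>
#include <curand_kernel.h>

// Nearest-voxel cross-section lookup on a uniform grid with spacing h.
__device__ float voxelSigma(const float* sig, float3 p,
                            int nx, int ny, int nz, float h)
{
    int i = min(max((int)(p.x / h), 0), nx - 1);
    int j = min(max((int)(p.y / h), 0), ny - 1);
    int k = min(max((int)(p.z / h), 0), nz - 1);
    return sig[(k * ny + j) * nx + i];
}

// Sample the distance to the next real collision by Woodcock tracking:
// fly through a fictitious homogeneous medium with majorant sigMax and
// accept each tentative collision with probability sigma(p)/sigMax.
__device__ float woodcockDistance(const float* sig, float sigMax,
                                  float3 pos, float3 dir,
                                  int nx, int ny, int nz, float h,
                                  curandState* st)
{
    float t = 0.0f;
    while (true) {
        t += -logf(curand_uniform(st)) / sigMax;   // majorant free flight
        float3 p = make_float3(pos.x + t * dir.x,
                               pos.y + t * dir.y,
                               pos.z + t * dir.z);
        if (curand_uniform(st) * sigMax <= voxelSigma(sig, p, nx, ny, nz, h))
            return t;        // real collision; otherwise virtual, keep flying
    }
}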
Qin, Nan; Botas, Pablo; Giantsoudi, Drosoula; Schuemann, Jan; Tian, Zhen; Jiang, Steve B.; Paganetti, Harald; Jia, Xun
2016-01-01
Monte Carlo (MC) simulation is commonly considered the most accurate dose calculation method for proton therapy. Aiming at achieving fast MC dose calculations for clinical applications, we previously developed a GPU-based MC tool, gPMC. In this paper, we report our recent updates to gPMC in terms of its accuracy, portability, and functionality, as well as comprehensive tests of this tool. The new version, gPMC v2.0, was developed under the OpenCL environment to enable portability across different computational platforms. Physics models of nuclear interactions were refined to improve calculation accuracy. The scoring functions of gPMC were expanded to enable tallying particle fluence, dose deposited by different particle types, and dose-averaged linear energy transfer (LETd). A multiple-counter approach was employed to improve efficiency by reducing the frequency of memory-writing conflicts at scoring. For dose calculation, accuracy improvements over gPMC v1.0 were observed in both water phantom cases and a patient case. For a prostate cancer case planned using high-energy proton beams, dose discrepancies in the beam entrance and target region seen in gPMC v1.0 with respect to TOPAS, the gold standard tool for proton Monte Carlo simulations, were substantially reduced, and the gamma test passing rate (1%/1mm) was improved from 82.7% to 93.1%. The average relative difference in LETd between gPMC and TOPAS was 1.7%. Average relative differences in dose deposited by primary, secondary, and other heavier particles were within 2.3%, 0.4%, and 0.2%, respectively. Depending on the source proton energy and phantom complexity, it took 8 to 17 seconds on an AMD Radeon R9 290x GPU to simulate 10^7 source protons, achieving less than 1% average statistical uncertainty. As the beam size was reduced from 10×10 cm^2 to 1×1 cm^2, the time spent on scoring increased by only 4.8% with eight counters, in contrast to a 40% increase using only one counter. With the OpenCL environment, the portability of gPMC v2.0 was enhanced. It was successfully executed on different CPUs and GPUs, and its performance on different devices varied depending on processing power and hardware structure. PMID:27694712
NASA Astrophysics Data System (ADS)
McClure, J. E.; Prins, J. F.; Miller, C. T.
2014-07-01
Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular "color" LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared-memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide a speedup of 2.6× compared to the multi-core CPU solution and 1.8× compared to the GPU solution, due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
Anandakrishnan, Ramu; Scogland, Tom R W; Fenley, Andrew T; Gordon, John C; Feng, Wu-chun; Onufriev, Alexey V
2010-06-01
Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed-up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson-Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multi-scale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone. Copyright (c) 2010 Elsevier Inc. All rights reserved.
Xu, Daguang; Huang, Yong; Kang, Jin U
2014-06-16
We implemented graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). The Kaiser-Bessel (KB) function and the Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm, with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT-based CS SD OCT, it shows lower background noise and fewer side lobes while eliminating the need for the cumbersome k-space grid filling and k-linear calibration procedures. Finally, we demonstrated that by using a conventional desktop computer architecture with three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT-based CS SD OCT with a frame size of 2048 (axial) × 1000 (lateral).
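The gridding step of the NUFFT referred to above spreads each non-uniform k-space sample onto nearby points of an oversampled uniform grid, weighted by the convolution kernel. Below is a hedged 1-D CUDA sketch using a Kaiser-Bessel window; the parameter names, the per-A-line 1-D treatment, and the omitted deapodization/normalization are assumptions for illustration.

#include <cuda_runtime.h>

// One thread per non-uniform k-space sample: spread its complex value
// onto the uniform grid points within half-width W/2, weighted by a
// Kaiser-Bessel kernel of shape beta. atomicAdd resolves collisions.
__global__ void kbGridding(const float* kPos,      // sample positions (grid units)
                           const float2* val,      // complex sample values
                           float2* grid, int nGrid,
                           int nSamp, int W, float beta)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSamp) return;

    float kc = kPos[s];
    int lo = (int)ceilf(kc - 0.5f * W);
    int hi = (int)floorf(kc + 0.5f * W);
    for (int g = lo; g <= hi; ++g) {
        if (g < 0 || g >= nGrid) continue;
        float u = 2.0f * (g - kc) / W;               // in [-1, 1]
        float w = cyl_bessel_i0f(beta * sqrtf(fmaxf(1.0f - u * u, 0.0f)));
        atomicAdd(&grid[g].x, w * val[s].x);
        atomicAdd(&grid[g].y, w * val[s].y);
    }
}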
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chung, H; Lee, J; Pua, R
2014-06-01
Purpose: The purpose of our study is to reduce imaging radiation dose while maintaining the image quality of the region of interest (ROI) in X-ray fluoroscopy. A low-dose real-time ROI fluoroscopic imaging technique, which includes graphics-processing-unit- (GPU-) accelerated image processing for brightness compensation and noise filtering, was developed in this study. Methods: In our ROI fluoroscopic imaging, a copper filter is placed in front of the X-ray tube. The filter contains a round aperture to reduce the radiation dose outside of the aperture. To equalize the brightness difference between the inner and outer ROI regions, brightness compensation was performed by use of a simple weighting method that applies selectively to the inner ROI, the outer ROI, and the boundary zone. A bilateral filter was applied to the images to reduce the relatively high noise in the outer ROI images. To speed up the calculation for real-time application, GPU acceleration was applied to the image processing algorithm. We performed a dosimetric measurement using an ion-chamber dosimeter to evaluate the amount of radiation dose reduction. The reduction in calculation time compared to a CPU-only computation was also measured, and an assessment of image quality in terms of image noise and spatial resolution was conducted. Results: More than 80% of the dose was reduced by use of the ROI filter. The reduction rate depended on the thickness of the filter and the size of the ROI aperture. The image noise outside the ROI was remarkably reduced by the bilateral filtering technique. The computation time for processing each frame was reduced from 3.43 seconds with a single CPU to 9.85 milliseconds with GPU acceleration. Conclusion: The proposed technique for X-ray fluoroscopy can substantially reduce imaging radiation dose to the patient while maintaining image quality, particularly in the ROI, in real time.
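A per-pixel CUDA sketch of the brightness-compensation idea follows: unit gain inside the aperture, a fixed compensating gain outside, and a linear blend across the boundary zone. The gain model and all names are illustrative assumptions, not the calibrated weights of the actual system.

#include <cuda_runtime.h>

// One thread per pixel: apply a radially selected gain so the copper-
// filtered outer region matches the brightness of the inner ROI, with
// a linear blend over [rIn, rOut] to avoid a visible seam.
__global__ void roiCompensate(const unsigned short* in, float* out,
                              int w, int h, float cx, float cy,
                              float rIn, float rOut, float outerGain)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float dx = x - cx, dy = y - cy;
    float r = sqrtf(dx * dx + dy * dy);

    float gain;
    if (r <= rIn)        gain = 1.0f;                 // inside ROI
    else if (r >= rOut)  gain = outerGain;            // attenuated region
    else {                                            // boundary zone
        float t = (r - rIn) / (rOut - rIn);
        gain = 1.0f + t * (outerGain - 1.0f);
    }
    out[y * w + x] = gain * (float)in[y * w + x];
}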
A Joint Method of Envelope Inversion Combined with Hybrid-domain Full Waveform Inversion
NASA Astrophysics Data System (ADS)
CUI, C.; Hou, W.
2017-12-01
Full waveform inversion (FWI) aims to construct high-precision subsurface models by fully using the information in seismic records, including amplitude, travel time, phase and so on. However, high non-linearity and the absence of low-frequency information in seismic data lead to the well-known cycle-skipping problem and make the inversion fall easily into local minima. In addition, 3D inversion methods based on the acoustic approximation ignore the elastic effects present in real seismic fields, which makes the inversion harder. As a result, the accuracy of the final inversion results relies heavily on the quality of the initial model. To improve the stability and quality of the inversion results, multi-scale inversion, which reconstructs the subsurface model from low to high frequency, is applied. However, the absence of very low frequencies (< 3 Hz) in field data is still a bottleneck in FWI. By extracting ultra-low-frequency data from field data, envelope inversion is able to recover a low-wavenumber model with a demodulation operator (the envelope operator), even though such low-frequency data do not really exist in the field data. To improve the efficiency and viability of the inversion, in this study we propose a joint method of envelope inversion combined with hybrid-domain FWI. First, we developed 3D elastic envelope inversion, and the misfit function and the corresponding gradient operator were derived. Then we performed hybrid-domain FWI with the envelope inversion result as the initial model, which provides the low-wavenumber component of the model. Here, forward modeling is implemented in the time domain and inversion in the frequency domain. To accelerate the inversion, we adopt CPU/GPU heterogeneous computing techniques with two levels of parallelism. At the first level, the inversion tasks are decomposed and assigned to each computation node by shot number. At the second level, GPU multithreaded programming is used for the computation tasks in each node, including forward modeling, envelope extraction, DFT (discrete Fourier transform) calculation and gradient calculation. Numerical tests demonstrated that the combined envelope inversion + hybrid-domain FWI obtains a much more faithful and accurate result than conventional hybrid-domain FWI, and that the CPU/GPU heterogeneous parallel computation substantially improves performance.
Extending the BEAGLE library to a multi-FPGA platform
2013-01-01
Background: Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine-grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1. Results: The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE's CPU implementation on a dual Xeon 5520 and a 3X speedup versus BEAGLE's GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt. Conclusions: The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirements on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor. PMID:23331707
Large calculation of the flow over a hypersonic vehicle using a GPU
NASA Astrophysics Data System (ADS)
Elsen, Erich; LeGresley, Patrick; Darve, Eric
2008-12-01
Graphics processing units are capable of impressive computing performance, with peaks of up to 518 Gflops. Various groups have been using these processors for general purpose computing; most efforts have focussed on demonstrating relatively basic calculations, e.g. numerical linear algebra, or physical simulations for visualization purposes with limited accuracy. This paper describes the simulation of a hypersonic vehicle configuration with detailed geometry and accurate boundary conditions using the compressible Euler equations. To the authors' knowledge, this is the most sophisticated calculation of this kind in terms of the complexity of the geometry, the physical model, the numerical methods employed, and the accuracy of the solution. The Navier-Stokes Stanford University Solver (NSSUS) was used for this purpose. NSSUS is a multi-block structured code with a provably stable and accurate numerical discretization which uses a vertex-based finite-difference method. A multi-grid scheme is used to accelerate the solution of the system. Based on a comparison of the Intel Core 2 Duo and NVIDIA 8800GTX, speed-ups of over 40× were demonstrated for simple test geometries and 20× for complex geometries.
Lok, U-Wai; Li, Pai-Chi
2016-03-01
Graphics processing unit (GPU)-based software beamforming has advantages over hardware-based beamforming of easier programmability and a faster design cycle, since complicated imaging algorithms can be efficiently programmed and modified. However, the need for a high data rate when transferring ultrasound radio-frequency (RF) data from the hardware front end to the software back end limits the real-time performance. Data compression methods can be applied to the hardware front end to mitigate the data transfer issue. Nevertheless, most decompression processes cannot be performed efficiently on a GPU, thus becoming another bottleneck of the real-time imaging. Moreover, lossless (or nearly lossless) compression is desirable to avoid image quality degradation. In a previous study, we proposed a real-time lossless compression-decompression algorithm and demonstrated that it can reduce the overall processing time because the reduction in data transfer time is greater than the computation time required for compression/decompression. This paper analyzes the lossless compression method in order to understand the factors limiting the compression efficiency. Based on the analytical results, a nearly lossless compression is proposed to further enhance the compression efficiency. The proposed method comprises a transformation coding method involving modified lossless compression that aims at suppressing amplitude data. The simulation results indicate that the compression ratio (CR) of the proposed approach can be enhanced from nearly 1.8 to 2.5, thus allowing a higher data acquisition rate at the front end. The spatial and contrast resolutions with and without compression were almost identical, and the process of decompressing the data of a single frame on a GPU took only several milliseconds. Moreover, the proposed method has been implemented in a 64-channel system that we built in-house to demonstrate the feasibility of the proposed algorithm in a real system. It was found that channel data from a 64-channel system can be transferred using the standard USB 3.0 interface in most practical imaging applications.
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
NASA Astrophysics Data System (ADS)
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high-level programming language (Python) to achieve state-of-the-art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single-component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GNU Lesser General Public License, version 3. No. of lines in distributed program, including test data, etc.: 225864. No. of bytes in distributed program, including test data, etc.: 46861049. Distribution format: tar.gz. Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy. Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice Boltzmann method framework. Simulations can be run in single or double precision using one or more GPUs. Restrictions: The lattice Boltzmann method works for low-Mach-number flows only. Unusual features: The actual numerical calculations run exclusively on GPUs. The numerical code is built dynamically at run-time in CUDA C or OpenCL, using templates and symbolic formulas. The high-level control of the simulation is maintained by a Python process. Additional comments: The distribution file for this program is over 45 Mbytes and therefore is not delivered directly when Download or Email is requested; instead an HTML file giving details of how the program can be obtained is sent. Running time: Problem-dependent, typically minutes (for small cases or short simulations) to hours (large cases or long simulations).
NASA Astrophysics Data System (ADS)
Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.
2014-11-01
Existing research on patient-specific computational hemodynamics (PSCH) relies heavily on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases the computational burden and introduces inaccuracy during data transformation, thus limiting the medical applications of PSCH. We use the lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments feed seamlessly into the LBM-based CFD computation of PSCH, so explicit mesh construction and extra data management are avoided. The LBM is ideally suited to GPU (graphics processing unit)-based parallel computing. The parallel acceleration on GPU achieves excellent performance in PSCH computation. An application study is presented which segments an aortic artery from a chest CT dataset and models the PSCH of the segmented artery.
CUDA-based acceleration of collateral filtering in brain MR images
NASA Astrophysics Data System (ADS)
Li, Cheng-Yuan; Chang, Herng-Hua
2017-02-01
Image denoising is one of the fundamental and essential tasks within image processing. In medical imaging, finding an effective algorithm that can remove random noise in MR images is important. This paper proposes an effective noise reduction method for brain magnetic resonance (MR) images. Our approach is based on the collateral filter which is a more powerful method than the bilateral filter in many cases. However, the computation of the collateral filter algorithm is quite time-consuming. To solve this problem, we improved the collateral filter algorithm with parallel computing using GPU. We adopted CUDA, an application programming interface for GPU by NVIDIA, to accelerate the computation. Our experimental evaluation on an Intel Xeon CPU E5-2620 v3 2.40GHz with a NVIDIA Tesla K40c GPU indicated that the proposed implementation runs dramatically faster than the traditional collateral filter. We believe that the proposed framework has established a general blueprint for achieving fast and robust filtering in a wide variety of medical image denoising applications.
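As a sketch of the parallelization pattern, the following CUDA kernel implements the bilateral baseline (one thread per pixel, spatial and range Gaussian weights); the collateral filter studied in the paper extends this scheme, so the kernel below is only the starting point, with assumed parameter names.

#include <cuda_runtime.h>

// Bilateral filter for one slice: each output voxel is a normalized sum
// of neighbors weighted by spatial distance (sigmaS) and intensity
// difference (sigmaR). Borders are handled by clamping.
__global__ void bilateral2D(const float* in, float* out, int w, int h,
                            int radius, float sigmaS, float sigmaR)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float center = in[y * w + x];
    float sum = 0.0f, norm = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        int yy = min(max(y + dy, 0), h - 1);
        for (int dx = -radius; dx <= radius; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);
            float v  = in[yy * w + xx];
            float ws = expf(-(dx * dx + dy * dy) / (2.0f * sigmaS * sigmaS));
            float wr = expf(-(v - center) * (v - center) / (2.0f * sigmaR * sigmaR));
            sum  += ws * wr * v;
            norm += ws * wr;
        }
    }
    out[y * w + x] = sum / norm;
}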
Watanabe, Yuuki; Takahashi, Yuhei; Numazawa, Hiroshi
2014-02-01
We demonstrate intensity-based optical coherence tomography (OCT) angiography using the squared difference of two sequential frames with bulk-tissue-motion (BTM) correction. This motion correction was performed by minimization of the sum of the pixel values using axial- and lateral-pixel-shifted structural OCT images. We extract the BTM-corrected image from a total of 25 calculated OCT angiographic images. Image processing was accelerated by a graphics processing unit (GPU) with many stream processors to optimize the parallel processing procedure. The GPU processing rate was faster than that of a line scan camera (46.9 kHz). Our OCT system provides the means of displaying structural OCT images and BTM-corrected OCT angiographic images in real time.
GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.
Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd
2018-01-01
In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
NASA Astrophysics Data System (ADS)
Lyakh, Dmitry I.
2015-04-01
An efficient parallel tensor transpose algorithm is presented for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPUs and the use of shared memory on NVidia GPUs. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on the higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on the NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
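The 2-D core of such tensor transposes is the classic shared-memory tiled matrix transpose: read a tile with coalesced accesses, exchange indices in on-chip memory, write back coalesced. A minimal CUDA sketch follows; higher-dimensional permutations in the paper reduce to compositions of such tiled moves, and the tile size here is an illustrative choice.

#include <cuda_runtime.h>

#define TILE 32

// Shared-memory tiled transpose: the +1 padding avoids shared-memory
// bank conflicts, and both the read and the write are coalesced.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // transposed coordinates: block (bx, by) writes to block (by, bx)
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}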
Extending the length and time scales of Gram–Schmidt Lyapunov vector computations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Costa, Anthony B., E-mail: acosta@northwestern.edu; Green, Jason R., E-mail: jason.green@umb.edu; Department of Chemistry, University of Massachusetts Boston, Boston, MA 02125
Lyapunov vectors have found growing interest recently due to their ability to characterize systems out of thermodynamic equilibrium. The computation of orthogonal Gram–Schmidt vectors requires multiplication and QR decomposition of large matrices, which grow as N² with the particle count. This expense has limited such calculations to relatively small systems and short time scales. Here, we detail two implementations of an algorithm for computing Gram–Schmidt vectors. The first is a distributed-memory message-passing method using ScaLAPACK. The second uses the newly released MAGMA library for GPUs. We compare the performance of both codes for Lennard–Jones fluids from N=100 to 1300 between Intel Nehalem/InfiniBand DDR and NVIDIA C2050 architectures. To the best of our knowledge, these are the largest systems for which the Gram–Schmidt Lyapunov vectors have been computed, and the first time their calculation has been GPU-accelerated. We conclude that Lyapunov vector calculations can be significantly extended in length and time by leveraging the power of GPU-accelerated linear algebra.
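For context, in the standard Benettin-type scheme behind such calculations, the tangent-space matrix is repeatedly evolved and QR-factorized, and the exponents follow from the accumulated logarithms of the diagonal of the R factors:

    \lambda_i = \lim_{n \to \infty} \frac{1}{n\,\Delta t} \sum_{k=1}^{n} \ln \left| (R_k)_{ii} \right|

It is this repeated dense QR step that ScaLAPACK and MAGMA accelerate.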
TU-AB-202-05: GPU-Based 4D Deformable Image Registration Using Adaptive Tetrahedral Mesh Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhong, Z; Zhuang, L; Gu, X
Purpose: Deformable image registration (DIR) is employed today as an automated and effective segmentation method to transfer tumor or organ contours from the planning image to daily images, instead of manual segmentation. However, the computational time and accuracy of current DIR approaches are still insufficient for online adaptive radiation therapy (ART), which requires real-time, high-quality image segmentation, especially for large 4D-CT datasets. The objective of this work is to propose a new DIR algorithm with fast computational speed and high accuracy, using adaptive feature-based tetrahedral meshing and GPU-based parallelization. Methods: The first step is to generate the adaptive tetrahedral mesh based on the image features of a reference phase of the 4D-CT, so that the deformation can be well captured and accurately diffused from the mesh vertices to the voxels of the image volume. Subsequently, the deformation vector fields (DVF) of the other phases of the 4D-CT can be obtained by matching each phase of the target 4D-CT images with the corresponding deformed reference phase. The proposed 4D DIR method is implemented on GPU, significantly increasing computational efficiency through parallel computing. Results: A 4D NCAT digital phantom was used to test the efficiency and accuracy of our method. Both the image and DVF results show that the fine structures and shapes of the lung are well preserved and the tumor position is well captured, i.e., a 3D distance error of 1.14 mm. Compared to a previous voxel-based CPU implementation of DIR, such as demons, the proposed method is about 160x faster for registering a 10-phase 4D-CT with a phase dimension of 256×256×150. Conclusion: The proposed 4D DIR method uses feature-based meshing and GPU-based parallelism, demonstrating the capability to compute both high-quality image and motion results with a significant improvement in computational speed.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae, E-mail: suhsanta@catholic.ac.kr
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
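For reference, the ordered-subset EM update on which such modified reconstructions are built has the standard form, with image x, measured projections y, system matrix a_{ij}, and subset S_b:

    x_j^{(k+1)} = \frac{x_j^{(k)}}{\sum_{i \in S_b} a_{ij}} \sum_{i \in S_b} a_{ij} \frac{y_i}{\sum_{l} a_{il} x_l^{(k)}}

The per-voxel independence of this update is what makes it map naturally onto GPU threads.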
TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, S; Suh, T; Yoon, D
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.
High performance MRI simulations of motion on multi-GPU systems
2014-01-01
Background MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Methods Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to avoid spurious echo formation over the long execution times. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Results Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. The introduction of software crushers improved simulation performance and image quality without further increasing the computational load and GPU resources. Last, MRISIMUL demonstrated almost linearly scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. Conclusions MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation of realistic motion models, such as cardiac motion, respiratory motion and flow, may benefit the design and optimization of existing or new MR pulse sequences, protocols and algorithms that examine motion-related MR applications. PMID:24996972
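As a rough sketch of the per-isochromat work inside such a GPU kernel (illustrative only; the actual MRISIMUL kernel also applies RF events and the motion-model displacement at each timestep), a free-precession-plus-relaxation update might look like:

    // One Bloch timestep per isochromat: precession about z by the local field
    // offset dB, then T1/T2 relaxation. All names here are illustrative.
    __global__ void blochStep(float3* M, const float* dB, int n,
                              float dt, float T1, float T2, float M0)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float phi = 2.0f * 3.14159265f * 42.577e6f * dB[i] * dt; // gamma*dB*dt
        float c = cosf(phi), s = sinf(phi);
        float mx =  c * M[i].x + s * M[i].y;
        float my = -s * M[i].x + c * M[i].y;
        float e1 = expf(-dt / T1), e2 = expf(-dt / T2);
        M[i] = make_float3(mx * e2, my * e2, M0 + (M[i].z - M0) * e1);
    }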
DOE Office of Scientific and Technical Information (OSTI.GOV)
Taleei, R; Qin, N; Jiang, S
2016-06-15
Purpose: Biological treatment plan optimization is of great interest for proton therapy. It requires extensive Monte Carlo (MC) simulations to compute physical dose and biological quantities. Recently, a gPMC package was developed for rapid MC dose calculations on a GPU platform. This work investigated its suitability for proton therapy biological optimization in terms of accuracy and efficiency. Methods: We performed simulations of a proton pencil beam with energies of 75, 150 and 225 MeV in a homogeneous water phantom using gPMC and FLUKA. Physical dose and energy spectra for each ion type on the central beam axis were scored. Relative Biological Effectiveness (RBE) was calculated using the repair-misrepair-fixation model. Microdosimetry calculations were performed using Monte Carlo Damage Simulation (MCDS). Results: Ranges computed by the two codes agreed within 1 mm. The physical dose difference was less than 2.5% at the Bragg peak. RBE-weighted dose agreed within 5% at the Bragg peak. Differences in microdosimetric quantities such as dose-averaged lineal energy and specific energy were <10%. The simulation time per source particle with FLUKA was 0.0018 sec, while gPMC was ∼600 times faster. Conclusion: Physical dose computed by FLUKA and gPMC were in good agreement. The RBE differences along the central axis were small, and the RBE-weighted dose difference was found to be acceptable. The combined accuracy and efficiency makes gPMC suitable for proton therapy biological optimization.
CudaChain: an alternative algorithm for finding 2D convex hulls on the GPU.
Mei, Gang
2016-01-01
This paper presents an alternative GPU-accelerated convex hull algorithm and a novel Sorting-based Preprocessing Approach (SPA) for planar point sets. The proposed convex hull algorithm, termed CudaChain, consists of two stages: (1) two rounds of preprocessing performed on the GPU and (2) finalization of the expected convex hull on the CPU. Interior points lying inside a quadrilateral formed by four extreme points are first discarded, and the remaining points are then distributed into several (typically four) subregions. Each subset of points is first sorted in parallel; the second round of discarding is then performed using SPA; and finally a simple chain is formed from the points that remain. A simple polygon can be easily generated by directly connecting all the chains in the subregions. The expected convex hull of the input points is finally obtained by calculating the convex hull of this simple polygon. The Thrust library is utilized to realize the parallel sorting, reduction, and partitioning, for better efficiency and simplicity. Experimental results show that: (1) SPA can very effectively detect and discard interior points; and (2) CudaChain achieves 5×-6× speedups over the widely used Qhull implementation for 20M points.
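The first discarding round maps naturally onto a Thrust stream compaction; the sketch below (our own illustrative code, not the CudaChain source) keeps only points outside the quadrilateral of extreme points:

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>

    // Predicate: true if p lies outside the counter-clockwise quad q[0..3],
    // i.e. strictly right of at least one directed edge.
    struct OutsideQuad {
        float2 q[4];
        __host__ __device__ bool operator()(float2 p) const {
            for (int i = 0; i < 4; ++i) {
                float2 a = q[i], b = q[(i + 1) & 3];
                if ((b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x) < 0.0f)
                    return true;       // outside: keep this point
            }
            return false;              // inside the quad: discard
        }
    };

A single call such as thrust::copy_if(pts.begin(), pts.end(), out.begin(), OutsideQuad{{q0, q1, q2, q3}}) then compacts the survivors in one parallel pass.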
GASPACHO: a generic automatic solver using proximal algorithms for convex huge optimization problems
NASA Astrophysics Data System (ADS)
Goossens, Bart; Luong, Hiêp; Philips, Wilfried
2017-08-01
Many inverse problems (e.g., demosaicking, deblurring, denoising, image fusion, HDR synthesis) share various similarities: degradation operators are often modeled by a specific data fitting function while image prior knowledge (e.g., sparsity) is incorporated by additional regularization terms. In this paper, we investigate automatic algorithmic techniques for evaluating proximal operators. These algorithmic techniques also enable efficient calculation of adjoints from linear operators in a general matrix-free setting. In particular, we study the simultaneous-direction method of multipliers (SDMM) and the parallel proximal algorithm (PPXA) solvers and show that the automatically derived implementations are well suited for both single-GPU and multi-GPU processing. We demonstrate this approach for an Electron Microscopy (EM) deconvolution problem.
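For reference, the object both SDMM and PPXA evaluate repeatedly is the proximal operator of each term f in the objective,

    \mathrm{prox}_{\gamma f}(v) = \arg\min_{x} \left( f(x) + \frac{1}{2\gamma} \| x - v \|_2^2 \right)

and the solvers split the cost into a sum of such terms whose proximal operators (and the adjoints of the linear operators involved) can be evaluated independently, which is what maps well onto single- and multi-GPU processing.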
Quan, Guotao; Gong, Hui; Deng, Yong; Fu, Jianwei; Luo, Qingming
2011-02-01
High-speed fluorescence molecular tomography (FMT) reconstruction for 3-D heterogeneous media is still one of the most challenging problems in diffuse optical fluorescence imaging. In this paper, we propose a fast FMT reconstruction method based on Monte Carlo (MC) simulation and accelerated by a cluster of graphics processing units (GPUs). Based on the Message Passing Interface standard, we modified the MC code for fast FMT reconstruction, and the different Green's functions representing the flux distribution in the media are calculated simultaneously by different GPUs in the cluster. A load-balancing method was also developed to increase computational efficiency. By applying the Fréchet derivative, a Jacobian matrix is formed to reconstruct the distribution of the fluorochromes using the calculated Green's functions. Phantom experiments have shown that only 10 min are required to obtain reconstruction results with a cluster of 6 GPUs, rather than 6 h with a cluster of dual-Opteron CPU nodes. Because of the high accuracy of MC simulation and its suitability for 3-D heterogeneous media with refractive-index-mismatched boundaries, the GPU-cluster-accelerated method provides a reliable approach to high-speed reconstruction for FMT imaging.
A GPU-based mipmapping method for water surface visualization
NASA Astrophysics Data System (ADS)
Li, Hua; Quan, Wei; Xu, Chao; Wu, Yan
2018-03-01
Visualization of the water surface is a hot topic in computer graphics. In this paper, we present a fast method to generate a wide range of water surface with good image quality both near and far from the viewpoint. The method uses a uniform mesh and fractal Perlin noise to model the water surface. Mipmapping is applied to the surface textures, which adjusts the texture resolution with distance from the viewpoint and reduces the computing cost. Lighting effects are computed using shadow mapping, Snell's law, and the Fresnel term. The rendering pipeline uses a CPU-GPU shared memory structure, which improves rendering efficiency. Experimental results show that our approach visualizes the water surface with good image quality at real-time frame rates.
Hadoop-MCC: Efficient Multiple Compound Comparison Algorithm Using Hadoop.
Hua, Guan-Jie; Hung, Che-Lun; Tang, Chuan Yi
2018-01-01
In the past decade, drug design technologies have improved enormously. Computer-aided drug design (CADD) has played an important role in analysis and prediction during drug development, making the procedure more economical and efficient. However, computation with big data, such as ZINC with more than 60 million compounds and GDB-13 with more than 930 million small molecules, is a notoriously time-consuming problem. We therefore propose a novel heterogeneous high-performance computing method, named Hadoop-MCC, integrating Hadoop and GPU to cope with big chemical structure data efficiently. Hadoop-MCC gains high availability and fault tolerance from Hadoop, which is used to scatter input data to GPU devices and gather the results from them. The Hadoop framework adopts a mapper/reducer computation model: in the proposed method, mappers are responsible for fetching SMILES data segments and performing the LINGO method on the GPU, and reducers then collect all comparison results produced by the mappers. Owing to the high availability of Hadoop, all LINGO computation jobs on the mappers can be completed even if some of the mappers encounter problems. LINGO comparisons are performed on each GPU device in parallel. According to the experimental results, the proposed method on multiple GPU devices achieves better computational performance than CUDA-MCC on a single GPU device. Hadoop-MCC thus attains the scalability, high availability, and fault tolerance granted by Hadoop, as well as high performance, by integrating the computational power of both Hadoop and the GPU.
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Cotta Ramusino, A.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Gianoli, A.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-03-01
This project aims to exploit the parallel computing power of a commercial Graphics Processing Unit (GPU) to implement fast pattern matching in the Ring Imaging Cherenkov (RICH) detector for the level 0 (L0) trigger of the NA62 experiment. In this approach, the ring-fitting algorithm is seedless, being fed with raw RICH data and no prior information on the ring position from other detectors. Moreover, since the L0 trigger is provided with more elaborate information than a simple multiplicity count, it achieves higher selection power. Two methods have been studied to reduce the data transfer latency from the readout boards of the detector to the GPU: the use of a dedicated NIC device driver with very low latency, and a direct data transfer protocol from a custom FPGA-based NIC to the GPU. The performance of the system developed through the FPGA approach for online multi-ring Cherenkov reconstruction, obtained during the NA62 physics runs, is presented.
Bin-Hash Indexing: A Parallel Method for Fast Query Processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng
2008-06-27
This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency and is therefore well suited to emerging commodity parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that cannot be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records can be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be evaluated simultaneously at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
NASA Astrophysics Data System (ADS)
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing, the use of concepts and paradigms introduced by continuously evolving approaches in information and communications technology (ICT) has to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype that opens up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard-specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat-specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU-accelerated tsunami simulation computations has been an integral part of this prototype, fostering early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and to react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes, to give the concept a whirl and to shape science's future. Further functionality, improvements, and possibly profound changes will have to be implemented successively based on users' evolving needs.
Blind detection of giant pulses: GPU implementation
NASA Astrophysics Data System (ADS)
Ait-Allal, Dalal; Weber, Rodolphe; Dumez-Viou, Cédric; Cognard, Ismael; Theureau, Gilles
2012-01-01
Radio astronomical pulsar observations require specific instrumentation and dedicated signal processing to cope with the dispersion caused by the interstellar medium. Moreover, the quality of observations can be limited by radio frequency interference (RFI) generated by telecommunications activity. This article presents the innovative pulsar instrumentation based on graphics processing units (GPU) which has been designed at the Nançay Radio Astronomical Observatory. In addition, for giant pulse searches, we propose a new approach that combines a hardware-efficient search method with some RFI mitigation capabilities. Although this approach is less sensitive than the classical approach, its advantage is that no a priori information on the pulsar parameters is required. Validation of a GPU implementation is under way.
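For reference, the dispersive delay that must be removed between two observing frequencies follows the standard cold-plasma relation (with DM in pc cm^{-3}):

    \Delta t \approx 4.15\,\mathrm{ms} \times \mathrm{DM} \times \left[ \left( \frac{\nu_{\mathrm{lo}}}{\mathrm{GHz}} \right)^{-2} - \left( \frac{\nu_{\mathrm{hi}}}{\mathrm{GHz}} \right)^{-2} \right]

Searching blindly over the unknown DM multiplies the workload, which is what makes GPU throughput attractive here.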
GPU computing with Kaczmarz’s and other iterative algorithms for linear systems
Elble, Joseph M.; Sahinidis, Nikolaos V.; Vouzis, Panagiotis
2009-01-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. PMID:20526446
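For reference, the classical Kaczmarz sweep underlying several of these methods projects the iterate onto one row constraint of Ax = b at a time:

    x^{(k+1)} = x^{(k)} + \frac{b_i - \langle a_i, x^{(k)} \rangle}{\| a_i \|_2^2} \, a_i

Cimmino-type and CARP-type variants instead average many such row projections, which is precisely the structure that exposes parallelism on the GPU.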
Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.; ...
2017-10-14
We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processing unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.
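The interoperability pattern the paper relies on can be sketched as follows (a minimal illustration with our own names, assuming the array is already resident on the device inside an enclosing OpenACC data region; this is not the ACCPFDD-CUFFT source):

    #include <cufft.h>

    // Hand OpenACC-managed device data to cuFFT without a host round-trip.
    void fft_field(cufftDoubleComplex* phi, int nx, int ny, int nz)
    {
        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);
        #pragma acc host_data use_device(phi)
        {
            cufftExecZ2Z(plan, phi, phi, CUFFT_FORWARD);  // in-place forward FFT
        }
        cufftDestroy(plan);
    }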
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eghtesad, Adnan; Germaschewski, Kai; Beyerlein, Irene J.
We present the first high-performance computing implementation of the meso-scale phase field dislocation dynamics (PFDD) model on a graphics processing unit (GPU)-based platform. The implementation takes advantage of the portable OpenACC standard directive pragmas along with Nvidia's compute unified device architecture (CUDA) fast Fourier transform (FFT) library called CUFFT to execute the FFT computations within the PFDD formulation on the same GPU platform. The overall implementation is termed ACCPFDD-CUFFT. The package is entirely performance portable due to the use of OPENACC-CUDA inter-operability, in which calls to CUDA functions are replaced with the OPENACC data regions for a host central processing unit (CPU) and device (GPU). A comprehensive benchmark study has been conducted, which compares a number of FFT routines, the Numerical Recipes FFT (FOURN), Fastest Fourier Transform in the West (FFTW), and the CUFFT. The last one exploits the advantages of the GPU hardware for FFT calculations. The novel ACCPFDD-CUFFT implementation is verified using the analytical solutions for the stress field around an infinite edge dislocation and subsequently applied to simulate the interaction and motion of dislocations through a bi-phase copper-nickel (Cu–Ni) interface. It is demonstrated that the ACCPFDD-CUFFT implementation on a single TESLA K80 GPU offers a 27.6X speedup relative to the serial version and a 5X speedup relative to the 22-multicore Intel Xeon CPU E5-2699 v4 @ 2.20 GHz version of the code.
Employing multi-GPU power for molecular dynamics simulation: an extension of GALAMOST
NASA Astrophysics Data System (ADS)
Zhu, You-Liang; Pan, Deng; Li, Zhan-Wei; Liu, Hong; Qian, Hu-Jun; Zhao, Yang; Lu, Zhong-Yuan; Sun, Zhao-Yan
2018-04-01
We describe the algorithm for employing multi-GPU power on the basis of Message Passing Interface (MPI) domain decomposition in the molecular dynamics code GALAMOST, which is designed for the coarse-grained simulation of soft matter. The multi-GPU version of the code is developed from our previous single-GPU version. In multi-GPU runs, each GPU takes charge of one domain and runs the single-GPU code path. The communication between neighbouring domains follows an algorithm similar to that of the CPU-based code LAMMPS, but is optimised specifically for GPUs. We employ a memory-saving design that enlarges the maximum system size attainable on the same hardware. An optimisation algorithm is employed to prolong the update period of the neighbour list. We demonstrate good performance of multi-GPU runs on simulations of a Lennard-Jones liquid, a dissipative particle dynamics liquid, a polymer-nanoparticle composite, and two-patch particles on a workstation. Good scaling across many cluster nodes is demonstrated for two-patch particles.
Efficient Implementation of MrBayes on Multi-GPU
Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-01-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are still not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of MrBayes (MC)³ for compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new “node-by-node” task scheduling strategy is developed to improve concurrency, and several optimization methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all previous (MC)³ algorithms and scales well to large GPU clusters. PMID:23493260
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are still not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of MrBayes (MC)³ for compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimization methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all previous (MC)³ algorithms and scales well to large GPU clusters.
Li, Yongbao; Tian, Zhen; Shi, Feng; Song, Ting; Wu, Zhaoxia; Liu, Yaqiang; Jiang, Steve; Jia, Xun
2015-04-07
Intensity-modulated radiation treatment (IMRT) plan optimization needs beamlet dose distributions. Pencil-beam or superposition/convolution type algorithms are typically used because of their high computational speed. However, inaccurate beamlet dose distributions may mislead the optimization process and hinder the resulting plan quality. To solve this problem, the Monte Carlo (MC) simulation method has been used to compute all beamlet doses prior to the optimization step. The conventional approach samples the same number of particles from each beamlet. Yet this is not the optimal use of MC in this problem. In fact, there are beamlets that have very small intensities after solving the plan optimization problem. For those beamlets, it may be possible to use fewer particles in dose calculations to increase efficiency. Based on this idea, we have developed a new MC-based IMRT plan optimization framework that iteratively performs MC dose calculation and plan optimization. At each dose calculation step, the particle numbers for the beamlets are adjusted based on the beamlet intensities obtained by solving the plan optimization problem in the previous iteration. We modified a GPU-based MC dose engine to allow simultaneous computation of a large number of beamlet doses. To test the accuracy of our modified dose engine, we compared the dose from a broad beam and the summed beamlet doses in this beam in an inhomogeneous phantom. Agreement within 1% for the maximum difference and 0.55% for the average difference was observed. We then validated the proposed MC-based optimization scheme in one lung IMRT case. It was found that the conventional scheme required 10⁶ particles from each beamlet to achieve an optimization result within 3% of the ground truth in the fluence map and 1% in dose. In contrast, the proposed scheme achieved the same level of accuracy with on average 1.2 × 10⁵ particles per beamlet. Correspondingly, the computation time, including both MC dose calculations and plan optimizations, was reduced by a factor of 4.4, from 494 to 113 s, using only one GPU card.
Toward GPGPU accelerated human electromechanical cardiac simulations
Vigueras, Guillermo; Roy, Ishani; Cookson, Andrew; Lee, Jack; Smith, Nicolas; Nordsletten, David
2014-01-01
In this paper, we look at the acceleration of weakly coupled electromechanics using the graphics processing unit (GPU). Specifically, we port to the GPU a number of components of Heart, a CPU-based finite element code developed for simulating multi-physics problems. On the basis of a criterion of computational cost, we implemented on the GPU the ODE and PDE solution steps for the electrophysiology problem and the Jacobian and residual evaluation for the mechanics problem. Performance of the GPU implementation is then compared with single-core CPU (SC) execution as well as multi-core CPU (MC) computations with equivalent theoretical performance. Results show that for a human-scale left ventricle mesh, GPU acceleration of the electrophysiology problem provided speedups of 164× compared with SC and 5.5× compared with MC for the solution of the ODE model. Speedups of up to 72× compared with SC and 2.6× compared with MC were also observed for the PDE solve. Using the same human geometry, the GPU implementation of the mechanics residual/Jacobian computation provided speedups of up to 44× compared with SC and 2.0× compared with MC. PMID:24115492
Rapid simulation of X-ray transmission imaging for baggage inspection via GPU-based ray-tracing
NASA Astrophysics Data System (ADS)
Gong, Qian; Stoian, Razvan-Ionut; Coccarelli, David S.; Greenberg, Joel A.; Vera, Esteban; Gehm, Michael E.
2018-01-01
We present a pipeline that rapidly simulates X-ray transmission imaging for arbitrary system architectures using GPU-based ray-tracing techniques. The purpose of the pipeline is to enable statistical analysis of threat detection in the context of airline baggage inspection. As a faster alternative to Monte Carlo methods, we adopt a deterministic approach for simulating photoelectric absorption-based imaging. The highly optimized NVIDIA OptiX API is used to implement ray-tracing, greatly speeding code execution. In addition, we implement the first hierarchical representation structure to determine the interaction path length of rays traversing heterogeneous media described by layered polygons. The accuracy of the pipeline has been validated by comparing simulated data with experimental data collected using a heterogeneous phantom and a laboratory X-ray imaging system. On a single computer, our approach allows us to generate over 400 2D transmission projections (125 × 125 pixels per frame) per hour for a bag packed with hundreds of everyday objects. By implementing our approach on cloud-based GPU computing platforms, we find that the same 2D projections of approximately 3.9 million bags can be obtained in a single day using 400 GPU instances, at a cost of only $0.001 per bag.
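The deterministic model behind such photoelectric-absorption imaging is the Beer-Lambert line integral (standard physics, shown for reference): for a ray of energy E whose intersection lengths \ell_k with materials k are produced by the ray tracer,

    I = I_0 \exp\!\left( - \sum_{k} \mu_k(E) \, \ell_k \right)

where \mu_k(E) is the linear attenuation coefficient of material k.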
Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S
2016-12-01
Hepatitis B virus (HBV) infection is strongly associated with an increased risk of liver diseases such as cirrhosis and hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with HBV activity via the interplay between aberrant viral protein release and the human immune system. Finding deletions in HBV whole-genome sequences is thus a very important issue, though mining such big and complex biological data poses underlying challenges. Although some next generation sequencing (NGS) tools have recently been designed for identifying structural variations such as insertions or deletions, their validity has generally been established only for human sequences. This design may not be suitable for viruses, which are different species. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single-instruction multiple-data paradigm, sequencing reads are treated as the multiple data and the deletion-finding procedure as the single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedure, and further validate DeF-GPU on 5 synthetic and 1 real datasets. Our results suggest that DeF-GPU outperforms the commonly used existing method Pindel and is able to exactly identify the deletions of our ground truth in a few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the fundamental techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, so their execution has long been a major problem. Despite the huge computational cost, molecular dynamics has traditionally been implemented on computers with a central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed with the goal of improving video games, by rapidly creating and displaying images in a frame buffer for output to a screen. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology for performing a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe these methodologies, developed initially for video games, and how they are now applied in MD simulations. PMID:27525251
Mermelstein, Daniel J; Lin, Charles; Nelson, Gard; Kretsch, Rachael; McCammon, J Andrew; Walker, Ross C
2018-07-15
Alchemical free energy (AFE) calculations based on molecular dynamics (MD) simulations are key tools in both improving our understanding of a wide variety of biological processes and accelerating the design and optimization of therapeutics for numerous diseases. Computing power and theory have, however, long been insufficient to enable AFE calculations to be routinely applied in early stage drug discovery. One of the major difficulties in performing AFE calculations is the length of time required for calculations to converge to an ensemble average. CPU implementations of MD-based free energy algorithms can effectively only reach tens of nanoseconds per day for systems on the order of 50,000 atoms, even running on massively parallel supercomputers. Therefore, converged free energy calculations on large numbers of potential lead compounds are often untenable, preventing researchers from gaining crucial insight into molecular recognition, potential druggability and other crucial areas of interest. Graphics Processing Units (GPUs) can help address this. We present here a seamless GPU implementation, within the PMEMD module of the AMBER molecular dynamics package, of thermodynamic integration (TI) capable of reaching speeds of >140 ns/day for a 44,907-atom system, with accuracy equivalent to the existing CPU implementation in AMBER. The implementation described here is currently part of the AMBER 18 beta code and will be an integral part of the upcoming version 18 release of AMBER.
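For reference, thermodynamic integration obtains the free energy difference between the two alchemical end states from the ensemble-averaged derivative of the lambda-coupled Hamiltonian:

    \Delta A = \int_0^1 \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda} d\lambda

In practice the integral is evaluated by quadrature over a set of lambda windows, each requiring a converged MD ensemble average, which is why raw sampling speed dominates the cost.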
GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes
NASA Astrophysics Data System (ADS)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
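The core numerical task is integrating the geodesic equation for each photon (standard general relativity, shown for context):

    \frac{d^2 x^\mu}{d\lambda^2} + \Gamma^\mu_{\alpha\beta} \frac{dx^\alpha}{d\lambda} \frac{dx^\beta}{d\lambda} = 0

Since the photons are mutually independent, one GPU thread per photon is the natural mapping, which is what makes the problem embarrassingly parallel.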
Min, Yugang; Santhanam, Anand; Neelakkantan, Harini; Ruddy, Bari H; Meeks, Sanford L; Kupelian, Patrick A
2010-09-07
In this paper, we present a graphics processing unit (GPU)-based simulation framework to calculate the delivered dose to a 3D moving lung tumor and its surrounding normal tissues, which are undergoing subject-specific lung deformations. The framework models the motion of the 3D volumetric lung tumor and its surrounding tissues, simulates the dose delivery using dose extracted from a treatment plan generated with the Pinnacle Treatment Planning System (Philips) for one of the 3DCTs of the 4DCT, and predicts the amount and location of radiation doses deposited inside the lung. The 4DCT lung datasets were registered with each other using a modified optical flow algorithm. The motion of the tumor and of the surrounding tissues was simulated by measuring the changes in lung volume during the radiotherapy treatment using spirometry. The real-time dose delivered to the tumor for each beam is generated by summing the dose delivered to the target volume at each increase in lung volume during the beam delivery period. The simulation results showed the real-time capability of the framework at 20 discrete tumor motion steps per breath, which is higher than the number of 4DCT steps (approximately 12) reconstructed during multiple breathing cycles.
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component of model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of the parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed in MATLAB on a single computer equipped with a dual Xeon 6-core E5690 CPU and an NVIDIA Tesla C2070 GPU parallel computing card containing 448 stream processors. Two different PK models with rich/sparse sampling designs were used to simulate population data for assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimates and model computation times. A speedup factor was used to assess the relative benefit of the parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. MCPEMGPU consistently achieved shorter computation times than MCPEMCPU and can offer more than a 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of the parallelized MCPEM algorithm developed in this study holds great promise as the core of the next generation of modeling software for population PK/PD analysis.
Affordable and accurate large-scale hybrid-functional calculations on GPU-accelerated supercomputers
NASA Astrophysics Data System (ADS)
Ratcliff, Laura E.; Degomme, A.; Flores-Livas, José A.; Goedecker, Stefan; Genovese, Luigi
2018-03-01
Performing high accuracy hybrid functional calculations for condensed matter systems containing a large number of atoms is at present computationally very demanding or even out of reach if high quality basis sets are used. We present a highly optimized multiple graphics processing unit implementation of the exact exchange operator which allows one to perform fast hybrid functional density-functional theory (DFT) calculations with systematic basis sets without additional approximations for up to a thousand atoms. With this method hybrid DFT calculations of high quality become accessible on state-of-the-art supercomputers within a time-to-solution that is of the same order of magnitude as traditional semilocal-GGA functionals. The method is implemented in a portable open-source library.
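For reference, the exact exchange energy whose operator dominates the cost has the Hartree-Fock form (standard expression, up to spin factors):

    E_x = -\frac{1}{2} \sum_{i,j}^{\mathrm{occ}} \iint \frac{\phi_i^*(\mathbf{r}) \phi_j(\mathbf{r}) \, \phi_j^*(\mathbf{r}') \phi_i(\mathbf{r}')}{| \mathbf{r} - \mathbf{r}' |} \, d\mathbf{r} \, d\mathbf{r}'

Each occupied orbital pair (i, j) contributes an independent Poisson-like integral, and it is these pair terms that a multi-GPU implementation can distribute.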
A GPU-based symmetric non-rigid image registration method in human lung.
Haghighi, Babak; D Ellingwood, Nathan; Yin, Youbing; Hoffman, Eric A; Lin, Ching-Long
2018-03-01
Quantitative computed tomography (QCT) of the lungs plays an increasing role in identifying sub-phenotypes of pathologies previously lumped into broad categories such as chronic obstructive pulmonary disease and asthma. Methods for image matching and linking multiple lung volumes have proven useful in linking structure to function and in the identification of regional longitudinal changes. Here, we seek to improve the accuracy of image matching via a symmetric multi-level non-rigid registration employing an inverse consistent (IC) transformation, whereby images are registered both in the forward and reverse directions. To develop the symmetric method, two similarity measures, the sum of squared intensity difference (SSD) and the sum of squared tissue volume difference (SSTVD), were used. The method is based on a novel generic mathematical framework that includes the forward and backward transformations simultaneously, eliminating the need to compute the inverse transformation. Two implementations were used to assess the proposed method: a two-dimensional (2-D) implementation using synthetic examples with SSD, and a multi-core CPU and graphics processing unit (GPU) implementation with SSTVD for three-dimensional (3-D) human lung datasets (six normal adults studied at total lung capacity (TLC) and functional residual capacity (FRC)). Success was evaluated in terms of the IC transformation consistency serving to link TLC to FRC. 2-D registration on synthetic images, using both symmetric and non-symmetric SSD methods, and comparison of displacement fields showed that the symmetric method gave a symmetrical grid shape and reduced IC errors, with mean IC errors decreased by 37%. Results for both symmetric and non-symmetric transformations of human datasets showed that the symmetric method gave better results for IC errors in all cases, with mean IC errors for the symmetric method lower than for the non-symmetric methods using both SSD and SSTVD. The GPU version demonstrated an average speedup of 43× over the single-threaded CPU version and ~5.2× over the 12-threaded CPU version. Run times with the GPU were as fast as 2 min. The symmetric method improved inverse consistency, aiding the use of image registration in the QCT-based evaluation of the lung.
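A generic symmetric SSD objective of the kind such methods minimize has the form

    E(T) = \| I_f \circ T - I_m \|_2^2 + \| I_m \circ T^{-1} - I_f \|_2^2

(shown for illustration; the framework described above optimizes the forward and backward transformations simultaneously precisely so that T^{-1} never has to be computed explicitly, and SSTVD replaces the intensity residual with a tissue-volume residual).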
Simulation-based MDP verification for leading-edge masks
NASA Astrophysics Data System (ADS)
Su, Bo; Syrel, Oleg; Pomerantsev, Michael; Hagiwara, Kazuyuki; Pearman, Ryan; Pang, Leo; Fujimura, Aki
2017-07-01
For IC design starts below the 20nm technology node, the assist features on photomasks shrink well below 60nm, and the printed patterns of those features on masks written by VSB eBeam writers start to show large deviations from the mask designs. Traditional geometry-based fracturing starts to show large errors for those small features. As a result, other mask data preparation (MDP) methods have become available and been adopted, such as rule-based Mask Process Correction (MPC), model-based MPC and, eventually, model-based MDP. The new MDP methods may place shot edges slightly differently from the target to compensate for mask process effects, so that the final patterns on a mask are much closer to the design (which can be viewed as the ideal mask), especially for the assist features. Such an alteration generally produces better masks that are closer to the intended mask design. Traditional XOR-based MDP verification cannot detect problems caused by eBeam effects. Much like model-based OPC verification, which became a necessity for OPC a decade ago, we see the same trend in MDP today. A simulation-based MDP verification solution requires a GPU-accelerated computational geometry engine with simulation capabilities. To have a meaningful simulation-based mask check, a good mask process model is needed. The TrueModel® system is a field-tested physical mask model developed by D2S. The GPU-accelerated D2S Computational Design Platform (CDP) is used to run simulation-based mask checks, as well as model-based MDP. In addition to simulation-based checks such as mask EPE or dose margin, geometry-based rules are also available to detect quality issues such as slivers or CD splits. Dose-margin-related hotspots can also be detected by setting a correct detection threshold. In this paper, we demonstrate GPU acceleration for geometry processing and give examples of mask check results and performance data. GPU acceleration is necessary to make simulation-based MDP verification acceptable.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neylon, J; Min, Y; Qi, S
2014-06-15
Purpose: Deformable image registration (DIR) plays a pivotal role in head and neck adaptive radiotherapy, but a systematic validation of DIR algorithms has been limited by a lack of quantitative high-resolution ground truth. We address this limitation by developing a GPU-based framework that provides a systematic DIR validation by generating (a) model-guided synthetic CTs representing posture and physiological changes, and (b) model-guided landmark-based validation. Methods: The GPU-based framework was developed to generate massive mass-spring biomechanical models from patient simulation CTs and contoured structures. The biomechanical model represented soft tissue deformations for known rigid skeletal motion. Posture changes were simulated by articulating the skeletal anatomy, which subsequently applied elastic corrective forces upon the soft tissue. Physiological changes such as tumor regression and weight loss were simulated in a biomechanically precise manner. Synthetic CT data were then generated from the deformed anatomy. The initial and final positions of one hundred randomly chosen mass elements inside each of the internal contoured structures were recorded as ground truth data. The process was automated to create 45 synthetic CT datasets for a given patient CT. For instance, the head rotation was varied between +/- 4 degrees along each axis, and tumor volumes were systematically reduced by up to 30%. Finally, the original CT and deformed synthetic CT were registered using an optical-flow-based DIR. Results: Each synthetic data creation took approximately 28 seconds of computation time. The number of landmarks per dataset varied between two and three thousand. The validation method is able to perform sub-voxel analysis of the DIR and report the results by structure, giving a much more in-depth investigation of the error. Conclusions: We presented a GPU-based high-resolution biomechanical head and neck model to validate DIR algorithms by generating CT-equivalent 3D volumes with simulated posture changes and physiological regression.
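The force loop of such a mass-spring model is a natural GPU kernel; the sketch below (our own illustrative code, not the authors' framework) accumulates Hookean spring forces with one thread per spring:

    // One thread per spring: Hooke's law force on both endpoint masses.
    // springs[s] holds the two vertex indices; restLen[s] the rest length.
    __global__ void springForces(const float3* pos, const int2* springs,
                                 const float* restLen, float3* force,
                                 float k, int nSprings)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nSprings) return;
        float3 a = pos[springs[s].x], b = pos[springs[s].y];
        float3 d = make_float3(b.x - a.x, b.y - a.y, b.z - a.z);
        float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
        float mag = k * (len - restLen[s]) / fmaxf(len, 1e-8f);
        atomicAdd(&force[springs[s].x].x,  mag * d.x);   // pull a toward b
        atomicAdd(&force[springs[s].x].y,  mag * d.y);
        atomicAdd(&force[springs[s].x].z,  mag * d.z);
        atomicAdd(&force[springs[s].y].x, -mag * d.x);   // equal and opposite
        atomicAdd(&force[springs[s].y].y, -mag * d.y);
        atomicAdd(&force[springs[s].y].z, -mag * d.z);
    }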
NASA Astrophysics Data System (ADS)
Wu, Tao; Wu, Zhensen; Linghu, Longxiang
2017-10-01
Study of the characteristics of sea clutter is very important for radar signal processing, detection of targets on the sea surface, and remote sensing. The sea state at low grazing angle (LGA) is complex, and simulation is difficult because of the large irradiated area and the great number of facets involved. A practical and efficient model to obtain the radar clutter of a dynamic sea under different sea conditions is proposed, based on the physical mechanism of interaction between electromagnetic waves and sea waves. The classical analysis of sea clutter is based on amplitude and spectrum distributions, treating the clutter as a random process, which leaves the physical mechanism unclear. To obtain the electromagnetic field from the sea surface, a modified phase from the facets is considered, and the backscattering coefficient is calculated by Wu's improved two-scale model, which can solve the statistical sea backscattering problem at grazing angles below 5 degrees by accounting for the joint probability density of the surface slopes, the shadowing function, the skewness of sea waves, and the curvature of the surface. We assume that the scattering contributions of the facets are independent, so the total field is the superposition of the fields of all facets in the receiving direction. Such data characteristics are very well suited to computation on GPU threads, so we can make the best use of GPU resources. We have achieved a speedup of 155-fold for S band and 162-fold for Ku/X band on the Tesla K80 GPU as compared with an Intel® Core™ CPU. In this paper, we mainly study high-resolution data with a time resolution of one millisecond, giving on the order of 1,000 time points, and we analyze the amplitude probability density distribution of the radar clutter.
Fast 2D flood modelling using GPU technology - recent applications and new developments
NASA Astrophysics Data System (ADS)
Crossley, Amanda; Lamb, Rob; Waller, Simon; Dunning, Paul
2010-05-01
In recent years there has been considerable interest amongst scientists and engineers in exploiting the potential of commodity graphics hardware for desktop parallel computing. The Graphics Processing Units (GPUs) that are used in PC graphics cards have now evolved into powerful parallel co-processors that can be used to accelerate the numerical codes used for floodplain inundation modelling. We report in this paper on experience over the past two years in developing and applying two dimensional (2D) flood inundation models using GPUs to achieve significant practical performance benefits. Starting with a solution scheme for the 2D diffusion wave approximation to the 2D Shallow Water Equations (SWEs), we have demonstrated the capability to reduce model run times in 'real-world' applications using GPU hardware and programming techniques. We then present results from a GPU-based 2D finite volume SWE solver. A series of numerical test cases demonstrate that the model produces outputs that are accurate and consistent with reference results published elsewhere. In comparisons conducted for a real world test case, the GPU-based SWE model was over 100 times faster than the CPU version. We conclude with some discussion of practical experience in using the GPU technology for flood mapping applications, and for research projects investigating use of Monte Carlo simulation methods for the analysis of uncertainty in 2D flood modelling.
GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed
NASA Astrophysics Data System (ADS)
Ziemba, T.; O'Donnell, D.; Carscadden, J.; Cash, M.; Winglee, R.; Harnett, E.
2008-12-01
GPUs are intrinsically highly parallel systems that provide more than an order of magnitude greater computing speed than CPU-based systems, for less cost than a high-end workstation. Recent advancements in GPU technology allow for full IEEE floating-point compliance with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This provides a cheap alternative to standard supercomputing methods and should shorten the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speed-ups of nearly a factor of 20 over equivalent CPU versions of the codes. Such speed-ups enable new applications, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
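For context, a per-particle pusher of the kind such codes parallelize can be sketched in CUDA as below; this uses the standard Boris scheme with uniform E and B fields for brevity and is a generic illustration under those assumptions, not the authors' implementation.

    #include <cuda_runtime.h>

    __device__ float3 add3(float3 a, float3 b) { return make_float3(a.x+b.x, a.y+b.y, a.z+b.z); }
    __device__ float3 mul3(float3 a, float s)  { return make_float3(a.x*s, a.y*s, a.z*s); }
    __device__ float3 cross3(float3 a, float3 b)
    { return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x); }

    // One thread per particle: half electric kick, magnetic rotation, half kick,
    // then a position update (the standard Boris scheme, non-relativistic here).
    __global__ void borisPush(float3* x, float3* v, int n,
                              float3 E, float3 B, float qm, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 vm = add3(v[i], mul3(E, 0.5f * qm * dt));          // half E kick
        float3 t  = mul3(B, 0.5f * qm * dt);
        float  t2 = t.x*t.x + t.y*t.y + t.z*t.z;
        float3 s  = mul3(t, 2.0f / (1.0f + t2));
        float3 vp = add3(vm, cross3(add3(vm, cross3(vm, t)), s)); // rotation about B
        v[i] = add3(vp, mul3(E, 0.5f * qm * dt));                 // half E kick
        x[i] = add3(x[i], mul3(v[i], dt));                        // position update
    }

Because each particle is independent, the kernel is embarrassingly parallel, which is why order-of-magnitude GPU speed-ups are typical for this workload.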
Gai, Jiading; Obeid, Nady; Holtrop, Joseph L.; Wu, Xiao-Long; Lam, Fan; Fu, Maojing; Haldar, Justin P.; Hwu, Wen-mei W.; Liang, Zhi-Pei; Sutton, Bradley P.
2013-01-01
Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks, using a gridding approach to accelerate the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. Through the implementation of optimized gridding into our iterative reconstruction scheme, speed-ups of more than a factor of 200 are obtained in the improved GPU implementation compared to the previous accelerated GPU code. PMID:23682203
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B1 profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic fields compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
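As an illustration of why FDTD maps well to the GPU, each Yee-cell field update depends only on neighboring field values; a hedged CUDA sketch of the Ex-component update follows, with an illustrative array layout and coefficient names that are assumptions rather than this solver's actual code.

    // Illustrative Ex update of a 3D Yee grid: Ex += dt/eps * (dHz/dy - dHy/dz).
    // Launch with gridDim.z = nx so each z-block handles one x-slice.
    #include <cuda_runtime.h>

    #define IDX(i, j, k, ny, nz) ((((i) * (ny)) + (j)) * (nz) + (k))

    __global__ void updateEx(float* Ex, const float* Hy, const float* Hz,
                             int nx, int ny, int nz,
                             float cy, float cz)   // dt/(eps*dy), dt/(eps*dz)
    {
        int i = blockIdx.z;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nx || j < 1 || j >= ny || k < 1 || k >= nz) return;
        Ex[IDX(i, j, k, ny, nz)] +=
            cy * (Hz[IDX(i, j, k, ny, nz)] - Hz[IDX(i, j - 1, k, ny, nz)]) -
            cz * (Hy[IDX(i, j, k, ny, nz)] - Hy[IDX(i, j, k - 1, ny, nz)]);
    }

The other five field components follow the same stencil pattern, so the full time step is a sequence of six such kernels per iteration.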
SU-F-T-256: 4D IMRT Planning Using An Early Prototype GPU-Enabled Eclipse Workstation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hagan, A; Modiri, A; Sawant, A
Purpose: True 4D IMRT planning, based on simultaneous spatiotemporal optimization, has been shown to significantly improve plan quality in lung radiotherapy. However, the high computational complexity associated with such planning represents a significant barrier to widespread clinical deployment. We introduce an early prototype GPU-enabled Eclipse workstation for inverse planning. To our knowledge, this is the first GPU-integrated Eclipse system, demonstrating the potential for clinical translation of GPU computing on a major commercially available TPS. Methods: The prototype system comprised four NVIDIA Tesla K80 GPUs, with a maximum processing capability of 8.5 Tflops per K80 card. The system architecture consisted of three key modules: (i) a GPU-based inverse planning module using a highly parallelizable, swarm intelligence-based global optimization algorithm, (ii) a GPU-based open-source b-spline deformable image registration module, Elastix, and (iii) a CUDA-based data management module. For evaluation, aperture fluence weights in an IMRT plan were optimized over 9 beams, 166 apertures and 10 respiratory phases (14,940 variables) for a lung cancer case (GTV = 95 cc, right lower lobe, 15 mm cranio-caudal motion). The sensitivity of the planning time and memory expense to parameter variations was quantified. Results: GPU-based inverse planning was significantly accelerated compared to its CPU counterpart (36 vs 488 min, for 10 phases, 10 search agents and 10 iterations). The optimized IMRT plan significantly improved OAR sparing compared to the original internal target volume (ITV)-based clinical plan, while maintaining prescribed tumor coverage. The dose-sparing improvements were: esophagus Dmax 50%, heart Dmax 42% and spinal cord Dmax 25%. Conclusion: Our early prototype system demonstrates that, through massive parallelization, computationally intense tasks such as 4D treatment planning can be accomplished in clinically feasible timeframes. With further optimization, such systems are expected to enable the eventual clinical translation of higher-dimensional and complex treatment planning strategies to significantly improve plan quality. This work was partially supported through research funding from the National Institutes of Health (R01CA169102) and Varian Medical Systems, Palo Alto, CA, USA.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Lyakh, Dmitry I.
2015-01-05
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPUs and the use of shared memory on NVidia GPUs. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular attention is paid to higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on the NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
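The 2D matrix transpose below is the textbook shared-memory pattern that such tensor transpose algorithms generalize to N dimensions; it is a sketch of the underlying idea, not the TAL-SH code. A 32x32 tile is staged in shared memory, padded by one column, so that both the read and the write are coalesced and bank conflicts are avoided.

    #include <cuda_runtime.h>

    #define TILE 32   // launch with blockDim = (TILE, TILE)

    __global__ void transpose(const float* in, float* out, int rows, int cols)
    {
        __shared__ float tile[TILE][TILE + 1];          // +1 column avoids bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;        // column index into 'in'
        int y = blockIdx.y * TILE + threadIdx.y;        // row index into 'in'
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;            // column index into 'out'
        y = blockIdx.x * TILE + threadIdx.y;            // row index into 'out'
        if (x < rows && y < cols)
            out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }

For a general tensor transpose the same staging idea applies, but the index arithmetic becomes a permutation of multi-dimensional strides rather than a simple row/column swap.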
Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?
Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend
2011-10-11
In the waste recycling Monte Carlo (WRMC) algorithm,(1) multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well suited for parallel implementation on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.
A Feature-adaptive Subdivision Method for Real-time 3D Reconstruction of Repeated Topology Surfaces
NASA Astrophysics Data System (ADS)
Lin, Jinhua; Wang, Yanjie; Sun, Honghai
2017-03-01
Rendering large numbers of triangles with GPU hardware tessellation has made great progress. However, due to the fixed nature of the GPU pipeline, many off-line methods that perform well cannot meet on-line requirements. In this paper, an optimized feature-adaptive subdivision method is proposed that is better suited to reconstructing surfaces with repeated cusps or creases. An octree primitive is established in irregular regions that share the same sharp vertices or creases, allowing neighboring geometry information to be found quickly. Because the octree primitive and the feature region have the same topology, the octree feature points can match arbitrary vertices in the feature region precisely. Meanwhile, the patches are re-encoded in the octree primitive using a breadth-first strategy, resulting in a meta-table that allows real-time reconstruction by the GPU hardware tessellation unit. Only one feature region needs to be calculated per octree primitive; other regions with the same repeated feature generate their own meta-tables directly, greatly reducing the reconstruction time for this step. For meshes with a large number of repeated topological features, our algorithm improves the subdivision time by 17.575% and increases the average frame drawing time by 0.2373 ms compared to traditional feature-adaptive subdivision (FAS), while the model can be reconstructed in a watertight manner.
Heterogeneous CPU-GPU moving targets detection for UAV video
NASA Astrophysics Data System (ADS)
Li, Maowen; Tang, Linbo; Han, Yuqi; Yu, Chunlei; Zhang, Chao; Fu, Huiquan
2017-07-01
Moving-target detection is gaining popularity in civilian and military applications. On some motion-detection monitoring platforms, low-resolution stationary cameras are being replaced by moving HD cameras mounted on UAVs. The pixels belonging to moving targets in HD video taken by a UAV are always in the minority, and the background of the frame is usually moving because of the motion of the UAV. The high computational cost of the algorithm prevents it from running at full frame resolution. Hence, to solve the problem of moving-target detection in UAV video, we propose a heterogeneous CPU-GPU moving-target detection algorithm. More specifically, we use background registration to eliminate the impact of the moving background and frame differencing to detect small moving targets. To achieve real-time processing, we design a heterogeneous CPU-GPU framework for our method. The experimental results show that our method can detect the main moving targets in HD video taken by a UAV, with an average processing time of 52.16 ms per frame, which is fast enough for the application.
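A minimal CUDA sketch of the frame-differencing stage is shown below, assuming the background registration step has already aligned the two grayscale frames; the kernel name and the fixed threshold are illustrative.

    // Hypothetical frame-difference kernel: flag pixels whose absolute intensity
    // change between two registered frames exceeds a threshold.
    #include <cuda_runtime.h>

    __global__ void frameDiff(const unsigned char* prev, const unsigned char* curr,
                              unsigned char* mask, int nPixels, int threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nPixels) return;
        int d = (int)curr[i] - (int)prev[i];
        if (d < 0) d = -d;                      // absolute difference
        mask[i] = (d > threshold) ? 255 : 0;    // moving-target candidate pixel
    }

Since each pixel is independent, this stage is an ideal GPU offload, leaving the CPU free for the registration and post-processing stages of the heterogeneous pipeline.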
Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.
2011-01-01
Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185
Pelt, Daniël M.; Gürsoy, Doǧa; Palenstijn, Willem Jan; Sijbers, Jan; De Carlo, Francesco; Batenburg, Kees Joost
2016-01-01
The processing of tomographic synchrotron data requires advanced and efficient software to be able to produce accurate results in reasonable time. In this paper, the integration of two software toolboxes, TomoPy and the ASTRA toolbox, which, together, provide a powerful framework for processing tomographic data, is presented. The integration combines the advantages of both toolboxes, such as the user-friendliness and CPU-efficient methods of TomoPy and the flexibility and optimized GPU-based reconstruction methods of the ASTRA toolbox. It is shown that both toolboxes can be easily installed and used together, requiring only minor changes to existing TomoPy scripts. Furthermore, it is shown that the efficient GPU-based reconstruction methods of the ASTRA toolbox can significantly decrease the time needed to reconstruct large datasets, and that advanced reconstruction methods can improve reconstruction quality compared with TomoPy’s standard reconstruction method. PMID:27140167
2010-01-01
Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a starting point for modelers to develop their own GPU implementations, and encourage others to implement their modeling methods on the GPU and to make that code available to the wider community. PMID:20696053
GPU-based optimal control for RWM feedback in tokamaks
Clement, Mitchell; Hanson, Jeremy; Bialek, Jim; ...
2017-08-23
The design and implementation of a Graphics Processing Unit (GPU) based Resistive Wall Mode (RWM) controller to perform feedback control on the RWM using Linear Quadratic Gaussian (LQG) control is reported. The control algorithm is based on a simplified DIII-D VALEN model. By using NVIDIA's GPUDirect RDMA framework, the digitizer and output module are able to write and read directly to and from GPU memory, eliminating memory transfers between host and GPU. The system and algorithm reduced the plasma response excited by externally applied fields by 32% during development experiments.
Ha, S; Matej, S; Ispiryan, M; Mueller, K
2013-02-01
We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Badal, Andreu; Badano, Aldo
Purpose: Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: the use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed-up factor was obtained using a GPU compared to a single-core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
NASA Astrophysics Data System (ADS)
Doulgerakis, Matthaios; Eggebrecht, Adam; Wojtkiewicz, Stanislaw; Culver, Joseph; Dehghani, Hamid
2017-12-01
Parameter recovery in diffuse optical tomography is a computationally expensive algorithm, especially when used for large and complex volumes, as in the case of human brain functional imaging. The modeling of light propagation, also known as the forward problem, is the computational bottleneck of the recovery algorithm, whereby the lack of a real-time solution is impeding practical and clinical applications. The objective of this work is the acceleration of the forward model, within a diffusion approximation-based finite-element modeling framework, employing parallelization to expedite the calculation of light propagation in realistic adult head models. The proposed methodology is applicable for modeling both continuous wave and frequency-domain systems, with the results demonstrating a 10-fold speed increase when GPU architectures are available, while maintaining high accuracy. It is shown that, for a very high-resolution finite-element model of the adult human head with ~600,000 nodes, consisting of heterogeneous layers, light propagation can be calculated at ~0.25 s per excitation source.
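In a finite-element forward model the dominant cost is usually repeated sparse matrix-vector products inside the iterative solver; under that assumption, the kind of kernel being parallelized can be sketched as a scalar CSR SpMV in CUDA (one row per thread; all names illustrative).

    // Scalar CSR sparse matrix-vector product: y = A*x, one matrix row per thread.
    #include <cuda_runtime.h>

    __global__ void spmvCsr(int nRows, const int* rowPtr, const int* colIdx,
                            const float* val, const float* x, float* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];       // accumulate this row's nonzeros
        y[row] = sum;
    }

Vendor libraries such as cuSPARSE provide tuned versions of this operation; the scalar form above simply makes the parallel structure explicit.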
Automatic detection and classification of obstacles with applications in autonomous mobile robots
NASA Astrophysics Data System (ADS)
Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.
2016-04-01
A hardware implementation of the automatic detection and classification of objects that can represent obstacles for an autonomous mobile robot using stereo vision algorithms is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. The method is divided in two parts: the first is an object detection step based on the distance from the objects to the camera and a BLOB analysis; the second is a classification step based on visual primitives and an SVM classifier. The proposed method is performed on a GPU in order to reduce the processing time. This is done with the help of hardware based on multi-core processors and a GPU platform, using an NVIDIA® GeForce® GT640 graphics card and MATLAB on a PC running Windows 10.
fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.
Hung, Ling-Hong; Samudrala, Ram
2014-06-15
fast_protein_cluster is a fast, parallel and memory-efficient package used to cluster 60,000 sets of protein models (with up to 550,000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports root mean square deviation after optimal superposition (RMSD) and template modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom Streaming SIMD Extensions and Advanced Vector Extensions intrinsics accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the MIT license (http://software.compbio.washington.edu/fast_protein_cluster).
GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA
NASA Astrophysics Data System (ADS)
Ren, Qinlong
The lattice Boltzmann method (LBM) has been developed as a powerful numerical approach to simulate complex fluid flow and heat transfer phenomena during the past two decades. As a mesoscale method based on kinetic theory, LBM has several advantages over traditional numerical methods, such as the physical representation of microscopic interactions, the ability to deal with complex geometries, and its highly parallel nature. The lattice Boltzmann method has been applied to various fluid behaviors and heat transfer processes, such as conjugate heat transfer, magnetic and electric fields, diffusion and mixing, chemical reactions, multiphase flow, phase change, non-isothermal flow in porous media, microfluidics, and fluid-structure interactions in biological systems. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) can handle complex or moving geometries in the domain and can be coupled with the lattice Boltzmann method to study heat transfer and fluid flow problems: heat transfer and fluid flow are solved on Euler nodes by LBM, while complex solid geometries are captured by Lagrangian nodes using the immersed boundary method. Parallel computing has been a popular topic for many decades as a way to accelerate computation in engineering and scientific fields. Today, almost all laptops and desktops have central processing units (CPUs) with multiple cores that can be used for parallel computing; however, the cost of CPUs with hundreds of cores is still high, which limits high-performance computing on personal computers. Graphics processing units (GPUs), originally used in computer video cards, have emerged in recent years as powerful high-performance workstations. Unlike CPUs, GPUs with thousands of cores are inexpensive; for example, the GPU used in the current work (GeForce GTX TITAN) has 2688 cores and costs only 1,000 US dollars. The release of NVIDIA's CUDA architecture in 2007, which includes both hardware and a programming environment, made GPU computing attractive. Due to its highly parallel nature, the lattice Boltzmann method has been successfully ported to GPUs with a clear performance benefit in recent years. In the current work, an LBM CUDA code is developed for different fluid flow and heat transfer problems. In this dissertation, the lattice Boltzmann method and the immersed boundary method are used to study natural convection in an enclosure with an array of conducting obstacles, double-diffusive convection in a vertical cavity with Soret and Dufour effects, the PCM melting process in a latent heat thermal energy storage system with internal fins, mixed convection in a lid-driven cavity with a sinusoidal cylinder, and AC electrothermal pumping in microfluidic systems, all on a CUDA computational platform. It is demonstrated that LBM is an efficient method to simulate complex heat transfer problems using GPUs with CUDA.
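To make the parallel structure concrete, below is a hedged CUDA sketch of a D2Q9 BGK collision step, one thread per lattice node, with a structure-of-arrays layout; streaming, boundary conditions and the immersed-boundary coupling discussed above are omitted, and the code is a generic illustration rather than this dissertation's implementation.

    // Generic D2Q9 BGK collision: each thread computes the macroscopic moments of
    // one node and relaxes its nine distributions toward equilibrium.
    #include <cuda_runtime.h>

    __constant__ float w[9]  = { 4.f/9,  1.f/9,  1.f/9,  1.f/9,  1.f/9,
                                 1.f/36, 1.f/36, 1.f/36, 1.f/36 };
    __constant__ int   cx[9] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
    __constant__ int   cy[9] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

    __global__ void bgkCollide(float* f, int nNodes, float omega)
    {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= nNodes) return;
        float rho = 0.f, ux = 0.f, uy = 0.f;
        for (int q = 0; q < 9; ++q) {            // density and momentum moments
            float fq = f[q * nNodes + n];        // structure-of-arrays layout
            rho += fq;  ux += fq * cx[q];  uy += fq * cy[q];
        }
        ux /= rho;  uy /= rho;
        float u2 = ux * ux + uy * uy;
        for (int q = 0; q < 9; ++q) {            // relax toward equilibrium
            float cu  = 3.f * (cx[q] * ux + cy[q] * uy);
            float feq = w[q] * rho * (1.f + cu + 0.5f * cu * cu - 1.5f * u2);
            f[q * nNodes + n] += omega * (feq - f[q * nNodes + n]);
        }
    }

The structure-of-arrays layout keeps the per-direction loads coalesced, which is one of the main reasons LBM ports so well to GPUs.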
Speeding up tsunami wave propagation modeling
NASA Astrophysics Data System (ADS)
Lavrentyev, Mikhail; Romanenko, Alexey
2014-05-01
Trans-oceanic wave propagation is one of the most time/CPU-consuming parts of the tsunami modeling process. The Method Of Splitting Tsunami (MOST) software package, developed at PMEL NOAA USA (Pacific Marine Environmental Laboratory of the National Oceanic and Atmospheric Administration, USA), is widely used to evaluate tsunami parameters. However, it takes up to 5 hours of CPU time to "drive" the wave from Chile (epicenter) to the coast of Japan, even using a rather coarse computational mesh, and accurate wave height prediction requires fine meshes, which leads to a dramatic increase in simulation time. Computation time is a critical parameter, as it takes only about 20 minutes for a tsunami wave to approach the coast of Japan after an earthquake at the Japan trench or Sagami trench (as was the case after the Great East Japan Earthquake on March 11, 2011). MOST numerically solves the hyperbolic system for three unknown functions, namely the velocity vector and the wave height (shallow water approximation). The system can be split into two independent systems along orthogonal directions (splitting method), and each system can be treated independently. This calculation scheme is well suited for SIMD architectures and GPUs. We performed an adaptation of the MOST package to the GPU. Several numerical tests showed a 40x performance gain for an NVIDIA Tesla C2050 GPU versus a single core of an Intel i7 processor. Results of the numerical experiments were compared with other available simulation data. Calculation results obtained on the GPU differ from the reference ones by 10^-3 cm in wave height when simulating 24 hours of wave propagation. This suggests the possibility of developing a real-time system for evaluating tsunami danger.
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing amount of datasets from different domains, such as social media, earth observation and environmental sensing (Li et al., 2013), while CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared with traditional microprocessors, the modern GPU, as a compelling alternative, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, (1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; (2) within a network environment, a variety of computers can be used to build a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; and (3) GPUs, as graphics-targeted devices, are used to greatly improve rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Keywords: Geovisualization, GIScience, Spatiotemporal Studies. References: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394. 2. Li, J., Jiang, Y., Yang, C., Huang, Q., & Rice, M. (2013). Visualizing 3D/4D Environmental Data Using Many-core Graphics Processing Units (GPUs) and Multi-core Central Processing Units (CPUs). Computers & Geosciences, 59(9), 78-89. 3. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879-899.
FPGA Implementation of the Coupled Filtering Method and the Affine Warping Method.
Zhang, Chen; Liang, Tianzhu; Mok, Philip K T; Yu, Weichuan
2017-07-01
In ultrasound image analysis, the speckle tracking methods are widely applied to study the elasticity of body tissue. However, "feature-motion decorrelation" still remains as a challenge for the speckle tracking methods. Recently, a coupled filtering method and an affine warping method were proposed to accurately estimate strain values, when the tissue deformation is large. The major drawback of these methods is the high computational complexity. Even the graphics processing unit (GPU)-based program requires a long time to finish the analysis. In this paper, we propose field-programmable gate array (FPGA)-based implementations of both methods for further acceleration. The capability of FPGAs on handling different image processing components in these methods is discussed. A fast and memory-saving image warping approach is proposed. The algorithms are reformulated to build a highly efficient pipeline on FPGA. The final implementations on a Xilinx Virtex-7 FPGA are at least 13 times faster than the GPU implementation on the NVIDIA graphic card (GeForce GTX 580).
Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ...
2017-04-05
GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.
NASA Astrophysics Data System (ADS)
Townson, Reid W.; Zavgorodni, Sergei
2014-12-01
In GPU-based Monte Carlo simulations for radiotherapy dose calculation, source modelling from a phase-space source can be an efficiency bottleneck. Previously, this has been addressed using phase-space-let (PSL) sources, which provided significant efficiency enhancement. We propose that additional speed-up can be achieved through the use of a hybrid primary photon point source model combined with a secondary PSL source. A novel phase-space derived and histogram-based implementation of this model has been integrated into gDPM v3.0. Additionally, a simple method for approximately deriving target photon source characteristics from a phase-space that does not contain inheritable particle history variables (LATCH) has been demonstrated to succeed in selecting over 99% of the true target photons with only ~0.3% contamination (for a Varian 21EX 18 MV machine). The hybrid source model was tested using an array of open fields for various Varian 21EX and TrueBeam energies, and all cases achieved greater than 97% chi-test agreement (the mean was 99%) above the 2% isodose with 1% / 1 mm criteria. The root mean square deviations (RMSDs) were less than 1%, with a mean of 0.5%, and the source generation time was 4-5 times faster. A seven-field intensity modulated radiation therapy patient treatment achieved 95% chi-test agreement above the 10% isodose with 1% / 1 mm criteria, 99.8% for 2% / 2 mm, a RMSD of 0.8%, and source generation speed-up factor of 2.5. Presented as part of the International Workshop on Monte Carlo Techniques in Medical Physics
Fast Simulation of Dynamic Ultrasound Images Using the GPU.
Storve, Sigurd; Torp, Hans
2017-10-01
Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Botas, P; Heidelberg University, Heidelberg; Grassberger, C
Purpose: To demonstrate the feasibility of fast Monte Carlo (MC) treatment planning and verification using four-dimensional CT (4DCT) for adaptive IMPT for lung cancer patients. Methods: A validated GPU MC code, gPMC, has been linked to the patient database at our institution and employed to compute the dose-influence matrices (Dij) on the planning CT (pCT). The pCT is an average over the respiratory motion of the patient. The Dijs and patient structures were fed to the optimizer to calculate a treatment plan. To validate the plan against motion, a 4D dose distribution averaged over the possible starting phases is calculated using the 4DCT and a model of the time structure of the delivered spot map. The dose is accumulated using vector maps created by a GPU-accelerated deformable image registration program (DIR) from each phase of the 4DCT to the reference phase using the B-spline method. Calculation of the Dij matrices and the DIR are performed on a cluster, with each field and vector map calculated in parallel. Results: The Dij production takes ∼3.5 s per beamlet for 10e6 protons, depending on the energy and the CT size. Generating a plan with 4D simulation of 1000 spots in 4 fields takes approximately 1 h. To test the framework, IMPT plans for 10 lung cancer patients were generated for validation. Differences between the planned and the delivered dose of 19% in dose to some organs at risk and 1.4/21.1% in target mean dose/homogeneity with respect to the plan were observed, suggesting potential for improvement if adaptation is considered. Conclusion: A fast MC treatment planning framework has been developed that allows reliable plan design and verification for mobile targets and adaptation of treatment plans. This will significantly impact treatments for lung tumors, as 4D-MC dose calculations can now become part of planning strategies.
GPU-based High-Performance Computing for Radiation Therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-01-01
Recent developments in radiation therapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid development. A tremendous number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we first give a brief introduction to the GPU hardware structure and programming model. We then review the current applications of GPUs in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPUs with other platforms is also presented. PMID:24486639
Real-time optical flow estimation on a GPU for a skid-steered mobile robot
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2016-04-01
Accurate egomotion estimation is required for mobile robot navigation. The egomotion is often estimated using optical flow algorithms, but for an accurate estimation of optical flow most modern algorithms require large memory resources and high processor speed, which the simple single-board computers that control the motion of a robot usually do not provide. On the other hand, most modern single-board computers are equipped with an embedded GPU that can be used in parallel with the CPU to improve the performance of the optical flow estimation algorithm. This paper presents a new Z-flow algorithm for efficient computation of optical flow using an embedded GPU. The algorithm is based on phase correlation optical flow estimation and provides real-time performance on a low-cost embedded GPU. A layered optical flow model is used, with layer segmentation performed using a graph-cut algorithm with a time-derivative-based energy function. This approach makes the algorithm both fast and robust in low-light and low-texture conditions. The algorithm's implementation for a Raspberry Pi Model B computer is discussed. For evaluation, the computer was mounted on a Hercules skid-steered mobile robot equipped with a monocular camera. The evaluation was performed using hardware-in-the-loop simulation and experiments with the Hercules mobile robot, and the algorithm was also evaluated on the KITTI Optical Flow 2015 dataset. The resulting endpoint error of the optical flow calculated with the developed algorithm was low enough for navigation of the robot along the desired trajectory.
EvoGraph: On-The-Fly Efficient Mining of Evolving Graphs on GPU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sengupta, Dipanjan; Song, Shuaiwen
With the prevalence of the World Wide Web and social networks, there has been a growing interest in high-performance analytics for constantly-evolving dynamic graphs. Modern GPUs provide a massive amount of parallelism for efficient graph processing, but challenges remain due to their lack of support for the near real-time streaming nature of dynamic graphs. Specifically, due to the current high volume and velocity of graph data combined with the complexity of user queries, traditional processing methods that first store the updates and then repeatedly run static graph analytics on a sequence of versions or snapshots are deemed undesirable and computationally infeasible on the GPU. We present EvoGraph, a highly efficient and scalable GPU-based dynamic graph analytics framework.
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient GPU implementation of an unsupervised linear unmixing method, the simplex identification via split augmented Lagrangian (SISAL), using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems, using variable splitting to obtain a constrained formulation and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced memory accesses. The results presented herein indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
2012-01-01
Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on accelerators such as field-programmable gate arrays (FPGAs) and graphics processing units (GPUs). To the best of our knowledge, no implementation combines both CPUs and accelerators, such as GPUs, to accelerate Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with methods specialized for the CPU and GPU architectures. A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
A 3D front tracking method on a CPU/GPU system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bo, Wurigen; Grove, John
2011-01-21
We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
Optimized Laplacian image sharpening algorithm based on graphic processing unit
NASA Astrophysics Data System (ADS)
Ma, Tinghuai; Li, Lu; Ji, Sai; Wang, Xin; Tian, Yuan; Al-Dhelaan, Abdullah; Al-Rodhaan, Mznah
2014-12-01
In classical Laplacian image sharpening, all pixels are processed one by one, which leads to a large amount of computation. Traditional Laplacian sharpening on a CPU is considerably time-consuming, especially for large images. In this paper, we propose a parallel implementation of Laplacian sharpening based on the Compute Unified Device Architecture (CUDA), a computing platform for graphics processing units (GPUs), and analyze the impact of image size on performance as well as the relationship between data-transfer time and parallel computing time. Further, according to the features of the different memory types, an improved scheme of our method is developed that exploits shared memory on the GPU instead of global memory, further increasing efficiency. Experimental results prove that the two novel algorithms outperform the traditional sequential method based on OpenCV in terms of computing speed.
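A minimal global-memory version of such a kernel is sketched below, one thread per pixel applying the 4-neighbour Laplacian and subtracting it from the original value; the shared-memory variant the paper advocates would first stage an image tile, but the arithmetic is the same. The kernel name and the unsigned 8-bit image format are assumptions.

    // One thread per interior pixel: compute the 4-neighbour Laplacian and
    // subtract it from the original value (sharpened = original - Laplacian).
    #include <cuda_runtime.h>

    __global__ void laplacianSharpen(const unsigned char* in, unsigned char* out,
                                     int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;
        int c = y * width + x;
        int lap = in[c - 1] + in[c + 1] + in[c - width] + in[c + width] - 4 * in[c];
        int v = (int)in[c] - lap;
        out[c] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));  // clamp to 8 bits
    }

Because every pixel in the global-memory version reads its four neighbours from DRAM, staging tiles in shared memory cuts redundant loads, which is the improvement the paper measures.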
Implementation of EAM and FS potentials in HOOMD-blue
NASA Astrophysics Data System (ADS)
Yang, Lin; Zhang, Feng; Travesset, Alex; Wang, Caizhuang; Ho, Kaiming
HOOMD-blue is a general-purpose software to perform classical molecular dynamics simulations entirely on GPUs. We provide full support for EAM and FS type potentials in HOOMD-blue, and report accuracy and efficiency benchmarks, including comparisons with the LAMMPS GPU package. Two problems were selected to test the accuracy: the determination of the glass transition temperature of Cu64.5Zr35.5 alloy using an FS potential and the calculation of pair distribution functions of Ni3Al using an EAM potential. In both cases, the results using HOOMD-blue are indistinguishable from those obtained by the GPU package in LAMMPS within statistical uncertainties. As tests for time efficiency, we benchmark time-steps per second using LAMMPS GPU and HOOMD-blue on one NVIDIA Tesla GPU. Compared to our typical LAMMPS simulations on one CPU cluster node which has 16 CPUs, LAMMPS GPU can be 3-3.5 times faster, and HOOMD-blue can be 4-5.5 times faster. We acknowledge the support from Laboratory Directed Research and Development (LDRD) of Ames Laboratory.
A New Parallel Approach for Accelerating the GPU-Based Execution of Edge Detection Algorithms
Emrani, Zahra; Bateni, Soroosh; Rabbani, Hossein
2017-01-01
Real-time image processing is used in a wide variety of applications, such as those in medical care and industrial processes. In medical care, this technique can display important patient information graphically, which can supplement and help the treatment process. Medical decisions made based on real-time images are more accurate and reliable. According to recent research, graphics processing unit (GPU) programming is a useful method for improving the speed and quality of medical image processing and is one way of achieving real-time image processing. Edge detection is an early stage in most image processing methods for the extraction of features and object segments from a raw image. The Canny method, the Sobel and Prewitt filters, and the Roberts' Cross technique are examples of edge detection algorithms that are widely used in image processing and machine vision. In this work, these algorithms are implemented using the Compute Unified Device Architecture (CUDA), Open Source Computer Vision (OpenCV), and Matrix Laboratory (MATLAB) platforms. An existing parallel method for the Canny approach has been modified further to run in a fully parallel manner by replacing the breadth-first search procedure with a parallel method. These algorithms have been compared by testing them on a database of optical coherence tomography images. The comparison of results shows that the proposed implementation of the Canny method on the GPU using the CUDA platform improves the speed of execution by 2–100× compared to the central processing unit-based implementation using the OpenCV and MATLAB platforms. PMID:28487831
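As a concrete example of the per-pixel parallelism these filters expose, here is a hedged CUDA sketch of a Sobel gradient-magnitude kernel (one thread per interior pixel, 8-bit grayscale assumed); it illustrates the general pattern rather than the paper's exact code.

    // One thread per interior pixel: 3x3 Sobel gradients and their magnitude.
    #include <cuda_runtime.h>

    __global__ void sobel(const unsigned char* in, unsigned char* out,
                          int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;
        int i = y * width + x;
        int gx = -in[i - width - 1] - 2 * in[i - 1] - in[i + width - 1]
               +  in[i - width + 1] + 2 * in[i + 1] + in[i + width + 1];
        int gy = -in[i - width - 1] - 2 * in[i - width] - in[i - width + 1]
               +  in[i + width - 1] + 2 * in[i + width] + in[i + width + 1];
        float mag = sqrtf((float)(gx * gx + gy * gy));
        out[i] = (unsigned char)(mag > 255.f ? 255 : mag);   // clamp to 8 bits
    }

The Sobel and Prewitt stages are fully data-parallel; it is the later Canny stages, such as hysteresis tracking, that require the parallel restructuring the paper describes.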
Musrfit-Real Time Parameter Fitting Using GPUs
NASA Astrophysics Data System (ADS)
Locans, Uldis; Suter, Andreas
High transverse field μSR (HTF-μSR) experiments typically lead to rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome the limited computing resources, the rotating reference frame (RRF) transformation is often used to reduce the data sets that need to be handled. This comes at a price the μSR community is typically not aware of: (i) due to the RRF transformation, the fitting parameter estimates are of poorer precision, i.e., more extended, expensive beamtime is needed; (ii) the RRF introduces systematic errors which hamper the statistical interpretation of χ2 or the maximum log-likelihood. We briefly discuss these issues in a non-exhaustive, practical way. The single purpose of the RRF transformation is to compensate for sluggish computing power. Therefore, during this work GPU (Graphical Processing Unit) based fitting was developed, which allows real-time full data analysis without the RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture, they provide the opportunity to accelerate many applications at considerably less cost than upgrading CPU computational power. With the emergence of frameworks such as CUDA and OpenCL, these devices have become more easily programmable. During this work, GPU support was added to Musrfit, a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time-consuming parts of the calculations to Nvidia or AMD GPUs. Using the current CPU implementation in Musrfit, parameter fitting can take hours for certain data sets, while the GPU version allows real-time data analysis on the same data sets. This work describes the challenges that arise in adding GPU support to Musrfit as well as the results obtained using the GPU version. Speedups were measured against the CPU implementation using two different GPUs: a high-end Nvidia Tesla K40c designed for HPC applications and an AMD Radeon R9 390X designed for the gaming market.
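The expensive part of such fitting is evaluating the fit statistic over millions of histogram bins for every parameter trial. The sketch below is not Musrfit code; it computes a chi-square for an assumed damped-oscillation model with a shared-memory block reduction, launched with 256 threads per block.

```cuda
// chi^2 = sum_i ((data_i - model(t_i)) / err_i)^2, accumulated on the GPU.
#include <cuda_runtime.h>
#include <math.h>

__global__ void chi2Kernel(const float* data, const float* err, int n,
                           float A, float lambda, float omega, float phi,
                           float dt, float* chi2)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = 0.0f;
    if (i < n) {
        float t = i * dt;
        float model = A * expf(-lambda * t) * cosf(omega * t + phi);
        float d = (data[i] - model) / err[i];
        r = d * d;
    }
    s[threadIdx.x] = r;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(chi2, s[0]);    // one atomic per block
}
```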
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
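The stream-processing pattern described maps one photon to one thread, each integrating its own trajectory independently. The sketch below shows only that pattern: the state layout is an assumption, the right-hand side is a placeholder (a real tracer evaluates the Kerr geodesic equations there), and a midpoint step stands in for GRay's actual integrator.

```cuda
#include <cuda_runtime.h>

struct Photon { float x[4]; float k[4]; };  // position and wave vector

__device__ void rhs(const Photon& p, Photon& d)
{
    // Placeholder dynamics; a real tracer evaluates the geodesic equations.
    for (int j = 0; j < 4; ++j) { d.x[j] = p.k[j]; d.k[j] = 0.0f; }
}

__global__ void stepPhotons(Photon* ph, int n, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                     // one thread advances one photon
    Photon p = ph[i], k1, k2, mid;
    rhs(p, k1);
    mid = p;
    for (int j = 0; j < 4; ++j) { mid.x[j] += 0.5f * h * k1.x[j];
                                  mid.k[j] += 0.5f * h * k1.k[j]; }
    rhs(mid, k2);                           // midpoint (RK2) update
    for (int j = 0; j < 4; ++j) { p.x[j] += h * k2.x[j];
                                  p.k[j] += h * k2.k[j]; }
    ph[i] = p;                              // coalesced write-back
}
```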
NASA Astrophysics Data System (ADS)
Wang, Tai-Han; Huang, Da-Nian; Ma, Guo-Qing; Meng, Zhao-Hai; Li, Ye
2017-06-01
With the continuous development of full tensor gradiometer (FTG) measurement techniques, three-dimensional (3D) inversion of FTG data is becoming increasingly used in oil and gas exploration. For fast processing and interpretation of large-scale high-precision data, the use of the graphics processing unit (GPU) and preconditioning methods is very important in the data inversion. In this paper, an improved preconditioned conjugate gradient algorithm is proposed by combining the symmetric successive over-relaxation (SSOR) technique and the incomplete Cholesky decomposition conjugate gradient algorithm (ICCG). Since preparing the preconditioner requires extra time, a parallel implementation based on the GPU is proposed. The improved method is then applied in the inversion of noise-contaminated synthetic data to prove its adaptability for the inversion of 3D FTG data. Results show that the parallel SSOR-ICCG algorithm based on an NVIDIA Tesla C2050 GPU achieves a speedup of approximately 25 times over a serial program using a 2.0 GHz Central Processing Unit (CPU). Real airborne gravity-gradiometry data from the Vinton salt dome (southwest Louisiana, USA) are also considered. Good results are obtained, which verifies the efficiency and feasibility of the proposed parallel method for fast inversion of 3D FTG data.
Kantardjiev, Alexander A
2015-04-05
A cluster of strongly interacting ionization groups in protein molecules with irregular ionization behavior is suggestive of a specific structure-function relationship. However, their computational treatment is unconventional (e.g., lack of convergence in a naive self-consistent iterative algorithm). Stringent treatment requires evaluation of Boltzmann-averaged statistical mechanics sums and estimation of the electrostatic energy for each microstate. irGPU (Irregular strong interactions in proteins - a GPU solver) is a novel solution to a versatile problem in protein biophysics: the atypical protonation behavior of coupled groups. The computational severity of the problem is alleviated by parallelization (via GPU kernels), which is applied to the electrostatic interaction evaluation (including explicit electrostatics via the fast multipole method) as well as to the estimation of statistical mechanics sums (partition function). Special attention is given to ease of use and the encapsulation of theoretical details without sacrificing the rigor of the computational procedures. irGPU is not just a solution-in-principle but a promising practical application with the potential to entice the community into a deeper understanding of the principles governing biomolecular mechanisms. © 2015 Wiley Periodicals, Inc.
Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste
2013-03-01
Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
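Index permutation, the first challenge named above, is purely memory-bound, and the standard remedy is to stage tiles through shared memory so that both the read and the write are coalesced. The kernel below illustrates the idea for the simplest permutation, a 2D transpose; it is a generic sketch, not the code the tool generates.

```cuda
#include <cuda_runtime.h>

#define TILE 32

__global__ void transpose(const double* in, double* out, int rows, int cols)
{
    __shared__ double tile[TILE][TILE + 1];     // +1 pad avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;        // origin of transposed block
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```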
GPU-Powered Coherent Beamforming
NASA Astrophysics Data System (ADS)
Magro, A.; Adami, K. Zarb; Hickish, J.
2015-03-01
Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ˜1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10× faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
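At its core a coherent beamformer is, for every beam and time sample, a weighted complex sum over antennas. A minimal CUDA sketch of that inner operation follows; the array layouts and names are assumptions, not the BEST-2 pipeline code.

```cuda
#include <cuComplex.h>

__global__ void beamform(const cuFloatComplex* x,   // input:   [ant][time]
                         const cuFloatComplex* w,   // weights: [beam][ant]
                         cuFloatComplex* y,         // output:  [beam][time]
                         int nant, int ntime, int nbeam)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int b = blockIdx.y;
    if (t >= ntime || b >= nbeam) return;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int a = 0; a < nant; ++a)                  // phase-and-sum
        acc = cuCaddf(acc, cuCmulf(w[b * nant + a], x[a * ntime + t]));
    y[b * ntime + t] = acc;
}
```

A tuned kernel would stage the antenna samples through shared memory so several beams reuse each load; the sketch keeps only the arithmetic pattern.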
A GPU-Accelerated Parameter Interpolation Thermodynamic Integration Free Energy Method.
Giese, Timothy J; York, Darrin M
2018-03-13
There has been a resurgence of interest in free energy methods motivated by the performance enhancements offered by molecular dynamics (MD) software written for specialized hardware, such as graphics processing units (GPUs). In this work, we exploit the properties of a parameter-interpolated thermodynamic integration (PI-TI) method to connect states by their molecular mechanical (MM) parameter values. This pathway is shown to be better behaved for Mg2+ → Ca2+ transformations than traditional linear alchemical pathways (with and without soft-core potentials). The PI-TI method has the practical advantage that no modification of the MD code is required to propagate the dynamics, and unlike with linear alchemical mixing, only one electrostatic evaluation is needed (e.g., a single call to particle-mesh Ewald), leading to better performance. In the case of AMBER, this enables all the performance benefits of GPU acceleration to be realized, in addition to unlocking the full spectrum of features available within the MD software, such as Hamiltonian replica exchange (HREM). The TI derivative evaluation can be accomplished efficiently in a post-processing step by reanalyzing the statistically independent trajectory frames in parallel for high throughput. We also show how one can evaluate the particle-mesh Ewald contribution to the TI derivative evaluation without needing to perform two reciprocal-space calculations. We apply the PI-TI method with HREM on GPUs in AMBER to predict pKa values in double-stranded RNA molecules and make comparison with experiments. Convergence to under 0.25 units for these systems required 100 ns or more of sampling per window and coupling of windows with HREM. We find that MM charges derived from ab initio QM/MM fragment calculations improve the agreement between calculation and experimental results.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Jian; Kim, Minho; Peters, Jorg
2009-12-15
Purpose: Rigid 2D-3D registration is an alternative to 3D-3D registration for cases where largely bony anatomy can be used for patient positioning in external beam radiation therapy. In this article, the authors evaluated seven similarity measures for use in intensity-based rigid 2D-3D registration using a variation of Skerl's similarity measure evaluation protocol. Methods: The seven similarity measures are partitioned intensity uniformity, normalized mutual information (NMI), normalized cross correlation (NCC), entropy of the difference image, pattern intensity (PI), gradient correlation (GC), and gradient difference (GD). In contrast to traditional evaluation methods that rely on visual inspection or registration outcomes, the similarity measure evaluation protocol probes the transform parameter space and computes a number of similarity measure properties, which is objective and optimization-method independent. The variation of the protocol offers an improved property in the quantification of the capture range. The authors used this protocol to investigate the effects of the downsampling ratio, the region of interest, and the method of the digitally reconstructed radiograph (DRR) calculation [i.e., the incremental ray-tracing method implemented on a central processing unit (CPU) or the 3D texture rendering method implemented on a graphics processing unit (GPU)] on the performance of the similarity measures. The studies were carried out using both the kilovoltage (kV) and the megavoltage (MV) images of an anthropomorphic cranial phantom and the MV images of a head-and-neck cancer patient. Results: Both the phantom and the patient studies showed that 2D-3D registration using the GPU-based DRR calculation yielded better robustness, while providing similar accuracy, compared to the CPU-based calculation. The phantom study using kV imaging suggested that NCC has the best accuracy and robustness, but its slow function value change near the global maximum requires a stricter termination condition for an optimization method. The phantom study using MV imaging indicated that PI, GD, and GC have the best accuracy, while NCC and NMI have the best robustness. The clinical study using MV imaging showed that NCC and NMI have the best robustness. Conclusions: The authors evaluated the performance of seven similarity measures for use in 2D-3D image registration using the variation of Skerl's similarity measure evaluation protocol. The generalized methodology can be used to select the best similarity measures, determine the optimal or near-optimal choice of parameters, and choose the appropriate registration strategy for end users in their specific registration applications in medical imaging.
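As a flavor of what a GPU similarity evaluation looks like, the sketch below accumulates the five sufficient statistics of NCC between a DRR and a reference image in one pass; the naive per-pixel atomics would be replaced by block reductions in production code, and all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// sums[0..4] = {sum a, sum b, sum a*a, sum b*b, sum a*b}; zero before launch.
__global__ void nccSums(const float* a, const float* b, int n, float* sums)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    atomicAdd(&sums[0], a[i]);
    atomicAdd(&sums[1], b[i]);
    atomicAdd(&sums[2], a[i] * a[i]);
    atomicAdd(&sums[3], b[i] * b[i]);
    atomicAdd(&sums[4], a[i] * b[i]);
}

// Host side: NCC = (n*Sab - Sa*Sb) / sqrt((n*Saa - Sa*Sa) * (n*Sbb - Sb*Sb)).
```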
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU
NASA Astrophysics Data System (ADS)
Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha
2018-03-01
Since the Graphics Processing Unit (GPU) offers strong floating-point compute capability and high memory bandwidth for data parallelism, it has been widely used in areas of general-purpose computing such as molecular dynamics (MD), computational fluid dynamics (CFD), and so on. The emergence of the compute unified device architecture (CUDA), which reduces the complexity of program development, brings great opportunities to CFD. There are three different modes for the parallel solution of the Navier-Stokes (NS) equations: a parallel solver based on the CPU, a parallel solver based on the GPU, and a heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, and CPUs are the opposite. To make full use of both, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver's computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experimental results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow but still reaches more than 20. Moreover, the speedup increases as the grid size becomes larger.
Szécsi, László; Kacsó, Ágota; Zeck, Günther; Hantz, Péter
2017-01-01
Light stimulation with precise and complex spatial and temporal modulation is demanded by a number of research fields, including visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus-generating framework (GEARS: GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, so intermediary libraries for interfacing could be omitted. This enables the program to perform computationally demanding tasks such as en masse random number generation or real-time image processing with local and global operations.
Zhang, Dong; Zhang, Yongshun; Zheng, Guimei; Feng, Cunqian; Tang, Jun
2017-10-26
In this paper, we focus on the problem of two-dimensional direction of arrival (2D-DOA) estimation for monostatic MIMO radar with electromagnetic vector received sensors (MIMO-EMVSs) under the condition of gain and phase uncertainties (GPU) and mutual coupling (MC). GPU would spoil the invariance property of the EMVSs in MIMO-EMVSs, so the effective ESPRIT algorithm cannot be used directly. We therefore put forward a C-SPD ESPRIT-like algorithm. It estimates the 2D-DOA and polarization station angle (PSA) based on the instrumental sensors method (ISM). The C-SPD ESPRIT-like algorithm can obtain good angle estimation accuracy without knowing the GPU. Furthermore, it can be applied to arbitrary array configurations and has low complexity because it avoids the angle-searching procedure. When MC and GPU exist together between the elements of the EMVSs, in order to make our algorithm feasible, we derive a class of separated electromagnetic vector receivers and give the S-SPD ESPRIT-like algorithm, which solves the problem of GPU and MC efficiently. Again, the array configuration can be arbitrary. The effectiveness of our proposed algorithms is verified by the simulation results.
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration
NASA Astrophysics Data System (ADS)
Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.
2017-03-01
In order to satisfy the research needs of the Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and study the key technology of the planned QiTai radio Telescope (QTT), the receiver group of XAO developed a GPU (Graphics Processing Unit) based real-time FRB searching algorithm from the original CPU (Central Processing Unit) based FRB searching algorithm, and built the FRB real-time searching system. The comparison of the GPU system and the CPU system shows that, while maintaining the accuracy of the search, the GPU-accelerated algorithm is 35-45 times faster than the CPU algorithm.
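The abstract does not detail the algorithm, but the dominant cost in an FRB search is incoherent dedispersion over many trial dispersion measures (DMs), which parallelizes naturally as one thread per (DM, output sample) pair. The brute-force sketch below is an illustration under that assumption, not the XAO code.

```cuda
#include <cuda_runtime.h>

__global__ void dedisperse(const float* spectra,   // [chan][time]
                           const int* delay,       // [dm][chan], in samples
                           float* series,          // [dm][time]
                           int nchan, int ntime, int ndm)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int d = blockIdx.y;
    if (t >= ntime || d >= ndm) return;
    float acc = 0.0f;
    for (int c = 0; c < nchan; ++c) {
        int ts = t + delay[d * nchan + c];          // undo dispersion sweep
        if (ts < ntime) acc += spectra[c * ntime + ts];
    }
    series[d * ntime + t] = acc;                    // candidate time series
}
```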
gWEGA: GPU-accelerated WEGA for molecular superposition and shape comparison.
Yan, Xin; Li, Jiabo; Gu, Qiong; Xu, Jun
2014-06-05
Virtual screening of a large chemical library for drug lead identification requires searching/superimposing a large number of three-dimensional (3D) chemical structures. This article reports a graphics processing unit (GPU)-accelerated weighted Gaussian algorithm (gWEGA) that expedites shape or shape-feature similarity score-based virtual screening. With 86 GPU nodes (each node has one GPU card), gWEGA can screen 110 million conformations derived from the entire ZINC drug-like database with diverse antidiabetic agents as query structures within 2 s (i.e., screening more than 55 million conformations per second). The rapid screening speed was accomplished through massive parallelization on multiple GPU nodes and rapid prescreening of 3D structures (based on their shape descriptors and pharmacophore feature compositions). Copyright © 2014 Wiley Periodicals, Inc.
Real-time global illumination on mobile device
NASA Astrophysics Data System (ADS)
Ahn, Minsu; Ha, Inwoo; Lee, Hyong-Euk; Kim, James D. K.
2014-02-01
We propose a novel method for real-time global illumination on mobile devices. Our approach is based on instant radiosity, which uses a sequence of virtual point lights to represent the effect of indirect illumination. Our rendering process consists of three stages. With the primary light, the first stage generates local illumination with the shadow map on the GPU. The second stage of the global illumination uses the reflective shadow map on the GPU and generates the sequence of virtual point lights on the CPU. Finally, we use the splatting method of Dachsbacher et al. [1] and add the indirect illumination to the local illumination on the GPU. With the limited computing resources of mobile devices, only a small number of virtual point lights are allowed for real-time rendering. Our approach uses a multi-resolution sampling method that considers 3D geometry and attributes simultaneously, reducing the total number of virtual point lights. We also use a hybrid strategy, which collaboratively combines the CPUs and GPUs available in a mobile SoC, due to the limited computing resources of mobile devices. Experimental results demonstrate the global illumination performance of the proposed method.
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs
NASA Astrophysics Data System (ADS)
Niemeyer, Kyle E.; Sung, Chih-Jen
2014-01-01
The chemical kinetics ODEs arising from operator-split reactive-flow simulations were solved on GPUs using explicit integration algorithms. Nonstiff chemical kinetics of a hydrogen oxidation mechanism (9 species and 38 irreversible reactions) were computed using the explicit fifth-order Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster than single- and six-core CPU versions by factors of 126 and 25, respectively, for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane (53 species and 634 irreversible reactions) oxidation, were computed using the stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. The GPU-based RKC implementation demonstrated an increase in performance of nearly 59 and 10 times, for problem sizes consisting of 262,144 ODEs and larger, than the single- and six-core CPU-based RKC algorithms using the hydrogen/carbon-monoxide mechanism. With the methane mechanism, RKC-GPU performed more than 65 and 11 times faster, for problem sizes consisting of 131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up to 57 times faster than the six-core CPU-based implicit VODE algorithm on 65,536 ODEs. In the presence of more severe stiffness, such as ethylene oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a larger time step size, RKC-GPU performed at best 2.5 times slower than six-core VODE for 8192 ODEs and larger. Therefore, the need for developing new strategies for integrating stiff chemistry on GPUs was discussed.
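The pattern behind these speedups is one thread per independent ODE system, so hundreds of thousands of cells integrate concurrently. The sketch below shows only that pattern: dydt() is placeholder kinetics rather than a real mechanism, and a fixed-step explicit midpoint method stands in for the RKC scheme.

```cuda
#include <cuda_runtime.h>

#define NSP 9   // state size; 9 species as in the hydrogen mechanism above

__device__ void dydt(const float* y, float* f)
{
    for (int k = 0; k < NSP; ++k) f[k] = -y[k];     // placeholder kinetics
}

__global__ void integrate(float* Y, int nsys, float h, int nsteps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nsys) return;                          // one thread, one system
    float y[NSP], f1[NSP], f2[NSP], ymid[NSP];
    for (int k = 0; k < NSP; ++k) y[k] = Y[i * NSP + k];
    for (int s = 0; s < nsteps; ++s) {
        dydt(y, f1);
        for (int k = 0; k < NSP; ++k) ymid[k] = y[k] + 0.5f * h * f1[k];
        dydt(ymid, f2);                             // explicit midpoint step
        for (int k = 0; k < NSP; ++k) y[k] += h * f2[k];
    }
    for (int k = 0; k < NSP; ++k) Y[i * NSP + k] = y[k];
}
```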
Rohl, Sebastian; Bodenstedt, Sebastian; Suwelack, Stefan; Dillmann, Rudiger; Speidel, Stefanie; Kenngott, Hannes; Muller-Stich, Beat P
2012-03-01
In laparoscopic surgery, soft tissue deformations substantially change the surgical site, thus impeding the use of preoperative planning during intraoperative navigation. Extracting depth information from endoscopic images and building a surface model of the surgical field of view is one way to represent this constantly deforming environment. The information can then be used for intraoperative registration. Stereo reconstruction is a typical problem within computer vision. However, most of the available methods do not fulfill the specific requirements of a minimally invasive setting, such as the need for real-time performance, the problem of view-dependent specular reflections, and large curved areas with partly homogeneous or periodic textures and occlusions. In this paper, the authors present an approach toward intraoperative surface reconstruction based on stereo endoscopic images. The authors describe their answer to this problem through correspondence analysis, disparity correction and refinement, 3D reconstruction, point cloud smoothing, and meshing. Real-time performance is achieved by implementing the algorithms on the GPU. The authors also present a new hybrid CPU-GPU algorithm that unifies the advantages of the CPU and the GPU versions. In a comprehensive evaluation using in vivo data, in silico data from the literature, and virtual data from a newly developed simulation environment, the CPU, GPU, and hybrid CPU-GPU versions of the surface reconstruction are compared to a CPU and a GPU algorithm from the literature. The recommended approach toward intraoperative surface reconstruction can be conducted in real time depending on the image resolution (20 fps for the GPU and 14 fps for the hybrid CPU-GPU version at a resolution of 640 × 480). It is robust to homogeneous regions without texture, large image changes, noise, and errors from camera calibration, and it reconstructs the surface down to sub-millimeter accuracy. In all the experiments within the simulation environment, the mean distance to ground truth data is between 0.05 and 0.6 mm for the hybrid CPU-GPU version. The hybrid CPU-GPU algorithm clearly outperforms its CPU and GPU counterparts (mean distance reduction of 26% and 45%, respectively, for the experiments in the simulation environment). The recommended approach for surface reconstruction is fast, robust, and accurate. It can represent changes in the intraoperative environment and can be used to adapt a preoperative model within the surgical site by registration of these two models.
NASA Astrophysics Data System (ADS)
Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
2017-05-01
In a heterogeneous multi-core architecture, the CPU and GPU are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU applications and GPU applications execute concurrently, both accessing the last-level cache. CPU and GPU have different memory access characteristics, so they differ in their sensitivity to last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. By contrast, GPU applications can tolerate an increase in memory access latency when there is sufficient thread-level parallelism. Taking into account this latency tolerance of GPU programs, this paper presents a method that lets GPU applications access memory directly, bypassing the LLC and leaving most of the LLC space to CPU applications, thereby improving the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is cache insensitive, the overall performance of the system is improved significantly.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
2012-01-01
Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Arrays (FPGAs) and Graphics Processing Units (GPUs). To the best of our knowledge, no implementation combines both the CPU and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for cooperative parallel execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for the CPU and GPU architectures. Conclusions: A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture, implemented with CUDA on NVIDIA's GPU. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose several optimization methods, including multiresolution multiwindow motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a 20× speedup over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB when keeping the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute-intensive algorithms are proportional to the computational power of the GPU. However, the performance of the control-intensive parts (CAVLC) is strongly related to memory bandwidth, which gives an insight for new architecture designs. PMID:24757432
NASA Astrophysics Data System (ADS)
Hernandez, Monica
2017-12-01
This paper proposes a method for primal-dual convex optimization in variational large deformation diffeomorphic metric mapping problems formulated with robust regularizers and robust image similarity metrics. The method is based on the Chambolle and Pock primal-dual algorithm for solving general convex optimization problems. Diagonal preconditioning is used to ensure the convergence of the algorithm to the global minimum. We consider three robust regularizers liable to provide acceptable results in diffeomorphic registration: Huber, V-Huber, and total generalized variation. The Huber norm is used in the image similarity term. The primal-dual equations are derived for the stationary and the non-stationary parameterizations of diffeomorphisms. The resulting algorithms have been implemented for running on the GPU using CUDA. For the most memory-consuming methods, we have developed a multi-GPU implementation. The GPU implementations allowed us to perform an exhaustive evaluation study on the NIREP and LPBA40 databases. The experiments showed that, for all the considered regularizers, the proposed method converges to diffeomorphic solutions while better preserving discontinuities at the boundaries of the objects compared to baseline diffeomorphic registration methods. In most cases, the evaluation showed a competitive performance for the robust regularizers, close to the performance of the baseline diffeomorphic registration methods.
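For reference, the Chambolle-Pock iteration on which the method builds, stated for a generic problem min_x F(Kx) + G(x) (a textbook form, not transcribed from the paper):

```latex
y^{n+1} = \operatorname{prox}_{\sigma F^{*}}\left(y^{n} + \sigma K \bar{x}^{n}\right), \qquad
x^{n+1} = \operatorname{prox}_{\tau G}\left(x^{n} - \tau K^{*} y^{n+1}\right), \qquad
\bar{x}^{n+1} = x^{n+1} + \theta\left(x^{n+1} - x^{n}\right)
```

Diagonal preconditioning replaces the scalar steps σ and τ with diagonal matrices built from the row and column norms of K, which guarantees convergence without having to estimate the operator norm ‖K‖.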
GPU-accelerated Red Blood Cells Simulations with Transport Dissipative Particle Dynamics.
Blumers, Ansel L; Tang, Yu-Hang; Li, Zhen; Li, Xuejin; Karniadakis, George E
2017-08-01
Mesoscopic numerical simulations provide a unique approach for the quantification of chemical influences on red blood cell functionalities. The transport Dissipative Particle Dynamics (tDPD) method can enable such effective multiscale simulations due to its ability to simultaneously capture mesoscopic advection, diffusion, and reaction. In this paper, we present a GPU-accelerated red blood cell simulation package based on a tDPD adaptation of our red blood cell model, which can correctly recover the cell membrane viscosity, elasticity, bending stiffness, and cross-membrane chemical transport. The package processes essentially all computational workloads in parallel on the GPU, and it incorporates multi-stream scheduling and non-blocking MPI communications to improve inter-node scalability. Our code is validated for accuracy and compared against the CPU counterpart for speed. Strong scaling and weak scaling are also presented to characterize scalability. We observe a speedup of 10.1 on one GPU over all 16 cores within a single node, and a weak scaling efficiency of 91% across 256 nodes. The program enables quick-turnaround and high-throughput numerical simulations for investigating chemically-driven red blood cell phenomena and disorders.
An Investigation of Unified Memory Access Performance in CUDA
Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin
2015-01-01
Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668
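For concreteness, the managed-memory pattern the paper evaluates looks as follows: one allocation is visible to both host and device, and pages migrate on demand instead of through explicit cudaMemcpy calls. This is a minimal standalone example of the public CUDA API, not code from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float* v = nullptr;
    cudaMallocManaged(&v, n * sizeof(float));   // one pointer for CPU and GPU
    for (int i = 0; i < n; ++i) v[i] = 1.0f;    // CPU writes, no memcpy
    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f);
    cudaDeviceSynchronize();                    // required before CPU reads
    printf("v[0] = %f\n", v[0]);                // prints 2.0
    cudaFree(v);
    return 0;
}
```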
Masunaga, Shin-ichiro; Kimura, Sadaaki; Harada, Tomohiro; Okuda, Kensuke; Sakurai, Yoshinori; Tanaka, Hiroki; Suzuki, Minoru; Kondo, Natsuko; Maruhashi, Akira; Nagasawa, Hideko; Ono, Koji
2012-01-01
Background: To evaluate the usefulness of a novel 10B-carrier conjugated with an integrin-binding cyclic RGD peptide (GPU-201) in boron neutron capture therapy (BNCT). Methods: GPU-201 was synthesized from the integrin-binding Arg-Gly-Asp (RGD) consensus sequence of matrix proteins and a 10B cluster, 1,2-dicarba-closo-dodecaborane-10B. Mercaptododecaborate-10B (BSH) dissolved in physiological saline, and BSH and GPU-201 dissolved with cyclodextrin (CD) as a solubilizing and dispersing agent, were intraperitoneally administered to SCC VII tumor-bearing mice. The 10B concentrations in the tumors and normal tissues were then measured by γ-ray spectrometry. Meanwhile, tumor-bearing mice were continuously given 5-bromo-2'-deoxyuridine (BrdU) to label all proliferating (P) cells in the tumors, then treated with GPU-201, BSH-CD, or BSH. Immediately after reactor neutron beam or γ-ray irradiation, during which intratumor 10B concentrations were kept at similar levels, cells from some tumors were isolated and incubated with a cytokinesis blocker. The responses of the quiescent (Q) and total (= P + Q) cell populations were assessed based on the frequency of micronuclei using immunofluorescence staining for BrdU. Results: The 10B from BSH was washed away rapidly in all these tissues, and the retention of 10B from BSH-CD and GPU-201 was similar except in blood, where the 10B concentration from GPU-201 remained higher for longer. GPU-201 showed a significantly stronger radio-sensitizing effect under neutron beam irradiation on both the total and Q cell populations than any other 10B-carrier. Conclusion: A novel 10B-carrier conjugated with an integrin-binding RGD peptide (GPU-201) that sensitized tumor cells more markedly than conventional 10B-carriers may be a promising candidate for use in BNCT. However, its toxicity needs to be tested further. PMID:29147290
Liu, Yongchao; Wirawan, Adrianto; Schmidt, Bertil
2013-04-04
The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to its quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. We present CUDASW++ 3.0, a fast Smith-Waterman protein database search algorithm which couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. For the GPU computation, we have investigated for the first time a GPU SIMD parallelization which employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective compute capabilities. Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 of up to 2.9× and 3.2×, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, our algorithm has demonstrated significant speedups over other top-performing tools: SWIPE and BLAST+. CUDASW++ 3.0 is written in CUDA C++ and PTX assembly languages, targeting GPUs based on the Kepler architecture. This algorithm obtains significant speedups over its predecessor, CUDASW++ 2.0, by benefiting from the use of CPU and GPU SIMD instructions as well as concurrent execution on CPUs and GPUs. The source code and the simulated data are available at http://cudasw.sourceforge.net.
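The PTX SIMD video instructions mentioned above are exposed in CUDA as integer intrinsics that treat one 32-bit register as four signed bytes, so four alignment cells advance per instruction. The fragment below conveys the flavor with a packed Smith-Waterman cell update; it is illustrative, not CUDASW++ source.

```cuda
#include <cuda_runtime.h>

// diag/up/left hold four packed 8-bit H values; sub holds four packed
// substitution scores; gap holds four copies of the gap penalty.
__device__ unsigned int swCell4(unsigned int diag, unsigned int sub,
                                unsigned int up, unsigned int left,
                                unsigned int gap)
{
    unsigned int h = __vaddss4(diag, sub);   // diagonal + score, saturated
    h = __vmaxs4(h, __vsubss4(up, gap));     // gap from above
    h = __vmaxs4(h, __vsubss4(left, gap));   // gap from the left
    return __vmaxs4(h, 0u);                  // Smith-Waterman clamp at zero
}
```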
Ling, Cheng; Hamada, Tsuyoshi; Gao, Jingyang; Zhao, Guoguang; Sun, Donghong; Shi, Weifeng
2016-01-01
MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost of the likelihood estimation is very expensive, resulting in undesirably long execution times. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high-performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high-performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved in the analysis of simulated datasets by four Tesla K40 cards. In comparison to the other publicly available GPU-oriented versions of MrBayes, the tgMC3++ method (proposed herein) outperforms the tgMC3 (v1.0), nMC3 (v2.1.1), and oMC3 (v1.00) methods by speedup factors of up to 1.6, 1.9, and 2.9, respectively. Moreover, tgMC3++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods failed to include in their analyses.
Multi-phase SPH modelling of violent hydrodynamics on GPUs
NASA Astrophysics Data System (ADS)
Mokos, Athanasios; Rogers, Benedict D.; Stansby, Peter K.; Domínguez, José M.
2015-11-01
This paper presents the acceleration of multi-phase smoothed particle hydrodynamics (SPH) using a graphics processing unit (GPU) enabling large numbers of particles (10-20 million) to be simulated on just a single GPU card. With novel hardware architectures such as a GPU, the optimum approach to implement a multi-phase scheme presents some new challenges. Many more particles must be included in the calculation and there are very different speeds of sound in each phase with the largest speed of sound determining the time step. This requires efficient computation. To take full advantage of the hardware acceleration provided by a single GPU for a multi-phase simulation, four different algorithms are investigated: conditional statements, binary operators, separate particle lists and an intermediate global function. Runtime results show that the optimum approach needs to employ separate cell and neighbour lists for each phase. The profiler shows that this approach leads to a reduction in both memory transactions and arithmetic operations giving significant runtime gains. The four different algorithms are compared to the efficiency of the optimised single-phase GPU code, DualSPHysics, for 2-D and 3-D simulations which indicate that the multi-phase functionality has a significant computational overhead. A comparison with an optimised CPU code shows a speed up of an order of magnitude over an OpenMP simulation with 8 threads and two orders of magnitude over a single thread simulation. A demonstration of the multi-phase SPH GPU code is provided by a 3-D dam break case impacting an obstacle. This shows better agreement with experimental results than an equivalent single-phase code. The multi-phase GPU code enables a convergence study to be undertaken on a single GPU with a large number of particles that otherwise would have required large high performance computing resources.
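Two of the four strategies compared, the divergent conditional and the branch-free arithmetic select, are easy to contrast in miniature. The toy two-phase equation-of-state below is illustrative only (not DualSPHysics code); the paper's winning approach of separate per-phase particle and neighbour lists removes the phase test from the kernel altogether.

```cuda
#include <cuda_runtime.h>

// Strategy 1: conditional statement; threads of different phases diverge.
__device__ float pressureBranch(float rho, int phase,
                                float k0, float k1, float rho0, float rho1)
{
    if (phase == 0) return k0 * (rho - rho0);
    else            return k1 * (rho - rho1);
}

// Strategy 2: arithmetic select on the phase index; no branch at all.
__device__ float pressureSelect(float rho, int phase,
                                float k0, float k1, float rho0, float rho1)
{
    float m = (float)phase;                  // 0.0 or 1.0
    return (1.0f - m) * k0 * (rho - rho0) + m * k1 * (rho - rho1);
}
```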
Towards real-time photon Monte Carlo dose calculation in the cloud
NASA Astrophysics Data System (ADS)
Ziegenhein, Peter; Kozin, Igor N.; Kamerling, Cornelis Ph; Oelfke, Uwe
2017-06-01
Near real-time application of Monte Carlo (MC) dose calculation in clinic and research is hindered by the long computational runtimes of established software. Currently, fast MC software solutions are available utilising accelerators such as graphical processing units (GPUs) or clusters based on central processing units (CPUs). Both platforms are expensive in terms of purchase costs and maintenance and, in case of the GPU, provide only limited scalability. In this work we propose a cloud-based MC solution, which offers high scalability of accurate photon dose calculations. The MC simulations run on a private virtual supercomputer that is formed in the cloud. Computational resources can be provisioned dynamically at low cost without upfront investment in expensive hardware. A client-server software solution has been developed which controls the simulations and transports data to and from the cloud efficiently and securely. The client application integrates seamlessly into a treatment planning system. It runs the MC simulation workflow automatically and securely exchanges simulation data with the server side application that controls the virtual supercomputer. Advanced encryption standards were used to add an additional security layer, which encrypts and decrypts patient data on-the-fly at the processor register level. We could show that our cloud-based MC framework enables near real-time dose computation. It delivers excellent linear scaling for high-resolution datasets with absolute runtimes of 1.1 seconds to 10.9 seconds for simulating a clinical prostate and liver case up to 1% statistical uncertainty. The computation runtimes include the transportation of data to and from the cloud as well as process scheduling and synchronisation overhead. Cloud-based MC simulations offer a fast, affordable and easily accessible alternative for near real-time accurate dose calculations to currently used GPU or cluster solutions.
Parallelization and checkpointing of GPU applications through program transformation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and to develop support for application-level fault tolerance in applications using multiple GPUs. Our techniques reduce the burden of enhancing single-GPU applications to support these features. To achieve our goal, this work designs and implements a framework for enhancing a single-GPU OpenCL application through application transformation.
Numerical simulation of disperse particle flows on a graphics processing unit
NASA Astrophysics Data System (ADS)
Sierakowski, Adam J.
In both nature and technology, we commonly encounter solid particles being carried within fluid flows, from dust storms to sediment erosion and from food processing to energy generation. The motion of uncountably many particles in highly dynamic flow environments characterizes the tremendous complexity of such phenomena. While methods exist for the full-scale numerical simulation of such systems, current computational capabilities require the simplification of the numerical task with significant approximation using closure models widely recognized as insufficient. There is therefore a fundamental need for the investigation of the underlying physical processes governing these disperse particle flows. In the present work, we develop a new tool based on the Physalis method for the first-principles numerical simulation of thousands of particles (a small fraction of an entire disperse particle flow system) in order to assist in the search for new reduced-order closure models. We discuss numerous enhancements to the efficiency and stability of the Physalis method, which introduces the influence of spherical particles to a fixed-grid incompressible Navier-Stokes flow solver using a local analytic solution to the flow equations. Our first-principles investigation demands the modeling of unresolved length and time scales associated with particle collisions. We introduce a collision model alongside Physalis, incorporating lubrication effects and proposing a new nonlinearly damped Hertzian contact model. By reproducing experimental studies from the literature, we document extensive validation of the methods. We discuss the implementation of our methods for massively parallel computation using a graphics processing unit (GPU). We combine Eulerian grid-based algorithms with Lagrangian particle-based algorithms to achieve computational throughput up to 90 times faster than the legacy implementation of Physalis for a single central processing unit. By avoiding all data communication between the GPU and the host system during the simulation, we utilize with great efficacy the GPU hardware with which many high performance computing systems are currently equipped. We conclude by looking forward to the future of Physalis with multi-GPU parallelization in order to perform resolved disperse flow simulations of more than 100,000 particles and further advance the development of reduced-order closure models.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available due to their high cost. The GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole-heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, the multicellular tissue model was split into two components: the single-cell model (a system of ordinary differential equations) and the diffusion term of the monodomain model (a partial differential equation). Such a decoupling enabled the realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole-heart simulations.
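The decoupling described above maps naturally onto two kernels per time step: a per-cell ODE (reaction) update and a stencil update for the monodomain diffusion term. The sketch below shows that structure on a 1D fibre with a placeholder ionic current; the grid dimensionality, names, and cell model are simplifications, not the sheep atrial code.

```cuda
#include <cuda_runtime.h>

// Step 1: cell-model ODE update (placeholder ionic current Iion).
__global__ void reactionStep(float* V, const float* Iion, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) V[i] -= dt * Iion[i];
}

// Step 2: monodomain diffusion via a finite-difference Laplacian.
__global__ void diffusionStep(const float* V, float* Vnew, int n,
                              float dt, float D, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;
    float lap = (V[i - 1] - 2.0f * V[i] + V[i + 1]) / (dx * dx);
    Vnew[i] = V[i] + dt * D * lap;
}
```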
GPU-based Branchless Distance-Driven Projection and Backprojection
Liu, Rui; Fu, Lin; De Man, Bruno; Yu, Hengyong
2017-01-01
Projection and backprojection operations are essential in a variety of image reconstruction and physical correction algorithms in CT. The distance-driven (DD) projection and backprojection are widely used for their highly sequential memory access pattern and low arithmetic cost. However, a typical DD implementation has an inner loop that adjusts the calculation depending on the relative position between voxel and detector cell boundaries. The irregularity of the branch behavior makes it inefficient to implement on massively parallel computing devices such as graphics processing units (GPUs). Such irregular branch behavior can be eliminated by factorizing the DD operation into three branchless steps: integration, linear interpolation, and differentiation, all of which are highly amenable to massive vectorization. In this paper, we implement and evaluate a highly parallel branchless DD algorithm for 3D cone beam CT. The algorithm utilizes the texture memory and hardware interpolation on GPUs to achieve fast computational speed. The developed branchless DD algorithm achieved a 137-fold speedup for forward projection and a 188-fold speedup for backprojection relative to a single-thread CPU implementation. Compared with a state-of-the-art 32-thread CPU implementation, the proposed branchless DD achieved 8-fold acceleration for forward projection and 10-fold acceleration for backprojection. The GPU-based branchless DD method was evaluated with iterative reconstruction algorithms on both simulated and real datasets. It obtained images visually identical to those of the CPU reference algorithm. PMID:29333480
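The three branchless steps are easy to see on a 1D detector profile: integrate once, interpolate the integral at the two cell boundaries, and difference. In the paper this interpolation is done by the texture hardware; the sketch below folds it into an explicit lerp for clarity, and all names are illustrative.

```cuda
#include <cuda_runtime.h>

__device__ float interpIntegral(const float* integ, int n, float pos)
{
    pos = fminf(fmaxf(pos, 0.0f), (float)(n - 1));   // clamp to profile
    int   i0 = (int)pos;
    int   i1 = min(i0 + 1, n - 1);
    float f  = pos - (float)i0;
    return integ[i0] * (1.0f - f) + integ[i1] * f;   // texture-style lerp
}

// integ: prefix-integrated input profile; lo/hi: cell boundary positions
// (assumed lo[c] < hi[c]); out: mean value over each cell, no branches.
__global__ void ddBranchless(const float* integ, int n,
                             const float* lo, const float* hi,
                             float* out, int ncells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= ncells) return;
    out[c] = (interpIntegral(integ, n, hi[c]) -
              interpIntegral(integ, n, lo[c])) / (hi[c] - lo[c]);
}
```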
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu
2011-07-01
Graphics Processing Units (GPUs), originally developed for real-time, high-definition 3D graphics in computer games, now provide great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution of the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic, discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. We also discuss the relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort, and code portability. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
CPU architecture for a fast and energy-saving calculation of convolution neural networks
NASA Astrophysics Data System (ADS)
Knoll, Florian J.; Grelcke, Michael; Czymmek, Vitali; Holtorf, Tim; Hussmann, Stephan
2017-06-01
One of the most difficult problems in the use of artificial neural networks is the computational capacity. Although large search-engine companies own specially developed hardware to provide the necessary computing power, the state-of-the-art method available to the conventional user is the use of a graphics processing unit (GPU) as the computational basis. Although these processors are well suited to large matrix computations, they consume a great deal of energy. Therefore, a new processor based on a field programmable gate array (FPGA) has been developed and optimized for deep learning applications. This processor is presented in this paper. The processor can be adapted to a particular application (in this paper, an organic farming application). Its power consumption is only a fraction of that of a GPU implementation, and it should therefore be well suited for energy-saving applications.
NASA Astrophysics Data System (ADS)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (GDDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated GDDR3 GPU memory, and a second interleaved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated GDDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
A survey of GPU-based medical image computing techniques
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming
2012-01-01
Medical imaging currently plays a crucial role in clinical applications, from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets that must be processed in practical clinical applications. With the rapidly improving performance of graphics processors, better programming support, and an excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical imaging applications. The major purpose of this survey is to provide a comprehensive reference source for beginners and researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely segmentation, registration, and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine. PMID:23256080
NASA Astrophysics Data System (ADS)
Derkachov, G.; Jakubczyk, T.; Jakubczyk, D.; Archer, J.; Woźniak, M.
2017-07-01
Utilising the Compute Unified Device Architecture (CUDA) platform for Graphics Processing Units (GPUs) enables a significant reduction of computation time at a moderate cost by means of parallel computing. In the paper [Jakubczyk et al., Opto-Electron. Rev., 2016] we reported using a GPU for solving the Mie scattering inverse problem (up to an 800-fold speed-up). Here we report the development of two subroutines utilising the GPU at the data preprocessing stages of the inversion procedure: (i) a subroutine, based on ray tracing, for finding the spherical aberration correction function, and (ii) a subroutine performing the conversion of an image to a 1D distribution of light intensity versus azimuth angle (i.e. a scattering diagram), fed from a movie-reading CPU subroutine running in parallel. All subroutines are incorporated in the PikeReader application, which we make available in a GitHub repository. PikeReader returns a sequence of intensity distributions versus a common azimuth angle vector, corresponding to the recorded movie. We obtained an overall ∼400-fold speed-up of calculations at the data preprocessing stages using CUDA code running on the GPU in comparison to a single-thread MATLAB-only code running on the CPU.
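Step (ii), the image to intensity-versus-azimuth conversion, is essentially an azimuthal binning, which parallelizes naturally over pixels. A hedged CUDA sketch (bin count, image center, and the plain atomicAdd reduction are illustrative assumptions, not PikeReader's actual scheme):

```cuda
// Convert a 2D scattering image into per-azimuth-bin intensity sums.
__global__ void azimuthHistogram(const float* img, int w, int h,
                                 float cx, float cy,
                                 float* sum, unsigned int* cnt, int nBins) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float ang = atan2f((float)y - cy, (float)x - cx); // in [-pi, pi]
    int bin = min((int)((ang + 3.14159265f) /
                        (2.0f * 3.14159265f) * nBins), nBins - 1);
    // Accumulate intensity and pixel count per azimuth bin.
    atomicAdd(&sum[bin], img[y * w + x]);
    atomicAdd(&cnt[bin], 1u);
}
```

The mean intensity in bin b is then sum[b]/cnt[b], formed in a trivial follow-up kernel or on the host.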
CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking
van Aart, Evert; Sepasian, Neda; Jalba, Andrei; Vilanova, Anna
2011-01-01
Diffusion Tensor Imaging (DTI) allows noninvasive measurement of the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. PMID:21941525
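The usual parallelization pattern for GPU fiber tracking is one thread per seed point. The sketch below only illustrates that pattern under strong simplifications (fixed-step Euler integration along a precomputed principal-direction field with nearest-neighbor lookup; the actual method integrates geodesics through the tensor metric, and all names are assumptions):

```cuda
// Minimal per-thread streamline sketch: each thread integrates one seed.
struct Vec3 { float x, y, z; };

__device__ Vec3 sampleDir(const Vec3* dirField,
                          int nx, int ny, int nz, Vec3 p) {
    int ix = min(max((int)p.x, 0), nx - 1);
    int iy = min(max((int)p.y, 0), ny - 1);
    int iz = min(max((int)p.z, 0), nz - 1);
    return dirField[(iz * ny + iy) * nx + ix]; // nearest-neighbor lookup
}

__global__ void trackFibers(const Vec3* dirField, int nx, int ny, int nz,
                            const Vec3* seeds, Vec3* paths,
                            int nSeeds, int maxSteps, float step) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSeeds) return;
    Vec3 p = seeds[s];
    for (int k = 0; k < maxSteps; ++k) {
        paths[s * maxSteps + k] = p;          // record the trajectory
        Vec3 d = sampleDir(dirField, nx, ny, nz, p);
        p.x += step * d.x; p.y += step * d.y; p.z += step * d.z;
        if (p.x < 0 || p.y < 0 || p.z < 0 ||
            p.x >= nx || p.y >= ny || p.z >= nz) break; // left the volume
    }
}
```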
Doulgerakis, Matthaios; Eggebrecht, Adam; Wojtkiewicz, Stanislaw; Culver, Joseph; Dehghani, Hamid
2017-12-01
Parameter recovery in diffuse optical tomography is a computationally expensive algorithm, especially when used for large and complex volumes, as in the case of human brain functional imaging. The modeling of light propagation, also known as the forward problem, is the computational bottleneck of the recovery algorithm, whereby the lack of a real-time solution is impeding practical and clinical applications. The objective of this work is the acceleration of the forward model, within a diffusion approximation-based finite-element modeling framework, employing parallelization to expedite the calculation of light propagation in realistic adult head models. The proposed methodology is applicable for modeling both continuous wave and frequency-domain systems with the results demonstrating a 10-fold speed increase when GPU architectures are available, while maintaining high accuracy. It is shown that, for a very high-resolution finite-element model of the adult human head with ∼600,000 nodes, consisting of heterogeneous layers, light propagation can be calculated at ∼0.25 s/excitation source. (2017) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
Exploiting graphics processing units for computational biology and bioinformatics.
Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H
2010-09-01
Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
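Since the article's running example is the all-pairs distance computation, a compact sketch of the canonical CUDA pattern it discusses (coalesced reads plus shared-memory reuse) may help. Tile size, row-major layout, and Euclidean distance are assumptions here rather than the authors' exact kernel:

```cuda
#define TILE 16

// Tiled all-pairs Euclidean distance: each block computes a TILE x TILE
// patch of the distance matrix, staging both instance tiles in shared
// memory so global reads are coalesced and each value is reused TILE times.
__global__ void allPairs(const float* data, int n, int d, float* dist) {
    __shared__ float A[TILE][TILE];
    __shared__ float B[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < d; t += TILE) {
        // Coalesced loads: consecutive threadIdx.x reads consecutive floats.
        A[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < d)
            ? data[row * d + t + threadIdx.x] : 0.0f;
        B[threadIdx.y][threadIdx.x] = (col < n && t + threadIdx.x < d)
            ? data[col * d + t + threadIdx.x] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k) {
            float diff = A[threadIdx.y][k] - B[threadIdx.x][k];
            acc += diff * diff;
        }
        __syncthreads();
    }
    if (row < n && col < n) dist[row * n + col] = sqrtf(acc);
}
```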
NASA Astrophysics Data System (ADS)
Meng, Hui; Hui, Hui; Hu, Chaoen; Yang, Xin; Tian, Jie
2017-03-01
The ability to image neural activity rapidly and at single-neuron resolution makes light-sheet fluorescence microscopy (LSFM) a powerful imaging technique for studying functional neural connections. A state-of-the-art LSFM imaging system can record the neuronal activity of the entire brain of a small animal, such as a zebrafish or C. elegans, at single-neuron resolution. However, stimulated and spontaneous movements of the animal brain result in inconsistent neuron positions during the recording process, and registering the acquired large-scale images with conventional methods is time consuming. In this work, we address the problem of fast registration of neural positions in stacks of LSFM images. This is necessary to register brain structures and activities. To achieve fast registration of neural activities, we present a rigid registration architecture implemented on a Graphics Processing Unit (GPU). In this approach, the image stacks were preprocessed on the GPU by mean stretching to reduce the computational effort. The present image stack was registered to the previous image stack, which was considered the reference. A fast Fourier transform (FFT) algorithm was used to calculate the shift of the image stack. The calculations for image registration were performed in different threads, while the preparation functionality was refactored and called only once by the master thread. We implemented our registration algorithm on an NVIDIA Quadro K4200 GPU under the Compute Unified Device Architecture (CUDA) programming environment. The experimental results showed that the registration can be computed in as little as 550 ms for a full high-resolution brain image. Our approach also has the potential to be used for other dynamic image registrations in biomedical applications.
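One common way to obtain a rigid shift from FFTs is phase correlation. The host-side CUDA/cuFFT sketch below is a hedged illustration of that general technique, not the paper's code; error checking and the final peak search are trimmed, and all buffer names are assumptions:

```cuda
#include <cufft.h>

// C = A * conj(B), normalized to unit magnitude (cross-power spectrum).
__global__ void crossPower(const cufftComplex* A, const cufftComplex* B,
                           cufftComplex* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float re = A[i].x * B[i].x + A[i].y * B[i].y;
    float im = A[i].y * B[i].x - A[i].x * B[i].y;
    float mag = fmaxf(sqrtf(re * re + im * im), 1e-12f);
    C[i].x = re / mag;
    C[i].y = im / mag;
}

void estimateShift(cufftComplex* dRef, cufftComplex* dCur,
                   cufftComplex* dCorr, int nx, int ny, int nz) {
    cufftHandle plan;
    cufftPlan3d(&plan, nz, ny, nx, CUFFT_C2C);
    cufftExecC2C(plan, dRef, dRef, CUFFT_FORWARD);
    cufftExecC2C(plan, dCur, dCur, CUFFT_FORWARD);
    int n = nx * ny * nz;
    crossPower<<<(n + 255) / 256, 256>>>(dRef, dCur, dCorr, n);
    cufftExecC2C(plan, dCorr, dCorr, CUFFT_INVERSE);
    // The peak location of |dCorr| (found on the host after a copy-back,
    // or with a device reduction) gives the integer shift between stacks.
    cufftDestroy(plan);
}
```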
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jia, X; Folkerts, M; Shi, F
Purpose: The Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, repeated effort is often spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross-comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purposes. After the code run is complete, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is fully functional. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
LASSIE: simulating large-scale models of biochemical systems on GPUs.
Tangherloni, Andrea; Nobile, Marco S; Besozzi, Daniela; Mauri, Giancarlo; Cazzaniga, Paolo
2017-05-10
Mathematical modeling and in silico analysis are widely acknowledged as complementary tools to biological laboratory methods, to achieve a thorough understanding of emergent behaviors of cellular processes in both physiological and perturbed conditions. However, the simulation of large-scale models, consisting of hundreds or thousands of reactions and molecular species, can rapidly overtake the capabilities of Central Processing Units (CPUs). The purpose of this work is to exploit alternative high-performance computing solutions, such as Graphics Processing Units (GPUs), to allow the investigation of these models at reduced computational costs. LASSIE is a "black-box" GPU-accelerated deterministic simulator, specifically designed for large-scale models and not requiring any expertise in mathematical modeling, simulation algorithms or GPU programming. Given a reaction-based model of a cellular process, LASSIE automatically generates the corresponding system of Ordinary Differential Equations (ODEs), assuming mass-action kinetics. The numerical solution of the ODEs is obtained by automatically switching between the Runge-Kutta-Fehlberg method in the absence of stiffness and the Backward Differentiation Formulae of first order in the presence of stiffness. The computational performance of LASSIE is assessed using a set of randomly generated synthetic reaction-based models of increasing size, ranging from 64 to 8192 reactions and species, and compared to a CPU implementation of the LSODA numerical integration algorithm. LASSIE adopts a novel fine-grained parallelization strategy to distribute across the GPU cores all the calculations required to solve the system of ODEs. By virtue of this implementation, LASSIE achieves up to a 92× speed-up with respect to LSODA, reducing the running time from approximately 1 month down to 8 h to simulate models consisting of, for instance, four thousand reactions and species. Notably, thanks to its smaller memory footprint, LASSIE is able to perform fast simulations of even larger models, for which the tested CPU implementation of LSODA failed to reach termination. LASSIE is therefore expected to make an important breakthrough in Systems Biology applications, enabling faster and more in-depth computational analyses of large-scale models of complex biological systems.
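A hedged sketch of the fine-grained idea, evaluating a mass-action right-hand side in parallel (an illustration under simplifying assumptions: at most two reactants per reaction, an atomicAdd scatter, and invented array names, not LASSIE's internal layout):

```cuda
// One thread per reaction: compute the mass-action flux and scatter
// stoichiometric contributions into the derivative vector dxdt.
__global__ void massActionRHS(const float* x, float* dxdt,
                              const float* k,        // rate constants
                              const int2* reactants, // indices, -1 = unused
                              const int* nProducts, const int* prodIdx,
                              const int* prodOff, int nReactions) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nReactions) return;
    float flux = k[r];
    int2 re = reactants[r];
    if (re.x >= 0) flux *= x[re.x];
    if (re.y >= 0) flux *= x[re.y];
    // Consumption of reactants, production of products.
    if (re.x >= 0) atomicAdd(&dxdt[re.x], -flux);
    if (re.y >= 0) atomicAdd(&dxdt[re.y], -flux);
    for (int p = 0; p < nProducts[r]; ++p)
        atomicAdd(&dxdt[prodIdx[prodOff[r] + p]], flux);
}
```

dxdt must be zeroed (e.g., with cudaMemset) before each evaluation; a Runge-Kutta or BDF stepper then calls this kernel once per stage.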
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Kyungsang; Ye, Jong Chul, E-mail: jong.ye@kaist.ac.kr; Lee, Taewon
2015-09-15
Purpose: In digital breast tomosynthesis (DBT), scatter correction is highly desirable, as it improves image quality at low doses. Because the DBT detector panel is typically stationary during the source rotation, antiscatter grids are not generally compatible with DBT; thus, a software-based scatter correction is required. This work proposes a fully iterative scatter correction method that uses a novel fast Monte Carlo simulation (MCS) with a tissue-composition ratio estimation technique for DBT imaging. Methods: To apply MCS to scatter estimation, the material composition in each voxel should be known. To overcome the lack of prior accurate knowledge of tissue composition for DBT, a tissue-composition ratio is estimated based on the observation that the breast tissues are principally composed of adipose and glandular tissues. Using this approximation, the composition ratio can be estimated from the reconstructed attenuation coefficients, and the scatter distribution can then be estimated by MCS using the composition ratio. The scatter estimation and image reconstruction procedures can be performed iteratively until an acceptable accuracy is achieved. For practical use, (i) the authors have implemented a fast MCS using a graphics processing unit (GPU), (ii) the MCS is simplified to transport only x-rays in the energy range of 10–50 keV, modeling Rayleigh and Compton scattering and the photoelectric effect using the tissue-composition ratio of adipose and glandular tissues, and (iii) downsampling is used because the scatter distribution varies rather smoothly. Results: The authors have demonstrated that the proposed method can accurately estimate the scatter distribution, and that the contrast-to-noise ratio of the final reconstructed image is significantly improved. The authors validated the performance of the MCS by changing the tissue thickness, composition ratio, and x-ray energy, and confirmed that the tissue-composition ratio estimation was quite accurate under a variety of conditions. The GPU-based fast MCS implementation took approximately 3 s to generate each angular projection for a 6 cm thick breast, which is believed to make this process acceptable for clinical applications. In addition, the clinical preferences of three radiologists were evaluated; the preference for the proposed method over the convolution-based method was statistically significant (p < 0.05, McNemar test). Conclusions: The proposed fully iterative scatter correction method and the GPU-based fast MCS using tissue-composition ratio estimation successfully improved the image quality within a reasonable computational time, which may potentially increase the clinical utility of DBT.
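The two-material assumption makes the composition estimate a per-voxel linear interpolation between known attenuation coefficients, which is trivially parallel. A hedged CUDA sketch (muA and muG, the attenuation coefficients of adipose and glandular tissue at the working energy, the clamp, and all names are assumptions):

```cuda
// Per-voxel glandular fraction from reconstructed attenuation values.
__global__ void glandularFraction(const float* mu, float* wG, int n,
                                  float muA, float muG) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float w = (mu[i] - muA) / (muG - muA); // 0 = adipose, 1 = glandular
    wG[i] = fminf(fmaxf(w, 0.0f), 1.0f);   // clamp to the physical range
}
```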
Determinant Computation on the GPU using the Condensation Method
NASA Astrophysics Data System (ADS)
Anisul Haque, Sardar; Moreno Maza, Marc
2012-02-01
We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating-point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerically stable in the floating-point case. In addition, we compare our code with serial implementations of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has large potential for improving those packages in terms of running time and numerical stability.
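Condensation reduces an n×n determinant to an (n-1)×(n-1) one per step. As a hedged illustration of one Chio-style pivotal condensation step on the GPU (a generic variant written from the published formula, not the authors' code; pivot handling for a near-zero a[0][0] is omitted):

```cuda
// One condensation step: B[i][j] = a00*A[i+1][j+1] - A[i+1][0]*A[0][j+1],
// with det(A) = det(B) / a00^(n-2). Matrices are row-major.
__global__ void condenseStep(const double* a, double* b, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row of B
    int j = blockIdx.x * blockDim.x + threadIdx.x; // col of B
    if (i >= n - 1 || j >= n - 1) return;
    double a00 = a[0];
    b[i * (n - 1) + j] = a00 * a[(i + 1) * n + (j + 1)]
                       - a[(i + 1) * n] * a[j + 1]; // a[j+1] is row 0, col j+1
}
```

The host repeats the step down to a 1×1 matrix, accumulating the a00^(n-2) divisors; their product rescales the final entry into det(A).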
Blocked inverted indices for exact clustering of large chemical spaces.
Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver
2014-09-22
The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
GPU applications for data processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch; Aleksandrov, Andrey; INFN sezione di Napoli, I-80125 Napoli
2015-12-31
Modern experiments that use nuclear photoemulsion require fast and efficient data acquisition from the emulsion. New approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use Graphics Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us raise the scanning speed by a factor of nine.
Szécsi, László; Kacsó, Ágota; Zeck, Günther; Hantz, Péter
2017-01-01
Light stimulation with precise and complex spatial and temporal modulation is required by a number of research fields, including visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus-generating framework (GEARS: GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, so intermediary interfacing libraries could be omitted. This enables the program to perform computationally demanding tasks, such as en-masse random number generation or real-time image processing by local and global operations. PMID:29326579
Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.
Bexelius, Tobias; Sohlberg, Antti
2018-06-01
Statistical SPECT reconstruction can be very time-consuming, especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. An ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between the GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size, 3-bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstruction shows great promise as an everyday clinical reconstruction tool.
GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sengupta, Dipanjan; Song, Shuaiwen; Agarwal, Kapil
2015-11-15
Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lyakh, Dmitry I.
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPUs and the use of shared memory on NVidia GPUs. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on an NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
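The shared-memory staging the abstract refers to is easiest to see in the classic 2D tile transpose, the building block from which permuted tensor transposes are typically assembled. This is a standard textbook kernel, shown here as an illustration rather than TAL-SH code:

```cuda
#define TDIM 32

// Shared-memory tile transpose: both the read and the write phases are
// coalesced; the +1 padding column avoids shared-memory bank conflicts.
__global__ void transpose2D(const float* in, float* out, int w, int h) {
    __shared__ float tile[TDIM][TDIM + 1];
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];
    __syncthreads();
    // Swap block coordinates and write the tile out transposed.
    x = blockIdx.y * TDIM + threadIdx.x;
    y = blockIdx.x * TDIM + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];
}
```

Generalizing to an arbitrary permutation of tensor dimensions builds on the same staging idea, with index arithmetic replacing the simple row/column swap.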
NASA Astrophysics Data System (ADS)
Caplan, R. M.
2013-04-01
We present a simple to use, yet powerful code package called NLSEmagic to numerically integrate the nonlinear Schrödinger equation in one, two, and three dimensions. NLSEmagic is a high-order finite-difference code package which utilizes graphics processing unit (GPU) parallel architectures. The codes running on the GPU are many times faster than their serial counterparts, and are much cheaper to run than on standard parallel clusters. The codes are developed with usability and portability in mind, and therefore are written to interface with MATLAB utilizing custom GPU-enabled C codes with the MEX-compiler interface. The packages are freely distributed, including user manuals and set-up files. Catalogue identifier: AEOJ_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOJ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 124453 No. of bytes in distributed program, including test data, etc.: 4728604 Distribution format: tar.gz Programming language: C, CUDA, MATLAB. Computer: PC, MAC. Operating system: Windows, MacOS, Linux. Has the code been vectorized or parallelized?: Yes. Number of processors used: Single CPU, number of GPU processors dependent on chosen GPU card (max is currently 3072 cores on GeForce GTX 690). Supplementary material: Setup guide, Installation guide. RAM: Highly dependent on dimensionality and grid size. For a typical medium-large problem size in three dimensions, 4 GB is sufficient. Keywords: Nonlinear Schrödinger Equation, GPU, high-order finite difference, Bose-Einstein condensates. Classification: 4.3, 7.7. Nature of problem: Integrate solutions of the time-dependent one-, two-, and three-dimensional cubic nonlinear Schrödinger equation. Solution method: The integrators utilize a fully explicit fourth-order Runge-Kutta scheme in time and both second- and fourth-order differencing in space. The integrators are written to run on NVIDIA GPUs and are interfaced with MATLAB, including built-in visualization and analysis tools. Restrictions: The main restriction for the GPU integrators is the amount of RAM on the GPU, as the code is currently designed for running on a single GPU only. Unusual features: Ability to visualize real-time simulations through the interaction of MATLAB and the compiled GPU integrators. Additional comments: Setup guide and installation guide provided. Program has a dedicated web site at www.nlsemagic.com. Running time: A three-dimensional run with a grid dimension of 87×87×203 for 3360 time steps (100 non-dimensional time units) takes about one and a half minutes on a GeForce GTX 580 GPU card.
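As a hedged sketch of the kind of kernel such an integrator is built on (1D only, second-order differencing, one sign convention for the cubic NLSE i·psi_t = -lap(psi) + s·|psi|²·psi; NLSEmagic itself provides 2D/3D and fourth-order variants, so the names and conventions below are assumptions):

```cuda
#include <cuComplex.h>

// Spatial right-hand side psi_t = i*(lap(psi) - s*|psi|^2*psi), to be
// called once per Runge-Kutta stage; invH2 = 1/h^2 for grid spacing h.
__global__ void nlseRHS(const cuFloatComplex* psi, cuFloatComplex* dpsi,
                        int n, float invH2, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return; // fixed (Dirichlet) boundaries
    cuFloatComplex p = psi[i];
    // Second-order central-difference Laplacian.
    float lapRe = (psi[i-1].x - 2.0f * p.x + psi[i+1].x) * invH2;
    float lapIm = (psi[i-1].y - 2.0f * p.y + psi[i+1].y) * invH2;
    float mag2 = p.x * p.x + p.y * p.y;
    float rx = lapRe - s * mag2 * p.x;
    float ry = lapIm - s * mag2 * p.y;
    // Multiplication by i swaps and negates the real/imaginary parts.
    dpsi[i] = make_cuFloatComplex(-ry, rx);
}
```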
Real-time motion artifacts compensation of ToF sensors data on GPU
NASA Astrophysics Data System (ADS)
Lefloch, Damien; Hoegg, Thomas; Kolb, Andreas
2013-05-01
Over the last decade, ToF sensors have attracted many computer vision and graphics researchers. Nevertheless, ToF devices suffer from severe motion artifacts in dynamic scenes as well as low-resolution depth data, which strongly justifies the importance of a valid correction. To counterbalance this effect, a pre-processing approach is introduced that greatly improves range image data for dynamic scenes. We first demonstrate the robustness of our approach using simulated data and then validate the method using real sensor range data. Our GPU-based processing pipeline enhances range data reliability in real time.
Massively parallel simulator of optical coherence tomography of inhomogeneous turbid media.
Malektaji, Siavash; Lima, Ivan T; Escobar I, Mauricio R; Sherif, Sherif S
2017-10-01
An accurate and practical simulator for Optical Coherence Tomography (OCT) could be an important tool to study the underlying physical phenomena in OCT such as multiple light scattering. Recently, many researchers have investigated simulation of OCT of turbid media, e.g., tissue, using Monte Carlo methods. The main drawback of these earlier simulators is the long computational time required to produce accurate results. We developed a massively parallel simulator of OCT of inhomogeneous turbid media that obtains both Class I diffusive reflectivity, due to ballistic and quasi-ballistic scattered photons, and Class II diffusive reflectivity due to multiply scattered photons. This Monte Carlo-based simulator is implemented on graphic processing units (GPUs), using the Compute Unified Device Architecture (CUDA) platform and programming model, to exploit the parallel nature of propagation of photons in tissue. It models an arbitrary shaped sample medium as a tetrahedron-based mesh and uses an advanced importance sampling scheme. This new simulator speeds up simulations of OCT of inhomogeneous turbid media by about two orders of magnitude. To demonstrate this result, we have compared the computation times of our new parallel simulator and its serial counterpart using two samples of inhomogeneous turbid media. We have shown that our parallel implementation reduced simulation time of OCT of the first sample medium from 407 min to 92 min by using a single GPU card, to 12 min by using 8 GPU cards and to 7 min by using 16 GPU cards. For the second sample medium, the OCT simulation time was reduced from 209 h to 35.6 h by using a single GPU card, and to 4.65 h by using 8 GPU cards, and to only 2 h by using 16 GPU cards. Therefore our new parallel simulator is considerably more practical to use than its central processing unit (CPU)-based counterpart. Our new parallel OCT simulator could be a practical tool to study the different physical phenomena underlying OCT, or to design OCT systems with improved performance. Copyright © 2017 Elsevier B.V. All rights reserved.
GPU-Accelerated Voxelwise Hepatic Perfusion Quantification
Wang, H; Cao, Y
2012-01-01
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast-enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of estimating hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation while maintaining the same accuracy as the conventional method. Using CUDA-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, non-linear least-squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations for different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with those by the CPU using simulated DCE data and experimental DCE MR images from patients. The computation speed is improved by a factor of 30 using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU. To obtain liver perfusion maps with 626400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Employing OpenCL to Accelerate Ab Initio Calculations on Graphics Processing Units.
Kussmann, Jörg; Ochsenfeld, Christian
2017-06-13
We present an extension of our graphics processing units (GPU)-accelerated quantum chemistry package to employ OpenCL compute kernels, which can be executed on a wide range of computing devices like CPUs, Intel Xeon Phi, and AMD GPUs. Here, we focus on the use of AMD GPUs and discuss differences as compared to CUDA-based calculations on NVIDIA GPUs. First illustrative timings are presented for hybrid density functional theory calculations using serial as well as parallel compute environments. The results show that AMD GPUs are as fast or faster than comparable NVIDIA GPUs and provide a viable alternative for quantum chemical applications.
A Fast Full Tensor Gravity computation algorithm for High Resolution 3D Geologic Interpretations
NASA Astrophysics Data System (ADS)
Jayaram, V.; Crain, K.; Keller, G. R.
2011-12-01
We present an algorithm to rapidly calculate the vertical gravity and full tensor gravity (FTG) values due to a 3-D geologic model. This algorithm can be implemented on single-core and multi-core CPU and graphics processing unit (GPU) architectures. Our technique is based on the line-element approximation with a constant density within each grid cell. This type of parameterization is well suited for high-resolution elevation datasets with grid sizes typically in the range of 1 m to 30 m. The large high-resolution data grids in our studies employ a pre-filtered, mipmap-pyramid-type representation of the grid data known as the geometry clipmap. The clipmap was first introduced by Microsoft Research in 2004 for fly-through terrain visualization. This method caches nested rectangular extents of down-sampled data layers in the pyramid to create a view-dependent calculation scheme. Together with the simple grid structure, this allows the gravity to be computed conveniently on the fly, or stored in a highly compressed format. Neither of these capabilities has previously been available. Our approach can perform rapid calculations on large topographies, including crustal-scale models derived from complex geologic interpretations. For example, we used a 1 km sphere model consisting of 105000 cells at 10 m resolution with 100000 gravity stations. The line-element approach took less than 90 seconds to compute the FTG and vertical gravity on an Intel Core i7 CPU at 3.07 GHz utilizing just a single core. Also, unlike traditional gravity computational algorithms, the line-element approach can calculate gravity effects at locations interior or exterior to the model. The only condition that must be met is that the observation point cannot be located directly above the line element; therefore, we perform a location test and then apply the appropriate formulation to those data points. We will present and compare the computational performance of the traditional prism method versus the line-element approach on different CPU-GPU system configurations. The algorithm calculates the expected gravity at station locations where the observed gravity and FTG data were acquired. This algorithm can be used for fast forward-model calculations of 3D geologic interpretations for data from airborne, space, and submarine gravity and FTG instrumentation.
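The summation structure is embarrassingly parallel over stations. A hedged CUDA sketch (one thread per observation point; a point-mass closed form stands in for the exact line-element formula, the singular configuration of a station directly above an element is assumed to have been excluded by the location test, and all names are assumptions):

```cuda
#define GRAV 6.674e-11f  // gravitational constant, SI units

// Station-parallel forward gravity: each thread sums the vertical
// attraction of every cell at its own observation point.
__global__ void forwardGz(const float3* stations, int nSta,
                          const float3* cellCtr, const float* cellMass,
                          int nCells, float* gz) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSta) return;
    float3 p = stations[s];
    float acc = 0.0f;
    for (int c = 0; c < nCells; ++c) {
        float dx = cellCtr[c].x - p.x;
        float dy = cellCtr[c].y - p.y;
        float dz = cellCtr[c].z - p.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        acc += GRAV * cellMass[c] * dz / (r2 * sqrtf(r2)); // vertical part
    }
    gz[s] = acc;
}
```

The FTG components follow the same pattern with the second derivatives of 1/r in place of the vertical term.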
Moving magnets in a micromagnetic finite-difference framework
NASA Astrophysics Data System (ADS)
Rissanen, Ilari; Laurson, Lasse
2018-05-01
We present a method and an implementation for smooth linear motion in a finite-difference-based micromagnetic simulation code, to be used in simulating magnetic friction and other phenomena involving moving microscale magnets. Our aim is to accurately simulate the magnetization dynamics and relative motion of magnets while retaining high computational speed. To this end, we combine techniques for fast scalar potential calculation and cubic b-spline interpolation, parallelizing them on a graphics processing unit (GPU). The implementation also includes the possibility of explicitly simulating eddy currents in the case of conducting magnets. We test our implementation by providing numerical examples of stick-slip motion of thin films pulled by a spring and the effect of eddy currents on the switching time of magnetic nanocubes.
BlochSolver: A GPU-optimized fast 3D MRI simulator for experimentally compatible pulse sequences
NASA Astrophysics Data System (ADS)
Kose, Ryoichi; Kose, Katsumi
2017-08-01
A magnetic resonance imaging (MRI) simulator, which reproduces MRI experiments using computers, has been developed using two graphics-processing-unit (GPU) boards (GTX 1080). The MRI simulator was developed to run according to the pulse sequences used in experiments. Experiments and simulations were performed to demonstrate the usefulness of the MRI simulator for three types of pulse sequences, namely, three-dimensional (3D) gradient-echo, 3D radio-frequency-spoiled gradient-echo, and gradient-echo multislice, with practical matrix sizes. The results demonstrated that the calculation speed using two GPU boards was typically about 7 TFLOPS and about 14 times faster than the calculation speed using CPUs (two 18-core Xeons). We also found that MR images acquired by experiment could be reproduced using an appropriate number of subvoxels, and that 3D isotropic and two-dimensional multislice imaging experiments with practical matrix sizes could be simulated using the MRI simulator. We therefore conclude that such powerful MRI simulators are expected to become an indispensable tool for MRI research and development.
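The core of a Bloch-equation simulator is a per-isochromat update, which maps naturally to one GPU thread per spin. The sketch below is a hedged illustration of one free-precession-plus-relaxation step only (per-spin off-resonance, a simple z-rotation, and all names are assumptions; a full sequence-driven simulator such as the one described also applies RF rotations and gradient-dependent phase per event):

```cuda
// Evolve each isochromat's magnetization M over a time step dt.
__global__ void blochStep(float3* M, const float* dOmega, int n,
                          float dt, float T1, float T2, float M0) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 m = M[i];
    float phi = dOmega[i] * dt;          // precession angle this step
    float c = cosf(phi), s = sinf(phi);
    float mx = c * m.x - s * m.y;        // rotate about the z axis
    float my = s * m.x + c * m.y;
    float e2 = expf(-dt / T2), e1 = expf(-dt / T1);
    // Transverse decay (T2) and longitudinal recovery toward M0 (T1).
    M[i] = make_float3(mx * e2, my * e2, M0 + (m.z - M0) * e1);
}
```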
NASA Astrophysics Data System (ADS)
Ha, Sanghyun; Park, Junshin; You, Donghyun
2017-11-01
The utility of the computational power of modern Graphics Processing Units (GPUs) is examined for solving the incompressible Navier-Stokes equations integrated with a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered a good candidate for evaluating the potential of GPUs for solving the Navier-Stokes equations with non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) method and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink, supported by the Pascal architecture. The performance of the present method is evaluated on multiple Tesla P100 GPUs and compared with that of a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Kohlmeyer, Axel; Plimpton, Steven J
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms for using accelerators in the LAMMPS molecular dynamics software on distributed-memory parallel hybrid machines. In our previous work, we focused on acceleration for short-range models with an approach intended to harness the processing power of both the accelerator and (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle-particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster. We provide a performance comparison of the same kernels compiled with both CUDA and OpenCL. We discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers.
Kwon, M-W; Kim, S-C; Yoon, S-E; Ho, Y-S; Kim, E-S
2015-02-09
A new object tracking mask-based novel-look-up-table (OTM-NLUT) method is proposed and implemented on graphics-processing-units (GPUs) for real-time generation of holographic videos of three-dimensional (3-D) scenes. Since the proposed method is designed to be matched with software and memory structures of the GPU, the number of compute-unified-device-architecture (CUDA) kernel function calls and the computer-generated hologram (CGH) buffer size of the proposed method have been significantly reduced. It therefore results in a great increase of the computational speed of the proposed method and enables real-time generation of CGH patterns of 3-D scenes. Experimental results show that the proposed method can generate 31.1 frames of Fresnel CGH patterns with 1,920 × 1,080 pixels per second, on average, for three test 3-D video scenarios with 12,666 object points on three GPU boards of NVIDIA GTX TITAN, and confirm the feasibility of the proposed method in the practical application of electro-holographic 3-D displays.
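For orientation, point-based computer-generated holography of the kind this line of work accelerates can be written as a hologram-pixel-parallel accumulation. The sketch below illustrates plain Fresnel CGH accumulation, deliberately not the paper's look-up-table scheme; wavelength, pixel pitch, the phase-only encoding, and all names are assumptions:

```cuda
// Each thread computes one hologram pixel by summing the Fresnel phase
// contribution of all object points (x, y, z, amplitude); z > 0 assumed.
__global__ void fresnelCGH(const float4* pts, int nPts,
                           float* holo, int w, int h,
                           float pitch, float lambda) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u >= w || v >= h) return;
    float xh = (u - w / 2) * pitch, yh = (v - h / 2) * pitch;
    float re = 0.0f, im = 0.0f, k = 2.0f * 3.14159265f / lambda;
    for (int p = 0; p < nPts; ++p) {
        float dx = xh - pts[p].x, dy = yh - pts[p].y, z = pts[p].z;
        // Fresnel (paraxial) approximation of the point-source phase.
        float phase = k * (dx * dx + dy * dy) / (2.0f * z);
        re += pts[p].w * cosf(phase);
        im += pts[p].w * sinf(phase);
    }
    holo[v * w + u] = atan2f(im, re); // phase-only hologram encoding
}
```

Table-based methods such as the one in the abstract trade these per-point transcendental evaluations for precomputed fringe patterns, which is where their speed advantage comes from.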
Community Detection on the GPU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Naim, Md; Manne, Fredrik; Halappanavar, Mahantesh
We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem to parallelize the access to individual edges. In this way we can fine-tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive experiments show that we obtain speedups of up to a factor of 270 compared to the sequential algorithm. The algorithm consistently outperforms other recent shared-memory implementations and is only one order of magnitude slower than the current fastest parallel Louvain method running on a Blue Gene/Q supercomputer using more than 500K threads.
Tunable, mixed-resolution modeling using library-based Monte Carlo and graphics processing units
Mamonov, Artem B.; Lettieri, Steven; Ding, Ying; Sarver, Jessica L.; Palli, Rohith; Cunningham, Timothy F.; Saxena, Sunil; Zuckerman, Daniel M.
2012-01-01
Building on our recently introduced library-based Monte Carlo (LBMC) approach, we describe a flexible protocol for mixed coarse-grained (CG)/all-atom (AA) simulation of proteins and ligands. In the present implementation of LBMC, protein side chain configurations are pre-calculated and stored in libraries, while bonded interactions along the backbone are treated explicitly. Because the AA side chain coordinates are maintained at minimal run-time cost, arbitrary sites and interaction terms can be turned on to create mixed-resolution models. For example, an AA region of interest such as a binding site can be coupled to a CG model for the rest of the protein. We have additionally developed a hybrid implementation of the generalized Born/surface area (GBSA) implicit solvent model suitable for mixed-resolution models, which in turn was ported to a graphics processing unit (GPU) for faster calculation. The new software was applied to study two systems: (i) the behavior of spin labels on the B1 domain of protein G (GB1) and (ii) docking of randomly initialized estradiol configurations to the ligand binding domain of the estrogen receptor (ERα). The performance of the GPU version of the code was also benchmarked in a number of additional systems. PMID:23162384
Singular value decomposition utilizing parallel algorithms on graphical processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kotas, Charlotte W; Barhen, Jacob
2011-01-01
One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_{k=1..K} X(k)X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining U, Σ, and V such that A = UΣV^H, where U and V are orthonormal and Σ is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^H A, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphics processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step scheme which bidiagonalizes the matrix using Householder transformations and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to the approach implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel Distributed Processing Symposium 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to UΣ). V is obtained from the sequence of orthogonalizations, while Σ can be found from the square roots of the diagonal elements of A^H A and, once Σ is known, U can be found by column-scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's CUBLAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and the number of concurrent SVDs to be calculated.
Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs), which feature a many-core, fine-grained parallel architecture, provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, pixel by pixel, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models with textures of different dimensionality. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even with very high-dimensional textures. PMID:24723812
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces a possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using synthetic data. Experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation at the group-level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful as a fast screening tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
NASA Astrophysics Data System (ADS)
Montanari, Davide; Scolari, Enrica; Silvestri, Chiara; Jiang Graves, Yan; Yan, Hao; Cervino, Laura; Rice, Roger; Jiang, Steve B.; Jia, Xun
2014-03-01
Cone beam CT (CBCT) has been widely used for patient setup in image-guided radiation therapy (IGRT). Radiation dose from CBCT scans has become a clinical concern. The purposes of this study are (1) to commission a graphics processing unit (GPU)-based Monte Carlo (MC) dose calculation package, gCTD, for the Varian On-Board Imaging (OBI) system and test its calculation accuracy, and (2) to quantitatively evaluate CBCT dose from the OBI system in typical IGRT scan protocols. We first conducted dose measurements in a water phantom. X-ray source model parameters used in gCTD are obtained through a commissioning process. gCTD accuracy is demonstrated by comparing calculations with measurements in water and in CTDI phantoms. Twenty-five brain cancer patients are used to study dose in a standard-dose head protocol, and 25 prostate cancer patients are used to study dose in the pelvis protocol and the pelvis spotlight protocol. Mean dose to each organ is calculated. Mean dose to the 2% of voxels that have the highest dose is also computed to quantify the maximum dose. It is found that the mean dose to an organ varies considerably among patients. Moreover, dose distribution is highly non-homogeneous inside an organ. The maximum dose is found to be 1-3 times higher than the mean dose depending on the organ, and is up to eight times higher for the entire body due to the very high dose region in bony structures. High computational efficiency has also been observed in our studies, such that MC dose calculation time is less than 5 min for a typical case.
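A core operation in any GPU Monte Carlo dose engine of this kind is the concurrent energy-deposition tally; the fragment below is a generic, hedged sketch (not gCTD code) showing why atomic updates are needed when many particle-history threads score into the same voxel.

    // Thousands of particle histories run concurrently and may deposit
    // energy into the same voxel, so the tally must be atomic.
    __device__ void score_energy(float* doseGrid, int voxelIndex, float energyDep)
    {
        atomicAdd(&doseGrid[voxelIndex], energyDep);
    }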
SU-E-T-422: Fast Analytical Beamlet Optimization for Volumetric Intensity-Modulated Arc Therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chan, Kenny S K; Lee, Louis K Y; Xing, L
2015-06-15
Purpose: To implement a fast optimization algorithm on a CPU/GPU heterogeneous computing platform and to obtain an optimal fluence for a given target dose distribution from the pre-calculated beamlets in an analytical approach. Methods: The 2D target dose distribution was modeled as an n-dimensional vector and estimated by a linear combination of independent basis vectors. The basis set was composed of the pre-calculated beamlet dose distributions at every 6 degrees of gantry angle, and the cost function was set as the magnitude square of the vector difference between the target and the estimated dose distribution. The optimal weighting of the basis, which corresponds to the optimal fluence, was obtained analytically by the least squares method. Those basis vectors with a positive weighting were selected for entering into the next level of optimization. In total, 7 levels of optimization were implemented in the study. Ten head-and-neck and ten prostate carcinoma cases were selected for the study and mapped to a round water phantom with a diameter of 20 cm. The Matlab computation was performed in a heterogeneous programming environment with an Intel i7 CPU and an NVIDIA GeForce 840M GPU. Results: In all selected cases, the estimated dose distribution was in good agreement with the given target dose distribution, and their correlation coefficients were found to be in the range of 0.9992 to 0.9997. Their root-mean-square error was monotonically decreasing and converging after 7 cycles of optimization. The computation took only about 10 seconds and the optimal fluence maps at each gantry angle throughout an arc were quickly obtained. Conclusion: An analytical approach is derived for finding the optimal fluence for a given target dose distribution, and a fast optimization algorithm implemented on the CPU/GPU heterogeneous computing environment greatly reduces the optimization time.
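In standard notation, the analytical step described above amounts to solving the normal equations for the beamlet weights; the symbols below are generic (B collects the pre-calculated beamlet dose vectors as columns, d is the target dose vector), not the authors' own notation:

    w^* = \arg\min_{w} \| B w - d \|_2^2
    \quad\Longleftrightarrow\quad
    B^{\mathsf T} B \, w^* = B^{\mathsf T} d

Beamlets receiving non-positive weights are dropped and the solve is repeated, which is what the seven optimization levels iterate.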
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
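For reference, the FTLE field being computed is conventionally defined from the gradient of the flow map Φ over a horizon T (textbook definition, not quoted from the paper):

    \sigma_{t_0}^{T}(x) = \frac{1}{|T|} \ln \sqrt{ \lambda_{\max}\!\left( \big(\nabla \Phi_{t_0}^{t_0+T}(x)\big)^{\mathsf T} \, \nabla \Phi_{t_0}^{t_0+T}(x) \right) }

The dense resampling of particle trajectories needed to evaluate ∇Φ sharply is exactly the memory-bandwidth-bound step that the GPU and APU implementations accelerate.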
GPU real-time processing in NA62 trigger system
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-01-01
A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition systems) of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using an FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstruction obtained during the NA62 physics run will be presented.
GPU computing of compressible flow problems by a meshless method with space-filling curves
NASA Astrophysics Data System (ADS)
Ma, Z. H.; Wang, H.; Pu, S. H.
2014-04-01
A graphics processing unit (GPU) implementation of a meshless method for solving compressible flow problems is presented in this paper. A least-squares fit is used to discretize the spatial derivatives of the Euler equations and an upwind scheme is applied to estimate the flux terms. The compute unified device architecture (CUDA) C programming model is employed to efficiently and flexibly port the meshless solver from CPU to GPU. Considering the data locality of randomly distributed points, space-filling curves are adopted to re-number the points in order to improve the memory performance. Detailed evaluations are first carried out to assess the accuracy and conservation property of the underlying numerical method. Then the GPU-accelerated flow solver is used to solve external steady flows over aerodynamic configurations. Representative results are validated through extensive comparisons with the experimental, finite volume or other available reference solutions. Performance analysis reveals that the running time cost of simulations is significantly reduced while impressive (more than an order of magnitude) speedups are achieved.
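The space-filling-curve renumbering can be illustrated with a Morton (Z-order) key built by interleaving the bits of quantized point coordinates; the following is the standard 30-bit bit-interleaving helper, offered as a hedged sketch since the paper does not specify its exact curve construction.

    // Spread the lower 10 bits of v so that two zero bits separate each
    // original bit (standard helper for 3D Morton encoding).
    __host__ __device__ unsigned int expandBits(unsigned int v)
    {
        v = (v * 0x00010001u) & 0xFF0000FFu;
        v = (v * 0x00000101u) & 0x0F00F00Fu;
        v = (v * 0x00000011u) & 0xC30C30C3u;
        v = (v * 0x00000005u) & 0x49249249u;
        return v;
    }

    // Coordinates assumed normalized to [0,1); 10 bits per axis.
    __host__ __device__ unsigned int morton3D(float x, float y, float z)
    {
        unsigned int xx = expandBits((unsigned int)(x * 1024.0f));
        unsigned int yy = expandBits((unsigned int)(y * 1024.0f));
        unsigned int zz = expandBits((unsigned int)(z * 1024.0f));
        return (xx << 2) | (yy << 1) | zz;
    }

Sorting the points by this key places spatially neighboring points close together in memory, which is what improves coalescing in the least-squares stencil evaluations.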
GPU-enabled molecular dynamics simulations of ankyrin kinase complex
NASA Astrophysics Data System (ADS)
Gautam, Vertika; Chong, Wei Lim; Wisitponchai, Tanchanok; Nimmanpipug, Piyarat; Zain, Sharifuddin M.; Rahman, Noorsaadah Abd.; Tayapiwatana, Chatchai; Lee, Vannajan Sanghiran
2014-10-01
The ankyrin repeat (AR) protein can be used as a versatile scaffold for protein-protein interactions. It has been found that the heterotrimeric complex between integrin-linked kinase (ILK), PINCH, and parvin is an essential signaling platform, serving as a convergence point for integrin and growth-factor signaling and regulating cell adhesion, spreading, and migration. Using ILK-AR with high affinity for PINCH1 as our model system, we explored a structure-based computational protocol to probe and characterize binding affinity hot spots at protein-protein interfaces. In this study, long-timescale, GPU-accelerated molecular dynamics (MD) simulations in AMBER12 were performed to locate the hot spots of the protein-protein interaction through Molecular Mechanics-Poisson-Boltzmann Surface Area/Generalized Born Surface Area (MM-PBSA/GBSA) analysis of the MD trajectories. Our calculations suggest good binding affinity of the complex and also identify the residues critical in the binding.
Lee, Tai-Sung; Hu, Yuan; Sherborne, Brad; Guo, Zhuyan; York, Darrin M
2017-07-11
We report the implementation of the thermodynamic integration method in the pmemd module of the AMBER 16 package on GPUs (pmemdGTI). The pmemdGTI code typically delivers over two orders of magnitude of speedup relative to a single CPU core for the calculation of ligand-protein binding affinities, with no statistically significant numerical differences, and thus provides a powerful new tool for drug discovery applications.
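Thermodynamic integration, the method ported here, obtains the free-energy difference by integrating the ensemble-averaged λ-derivative of the mixed Hamiltonian (standard formulation, not specific to pmemdGTI):

    \Delta G = \int_0^1 \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda} d\lambda

In practice the integral is evaluated by quadrature over a discrete set of λ windows, each requiring its own MD simulation, which is why GPU acceleration of the per-window dynamics translates directly into end-to-end speedup.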
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sengupta, Dipanjan; Agarwal, Kapil; Song, Shuaiwen
2015-09-30
Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and the device.
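A hedged sketch of the edge-centric scatter phase in a Gather-Apply-Scatter step, with one thread per edge and atomic accumulation at the destination vertex (array names are illustrative, not GraphReduce's):

    // One thread per edge; concurrent edges targeting the same vertex
    // are reconciled with an atomic add.
    __global__ void scatter_edges(const int* srcVertex, const int* dstVertex,
                                  const float* vertexValue, float* accum,
                                  int nEdges)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e < nEdges)
            atomicAdd(&accum[dstVertex[e]], vertexValue[srcVertex[e]]);
    }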
High performance MRI simulations of motion on multi-GPU systems.
Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H
2014-07-04
MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times and to avoid spurious echo formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were achieved through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated almost linearly scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to implement motion with a large 3D computational load on a single-computer multi-GPU configuration. The incorporation of realistic motion models, such as cardiac motion, respiratory motion and flow, may benefit the design and optimization of existing or new MR pulse sequences, protocols and algorithms that examine motion-related MR applications.
NASA Astrophysics Data System (ADS)
Timchenko, Leonid; Yarovyi, Andrii; Kokriatskaya, Nataliya; Nakonechna, Svitlana; Abramenko, Ludmila; Ławicki, Tomasz; Popiel, Piotr; Yesmakhanova, Laura
2016-09-01
The paper presents a method of parallel-hierarchical transformations for rapid recognition of dynamic images using GPU technology. The direct parallel-hierarchical transformations are based on a cluster hardware platform oriented toward both CPUs and GPUs. Mathematical models of training of the parallel-hierarchical (PH) network for the transformation are developed, as well as a training method of the PH network for recognition of dynamic images. This research is most relevant to problems of organizing high-performance computations over very large arrays of information, designed to implement multi-stage sensing and processing as well as compaction and recognition of data in informational structures and computer devices. The method has such advantages as high performance through the use of recent advances in parallelization, the ability to work with images of very high dimension, ease of scaling when the number of nodes in the cluster changes, and automatic scanning of the local network to detect compute nodes.
GPU-based, Microsecond-Latency, Hecto-Channel MIMO Feedback Control of Magnetically Confined Plasmas
NASA Astrophysics Data System (ADS)
Rath, Nikolaus
Feedback control has become a crucial tool in the research on magnetic confinement of plasmas for achieving controlled nuclear fusion. This thesis presents a novel plasma feedback control system that, for the first time, employs a Graphics Processing Unit (GPU) for microsecond-latency, real-time control computations. This novel application area for GPU computing is opened up by a new system architecture that is optimized for low-latency computations on less than kilobyte sized data samples as they occur in typical plasma control algorithms. In contrast to traditional GPU computing approaches that target complex, high-throughput computations with massive amounts of data, the architecture presented in this thesis uses the GPU as the primary processing unit rather than as an auxiliary of the CPU, and data is transferred from A-D/D-A converters directly into GPU memory using peer-to-peer PCI Express transfers. The described design has been implemented in a new, GPU-based control system for the High-Beta Tokamak - Extended Pulse (HBT-EP) device. The system is built from commodity hardware and uses an NVIDIA GeForce GPU and D-TACQ A-D/D-A converters providing a total of 96 input and 64 output channels. The system is able to run with sampling periods down to 4 μs and latencies down to 8 μs. The GPU provides a total processing power of 1.5 × 10^12 floating point operations per second. To illustrate the performance and versatility of both the general architecture and concrete implementation, a new control algorithm has been developed. The algorithm is designed for the control of multiple rotating magnetic perturbations in situations where the plasma equilibrium is not known exactly and features an adaptive system model: instead of requiring the rotation frequencies and growth rates embedded in the system model to be set a priori, the adaptive algorithm derives these parameters from the evolution of the perturbation amplitudes themselves. This results in non-linear control computations with high computational demands, but is handled easily by the GPU-based system. Both digital processing latency and an arbitrary multi-pole response of amplifiers and control coils are fully taken into account for the generation of control signals. To separate sensor signals into perturbed and equilibrium components without knowledge of the equilibrium fields, a new separation method based on biorthogonal decomposition is introduced and used to derive a filter that performs the separation in real-time. The control algorithm has been implemented and tested on the new, GPU-based feedback control system of the HBT-EP tokamak. In this instance, the algorithm was set up to control four rotating n = 1 perturbations at different poloidal angles. The perturbations were treated as coupled in frequency but independent in amplitude and phase, so that the system effectively controls a helical n = 1 perturbation with unknown poloidal spectrum. Depending on the plasma's edge safety factor and rotation frequency, the control system is shown to be able to suppress the amplitude of the dominant 8 kHz mode by up to 60% or amplify the saturated amplitude by a factor of up to two. Intermediate feedback phases combine suppression and amplification with a speed-up or slow-down of the mode rotation frequency. Increasing feedback gain results in the excitation of an additional, slowly rotating 1.4 kHz mode without further effects on the 8 kHz mode.
The feedback performance is found to exceed previous results obtained with an FPGA- and Kalman-filter based control system without requiring any tuning of system model parameters. Experimental results are compared with simulations based on a combination of the Boozer surface current model and the Fitzpatrick-Aydemir model. Within the subset of phenomena that can be represented by the model as well as determined experimentally, qualitative agreement is found.
Ren, Shanshan; Bertels, Koen; Al-Ars, Zaid
2018-01-01
GATK HaplotypeCaller (HC) is a popular variant caller, which is widely used to identify variants in complex genomes. However, its high variant-detection accuracy comes at the cost of long execution times. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. This article presents several GPU-based implementations of the pair-HMMs forward algorithm. It also analyzes the performance bottlenecks of the implementations on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.
GAPD: a GPU-accelerated atom-based polychromatic diffraction simulation code.
E, J C; Wang, L; Chen, S; Zhang, Y Y; Luo, S N
2018-03-01
GAPD, a graphics-processing-unit (GPU)-accelerated atom-based polychromatic diffraction simulation code, is presented. It performs direct, kinematics-based simulations of X-ray/electron diffraction of large-scale atomic systems with mono-/polychromatic beams and arbitrary plane detector geometries. The code implements GPU parallel computation via both real- and reciprocal-space decompositions. With GAPD, direct simulations are performed of the reciprocal lattice nodes of ultralarge systems (∼5 billion atoms) and of diffraction patterns of single-crystal and polycrystalline configurations with mono- and polychromatic X-ray beams (including synchrotron undulator sources); validation, benchmark and application cases are presented.
Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units
NASA Astrophysics Data System (ADS)
Kemal, Jonathan Yashar
For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute nodes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture, and compare our resulting performance against Intel Xeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate cylinder computational grid and ran a turbulent steady flow simulation using 4 increasingly dense computational grids. Our densest computational grid is divided into 13 blocks each containing 1033x1033 grid points, for a total of 13.87 million grid points or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource-intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than a single-CPU simulation.
Gravitational tree-code on graphics processing units: implementation in CUDA
NASA Astrophysics Data System (ADS)
Gaburov, Evghenii; Bédorf, Jeroen; Portegies Zwart, Simon
2010-05-01
We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPUs) with the NVIDIA CUDA architecture. The tree construction and calculation of multipole moments are carried out on the host CPU, while the force calculation, which consists of tree walks and evaluation of interaction lists, is carried out on the GPU. In this way we achieve a sustained performance of about 100 GFLOP/s and data transfer rates of about 50 GB/s. It takes about a second to compute forces on a million particles with an opening angle of θ ≈ 0.5. The code has a convenient user interface and is freely available for use. http://castle.strw.leidenuniv.nl/software/octgrav.html
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.
Chou, Cheng-Ying; Dong, Yun; Hung, Yukai; Kao, Yu-Jiun; Wang, Weichung; Kao, Chien-Min; Chen, Chin-Tu
2012-01-01
Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguishing features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
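The OSEM iteration being accelerated has the standard multiplicative form (textbook notation, not the authors' own: a_ij is the system matrix, g the measured coincidences, S_b the b-th subset of lines of response):

    f_j^{(k,b+1)} = \frac{f_j^{(k,b)}}{\sum_{i \in S_b} a_{ij}} \sum_{i \in S_b} a_{ij} \, \frac{g_i}{\sum_{l} a_{il} f_l^{(k,b)}}

The forward projection Σ_l a_il f_l and the subsequent backprojection over each subset dominate the cost, and the system's geometric symmetries let many a_ij values be reused across GPU threads.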
The GPU implementation of micro-Doppler period estimation
NASA Astrophysics Data System (ADS)
Yang, Liyuan; Wang, Junling; Bi, Ran
2018-03-01
To address the high computational complexity and poor real-time performance of processing wideband radar echo signals, this paper designs a program to improve the real-time extraction of micro-motion features, based on a CPU-GPU heterogeneous parallel structure. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of scattering points on an orbiting satellite, analyse how a Kalman filter can compensate the translational motion of a tumbling satellite, and show how joint time-frequency analysis and the inverse Radon transform can extract the micro-motion features from the compensated echo. Secondly, the advantages of the GPU for real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a GPU-based program flow for extracting the micro-motion features from the radar echo signal of a rolling satellite is designed. At the end of the article, extraction results are given to verify the correctness of the program and algorithm.
Fast precalculated triangular mesh algorithm for 3D binary computer-generated holograms.
Yang, Fan; Kaczorowski, Andrzej; Wilkinson, Tim D
2014-12-10
A new method for constructing computer-generated holograms using a precalculated triangular mesh is presented. The speed of calculation can be increased dramatically by exploiting both the precalculated base triangle and GPU parallel computing. Unlike algorithms using point-based sources, this method can reconstruct a more vivid 3D object instead of a "hollow image." In addition, there is no need to do a fast Fourier transform for each 3D element every time. A ferroelectric liquid crystal spatial light modulator is used to display the binary hologram within our experiment and the hologram of a base right triangle is produced by utilizing just a one-step Fourier transform in the 2D case, which can be expanded to the 3D case by multiplying by a suitable Fresnel phase plane. All 3D holograms generated in this paper are based on Fresnel propagation; thus, the Fresnel plane is treated as a vital element in producing the hologram. A GeForce GTX 770 graphics card with 2 GB memory is used to achieve parallel computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lindstrom, P; Cohen, J D
We present a streaming geometry compression codec for multiresolution, uniformly-gridded, triangular terrain patches that supports very fast decompression. Our method is based on linear prediction and residual coding for lossless compression of the full-resolution data. As simplified patches on coarser levels in the hierarchy already incur some data loss, we optionally allow further quantization for more lossy compression. The quantization levels are adaptive on a per-patch basis, while still permitting seamless, adaptive tessellations of the terrain. Our geometry compression on such a hierarchy achieves compression ratios of 3:1 to 12:1. Our scheme is not only suitable for fast decompression on the CPU, but also for parallel decoding on the GPU with peak throughput over 2 billion triangles per second. Each terrain patch is independently decompressed on the fly from a variable-rate bitstream by a GPU geometry program with no branches or conditionals. Thus we can store the geometry compressed on the GPU, reducing storage and bandwidth requirements throughout the system. In our rendering approach, only compressed bitstreams and the decoded height values in the view-dependent 'cut' are explicitly stored on the GPU. Normal vectors are computed in a streaming fashion, and remaining geometry and texture coordinates, as well as mesh connectivity, are shared and re-used for all patches. We demonstrate and evaluate our algorithms on a small prototype system in which all compressed geometry fits in the GPU memory and decompression occurs on the fly every rendering frame without any cache maintenance.
GPU Accelerated Clustering for Arbitrary Shapes in Geoscience Data
NASA Astrophysics Data System (ADS)
Pankratius, V.; Gowanlock, M.; Rude, C. M.; Li, J. D.
2016-12-01
Clustering algorithms have become a vital component in intelligent systems for geoscience that help scientists discover and track phenomena of various kinds. Here, we outline advances in Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which detects clusters of arbitrary shape that are common in geospatial data. In particular, we propose a hybrid CPU-GPU implementation of DBSCAN and highlight new optimization approaches on the GPU that allow cluster detection in parallel while optimizing data transport during CPU-GPU interactions. We employ an efficient batching scheme between the host and GPU such that limited GPU memory is not prohibitive when processing large and/or dense datasets. To minimize data transfer overhead, we estimate the total workload size and employ an execution that generates optimized batches that will not overflow the GPU buffer. This work is demonstrated on space weather Total Electron Content (TEC) datasets containing over 5 million measurements from instruments worldwide, and allows scientists to spot spatially coherent phenomena with ease. Our approach is up to 30 times faster than a sequential implementation and therefore accelerates discoveries in large datasets. We acknowledge support from NSF ACI-1442997.
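A minimal host-side sketch of the memory-aware batching idea: query the free device memory and size each batch so transfers never overflow the GPU buffer. The headroom fraction and bytes-per-point figure are illustrative assumptions, not values from the paper.

    #include <cuda_runtime.h>

    // Return how many data points fit in one GPU batch, leaving headroom.
    size_t pointsPerBatch(size_t bytesPerPoint)
    {
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);    // query device memory
        size_t usable = (size_t)(0.8 * freeBytes);  // keep 20% headroom (assumption)
        return usable / bytesPerPoint;
    }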
Aspects of GPU performance in algorithms with random memory access
NASA Astrophysics Data System (ADS)
Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.
2017-10-01
The numerical code for solving the Boltzmann equation on a hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that on Tesla K40 accelerators computational performance drops dramatically as the percentage of occupied GPU memory increases. Testing revealed that memory access time increases tens of times after a certain critical percentage of memory is occupied. Moreover, this seems to be a common problem of all NVIDIA GPUs, arising from their architecture. A few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on splitting the memory into "virtual" blocks, resulted in a 2.5 times speedup.
GPU-accelerated computational tool for studying the effectiveness of asteroid disruption techniques
NASA Astrophysics Data System (ADS)
Zimmerman, Ben J.; Wie, Bong
2016-10-01
This paper presents the development of a new Graphics Processing Unit (GPU) accelerated computational tool for asteroid disruption techniques. Numerical simulations are completed using the high-order spectral difference (SD) method. Due to the compact nature of the SD method, it is well suited for implementation with the GPU architecture, hence solutions are generated at orders of magnitude faster than the Central Processing Unit (CPU) counterpart. A multiphase model integrated with the SD method is introduced, and several asteroid disruption simulations are conducted, including kinetic-energy impactors, multi-kinetic energy impactor systems, and nuclear options. Results illustrate the benefits of using multi-kinetic energy impactor systems when compared to a single impactor system. In addition, the effectiveness of nuclear options is observed.
The Metropolis Monte Carlo method with CUDA enabled Graphic Processing Units
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hall, Clifford; School of Physics, Astronomy, and Computational Sciences, George Mason University, 4400 University Dr., Fairfax, VA 22030; Ji, Weixiao
2014-02-01
We present a CPU-GPU system for runtime acceleration of large molecular simulations using GPU computation and memory swaps. The memory architecture of the GPU can be used both as container for simulation data stored on the graphics card and as floating-point code target, providing an effective means for the manipulation of atomistic or molecular data on the GPU. To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform atomistic and molecular simulations are essential. Our system implements a versatile molecular engine, including inter-molecule interactions and orientational variables, for performing the Metropolis Monte Carlo (MMC) algorithm, which is one type of Markov chain Monte Carlo. By combining memory objects with floating-point code fragments we have implemented an MMC parallel engine that entirely avoids the communication time of molecular data at runtime. Our runtime acceleration system is a forerunner of a new class of CPU-GPU algorithms exploiting memory concepts combined with threading for avoiding bus bandwidth and communication. The testbed molecular system used here is a condensed phase system of oligopyrrole chains. A benchmark shows a size scaling speedup of 60 for systems with 210,000 pyrrole monomers. Our implementation can easily be combined with MPI to connect in parallel several CPU-GPU duets. Highlights: • We parallelize the Metropolis Monte Carlo (MMC) algorithm on one CPU-GPU duet. • The Adaptive Tempering Monte Carlo employs MMC and profits from this CPU-GPU implementation. • Our benchmark shows a size scaling speedup of 62 for systems with 225,000 particles. • The testbed involves a polymeric system of oligopyrroles in the condensed phase. • The CPU-GPU parallelization includes dipole-dipole and Mie-Jones classic potentials.
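A minimal sketch of the Metropolis acceptance test as it typically appears inside a GPU kernel, using cuRAND for per-thread random numbers; the names and the precomputed energy differences are placeholders, not code from this system.

    #include <curand_kernel.h>

    // One proposed move per thread: accept downhill moves always, uphill
    // moves with probability exp(-beta * dE), where beta = 1/(kT).
    __global__ void metropolis_step(const float* energyDelta, int* accept,
                                    curandState* states, float beta, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float dE = energyDelta[i];
            float u  = curand_uniform(&states[i]);  // uniform in (0,1]
            accept[i] = (dE <= 0.0f) || (u < expf(-beta * dE));
        }
    }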
NASA Astrophysics Data System (ADS)
Bergmann, Ryan
Graphics processing units, or GPUs, have gradually increased in computational power from the small, job-specific boards of the early 1990s to the programmable powerhouses of today. Compared to more common central processing units, or CPUs, GPUs have a higher aggregate memory bandwidth, much higher floating-point operations per second (FLOPS), and lower energy consumption per FLOP. Because one of the main obstacles in exascale computing is power consumption, many new supercomputing platforms are gaining much of their computational capacity by incorporating GPUs into their compute nodes. Since CPU-optimized parallel algorithms are not directly portable to GPU architectures (or at least not without losing substantial performance), transport codes need to be rewritten to execute efficiently on GPUs. Unless this is done, reactor simulations cannot take full advantage of these new supercomputers. WARP, which can stand for "Weaving All the Random Particles," is a three-dimensional (3D) continuous energy Monte Carlo neutron transport code developed in this work to efficiently implement a continuous energy Monte Carlo neutron transport algorithm on a GPU. WARP accelerates Monte Carlo simulations while preserving the benefits of using the Monte Carlo method, namely, very few physical and geometrical simplifications. WARP is able to calculate multiplication factors, flux tallies, and fission source distributions for time-independent problems, and can run in both criticality and fixed source modes. WARP can transport neutrons in unrestricted arrangements of parallelepipeds, hexagonal prisms, cylinders, and spheres. WARP uses an event-based algorithm, but with some important differences. Moving data is expensive, so WARP uses a remapping vector of pointer/index pairs to direct GPU threads to the data they need to access. The remapping vector is sorted by reaction type after every transport iteration using a high-efficiency parallel radix sort, which serves to keep the reaction types as contiguous as possible and removes completed histories from the transport cycle. The sort reduces the amount of divergence in GPU "thread blocks," keeps the SIMD units as full as possible, and eliminates using memory bandwidth to check if a neutron in the batch has been terminated or not. Using a remapping vector means the data access pattern is irregular, but this is mitigated by using large batch sizes where the GPU can effectively eliminate the high cost of irregular global memory access. WARP modifies the standard unionized energy grid implementation to reduce memory traffic. Instead of storing a matrix of pointers indexed by reaction type and energy, WARP stores three matrices. The first contains cross section values, the second contains pointers to angular distributions, and a third contains pointers to energy distributions. This linked-list type of layout increases memory usage, but lowers the number of data loads that are needed to determine a reaction by eliminating a pointer load to find a cross section value. Optimized, high-performance GPU code libraries are also used by WARP wherever possible. The CUDA performance primitives (CUDPP) library is used to perform the parallel reductions, sorts and sums, the CURAND library is used to seed the linear congruential random number generators, and the OptiX ray tracing framework is used for geometry representation.
OptiX is a highly-optimized library developed by NVIDIA that automatically builds hierarchical acceleration structures around user-input geometry, so only surfaces along a ray line need to be queried in ray tracing. WARP also performs material and cell number queries with OptiX by using a point-in-polygon-like algorithm. WARP has shown that GPUs are an effective platform for performing Monte Carlo neutron transport with continuous energy cross sections. Currently, WARP is the most detailed and feature-rich program in existence for performing continuous energy Monte Carlo neutron transport in general 3D geometries on GPUs, but compared to production codes like Serpent and MCNP, WARP has limited capabilities. Despite WARP's lack of features, its novel algorithm implementations show that high performance can be achieved on a GPU despite the inherently divergent program flow and sparse data access patterns. WARP is not ready for everyday nuclear reactor calculations, but is a good platform for further development of GPU-accelerated Monte Carlo neutron transport. In its current state, it may be a useful tool for multiplication factor searches, i.e. determining reactivity coefficients by perturbing material densities or temperatures, since these types of calculations typically do not require many flux tallies. (Abstract shortened by UMI.)
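A minimal sketch of the remapping-vector idea WARP describes: sort an index vector by reaction type each iteration so threads handling the same reaction become contiguous. Thrust's radix-backed sort_by_key is used here as a stand-in for the high-efficiency parallel radix sort mentioned above; this is illustrative, not WARP source code.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>

    // Keys: reaction type per history; values: indices into particle arrays.
    void remap_by_reaction(thrust::device_vector<int>& rxnType,
                           thrust::device_vector<int>& remap)
    {
        thrust::sort_by_key(thrust::device,
                            rxnType.begin(), rxnType.end(),
                            remap.begin());
        // Threads now index particle data through 'remap', so histories with
        // the same reaction type run in adjacent threads (less divergence).
    }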
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient and high-performance solvers is to design them to work on many small-size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
Sharp, G C; Kandasamy, N; Singh, H; Folkert, M
2007-10-07
This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup--up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
GPU-accelerated algorithms for many-particle continuous-time quantum walks
NASA Astrophysics Data System (ADS)
Piccinini, Enrico; Benedetti, Claudia; Siloi, Ilaria; Paris, Matteo G. A.; Bordone, Paolo
2017-06-01
Many-particle continuous-time quantum walks (CTQWs) represent a resource for several tasks in quantum technology, including quantum search algorithms and universal quantum computation. In order to design and implement CTQWs in a realistic scenario, one needs effective simulation tools for Hamiltonians that take into account static noise and fluctuations in the lattice, i.e. Hamiltonians containing stochastic terms. To this aim, we suggest a parallel algorithm based on the Taylor series expansion of the evolution operator, and compare its performance with those of algorithms based on the exact diagonalization of the Hamiltonian or a 4th-order Runge-Kutta integration. We prove that both the Taylor-series expansion and Runge-Kutta algorithms are reliable and have a low computational cost, the Taylor-series expansion showing the additional advantage of a memory allocation not depending on the precision of calculation. Both algorithms are also highly parallelizable within the SIMT paradigm, and are thus suitable for GPGPU computing. In turn, we have benchmarked 4 NVIDIA GPUs and 3 quad-core Intel CPUs for a 2-particle system over lattices of increasing dimension, showing that the speedup provided by GPU computing, with respect to the OPENMP parallelization, lies in the range between 8x and (more than) 20x, depending on the frequency of post-processing. GPU-accelerated codes thus allow one to overcome concerns about the execution time, and make simulations with many interacting particles on large lattices possible, with the only limit being the memory available on the device.
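The Taylor-series propagation benchmarked here approximates the evolution operator over a short step Δt by a truncated series applied through repeated matrix-vector products (standard form, with K the truncation order):

    |\psi(t+\Delta t)\rangle \approx \sum_{k=0}^{K} \frac{(-i H \Delta t)^k}{k!} \, |\psi(t)\rangle

Since each term only requires applying H to the previous term's vector, the memory footprint is fixed by the state dimension rather than by the requested precision, which is the advantage noted in the abstract.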
Storage strategies of eddy-current FE-BI model for GPU implementation
NASA Astrophysics Data System (ADS)
Bardel, Charles; Lei, Naiguang; Udpa, Lalita
2013-01-01
In the past few years graphical processing units (GPUs) have shown tremendous improvements in computational throughput over standard CPU architectures. However, this comes at the cost of restructuring the algorithms to match the strengths and drawbacks of the GPU architecture. A major drawback is the limited memory, and hence the storage of FE stiffness matrices on the GPU is important. In contrast to storage on the CPU, the GPU storage format has a significant influence on the overall performance. This paper presents an investigation of a storage strategy in the implementation of a two-dimensional finite element-boundary integral (FE-BI) model for eddy-current NDE applications on GPU architecture. Specifically, the high-dimensional matrices are manipulated by examining the matrix structure and optimally splitting it into structurally independent component matrices for efficient storage and retrieval of each component. Results obtained using the proposed approach are compared to those of a conventional CPU implementation to validate the method.
NASA Astrophysics Data System (ADS)
Maffucci, Irene; Hu, Xiao; Fumagalli, Valentina; Contini, Alessandro
2018-03-01
Nwat-MMGBSA is a variant of MM-PB/GBSA based on the inclusion of a number of explicit water molecules that are the closest to the ligand in each frame of a molecular dynamics trajectory. This method demonstrated improved correlations between calculated and experimental binding energies in both protein-protein interactions and ligand-receptor complexes, in comparison to the standard MM-GBSA. A protocol optimization, aimed to maximize efficacy and efficiency, is discussed here considering penicillopepsin, HIV1-protease, and BCL-XL as test cases. Calculations were performed in triplicates on both classic HPC environments and on standard workstations equipped by a GPU card, evidencing no statistical differences in the results. No relevant differences in correlation to experiments were also observed when performing Nwat-MMGBSA calculations on 4 ns or 1 ns long trajectories. A fully automatic workflow for structure-based virtual screening, performing from library set-up to docking and Nwat-MMGBSA rescoring, has then been developed. The protocol has been tested against no rescoring or standard MM-GBSA rescoring within a retrospective virtual screening of inhibitors of AmpC β-lactamase and of the Rac1-Tiam1 protein-protein interaction. In both cases, Nwat-MMGBSA rescoring provided a statistically significant increase in the ROC AUCs of between 20% and 30%, compared to docking scoring or to standard MM-GBSA rescoring.
Analysis of impact of general-purpose graphics processor units in supersonic flow modeling
NASA Astrophysics Data System (ADS)
Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.
2017-06-01
Computational methods are widely used in the prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPUs) provide architectures and new programming models that make it possible to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of the use of GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high-resolution numerical schemes. CUDA technology is used for the programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the computed results are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. The speedup of the solution on GPUs with respect to the solution on a central processing unit (CPU) is compared. Performance measurements show that the numerical schemes developed achieve a 20-50× speedup on GPU hardware compared to the CPU reference implementation. The results obtained provide a promising perspective for designing a GPU-based software framework for applications in CFD.
Kinematic modelling of disc galaxies using graphics processing units
NASA Astrophysics Data System (ADS)
Bekiaris, G.; Glazebrook, K.; Fluke, C. J.; Abraham, R.
2016-01-01
With large-scale integral field spectroscopy (IFS) surveys of thousands of galaxies currently underway or planned, the astronomical community is in need of methods, techniques and tools that will allow the analysis of huge amounts of data. We focus on the kinematic modelling of disc galaxies and investigate the potential use of massively parallel architectures, such as the graphics processing unit (GPU), as an accelerator for the computationally expensive model-fitting procedure. We review the algorithms involved in model-fitting and evaluate their suitability for GPU implementation. We employ different optimization techniques, including the Levenberg-Marquardt and nested sampling algorithms, but also a naive brute-force approach based on nested grids. We find that the GPU can accelerate the model-fitting procedure up to a factor of ˜100 when compared to a single-threaded CPU, and up to a factor of ˜10 when compared to a multithreaded dual-CPU configuration. Our method's accuracy, precision and robustness are assessed by successfully recovering the kinematic properties of simulated data, and also by verifying the kinematic modelling results of galaxies from the GHASP and DYNAMO surveys as found in the literature. The resulting GBKFIT code is available for download from: http://supercomputing.swin.edu.au/gbkfit.
Multi-GPU maximum entropy image synthesis for radio astronomy
NASA Astrophysics Data System (ADS)
Cárcamo, M.; Román, P. E.; Casassus, S.; Moral, V.; Rannou, F. R.
2018-01-01
The maximum entropy method (MEM) is a well-known deconvolution technique in radio interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user-dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, and it offers better resolution and better image quality under certain conditions. This work presents a high-performance GPU version of non-gridding MEM, which is tested using real and simulated data. We propose a single-GPU and a multi-GPU implementation for single and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to be exploited transparently and efficiently. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup of 1000 to 5000 times over a sequential version can be achieved, depending on data and image size. This allows the HD142527 CO(6-5) short baseline data set to be reconstructed in 2.1 min, instead of the 2.5 days taken by a sequential version on the CPU.
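In generic notation, the MEM image I is the solution of a regularized fit to the visibilities V (standard form; the symbols are mine, not taken from this implementation):

    I^* = \arg\min_{I \ge 0} \; \chi^2(V, I) + \lambda \sum_i I_i \ln \frac{I_i}{M_i}

where M is a prior (default) image and λ trades data fidelity against entropy. In the non-gridding formulation, each gradient evaluation touches every visibility-pixel pair, which is the massively parallel workload mapped onto the GPUs.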
3D Parallel Multigrid Methods for Real-Time Fluid Simulation
NASA Astrophysics Data System (ADS)
Wan, Feifei; Yin, Yong; Zhang, Suiyu
2018-03-01
The multigrid method is widely used in fluid simulation because of its strong convergence. In addition to accuracy, efficiency is an important consideration for enabling real-time fluid simulation in computer graphics. For this problem, we compared the performance of the Algebraic Multigrid and the Geometric Multigrid in the V-Cycle and Full-Cycle schemes, respectively, and analyzed the convergence and speed of the different methods. All calculations in this paper are performed in parallel on the GPU. Finally, we run experiments on 3D grids at several scales and report the experimental results.
NASA Astrophysics Data System (ADS)
Nikolskiy, V. P.; Stegailov, V. V.
2018-01-01
Metal nanoparticles (NPs) serve as important tools for many modern technologies. However, proper microscopic models of the interaction between ultrashort laser pulses and metal NPs are currently not very well developed in many cases. One part of the problem is the description of the warm dense matter that is formed in NPs after intense irradiation. Another part of the problem is the description of the electromagnetic waves around NPs. Description of wave propagation requires the solution of Maxwell's equations, and the finite-difference time-domain (FDTD) method is the classic approach for solving them. There are many commercial and free implementations of FDTD, including open source software that supports graphics processing unit (GPU) acceleration. In this report we present results of FDTD calculations for different cases of the interaction between ultrashort laser pulses and metal nanoparticles. Following our previous results, we analyze the efficiency of the GPU acceleration of the FDTD algorithm.
RMG An Open Source Electronic Structure Code for Multi-Petaflops Calculations
NASA Astrophysics Data System (ADS)
Briggs, Emil; Lu, Wenchang; Hodak, Miroslav; Bernholc, Jerzy
RMG (Real-space Multigrid) is an open source, density functional theory code for quantum simulations of materials. It solves the Kohn-Sham equations on real-space grids, which allows for natural parallelization via domain decomposition. Either subspace or Davidson diagonalization, coupled with multigrid methods, is used to accelerate convergence. RMG is a cross-platform open source package which has been used in the study of a wide range of systems, including semiconductors, biomolecules, and nanoscale electronic devices. It can optionally use GPU accelerators to improve performance on systems where they are available. The recently released versions (>2.0) support multiple GPUs per compute node, have improved performance and scalability, enhanced accuracy and support for additional hardware platforms. New versions of the code are regularly released at http://www.rmgdft.org. The releases include binaries for Linux, Windows and Macintosh systems, automated builds for clusters using cmake, as well as versions adapted to the major supercomputing installations and platforms. Several recent, large-scale applications of RMG will be discussed.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation onto optimal 3D thread blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used to fill the occupancy of each GPU with many replicas, providing a performance boost that is most noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised, and the extra work generated by the insertions is balanced across the GPUs independently of where the insertions were performed. Performance results show that the spin-level parallelism is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well in a weak-scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended the range of feasible simulations to sizes of L = 32, 64 on a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
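The replica-exchange step at the heart of such simulations reduces to a simple Metropolis test on the energies of neighboring replicas. A host-side sketch under assumed names (the per-replica energies are taken as already reduced on the GPU):

    #include <cmath>
    #include <random>

    // Accept a swap between replicas at inverse temperatures beta1, beta2 with
    // energies E1, E2, using the parallel-tempering Metropolis criterion
    // P = min(1, exp[(beta1 - beta2) * (E1 - E2)]).
    bool try_swap(double beta1, double beta2, double E1, double E2,
                  std::mt19937& gen)
    {
        double delta = (beta1 - beta2) * (E1 - E2);
        if (delta >= 0.0) return true;  // energetically favorable: always accept
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return u(gen) < std::exp(delta);
    }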
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics Processing Units (GPUs). Driven by the ever-growing demands of the video-games industry, the graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on the GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with limited memory, large input matrices are blockwise transferred from the system's main memory to the GPU's memory and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming on a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu.
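The classical Lee-Seung multiplicative updates that such NMF codes iterate are dominated by dense matrix products with an elementwise rescaling in between. A minimal sketch of that elementwise step as a CUDA kernel, under assumed names (the two GEMM products, e.g. WᵀV and WᵀWH, are taken as computed beforehand with a library such as cuBLAS; this is illustrative, not NMF-mGPU's code):

    // Lee-Seung multiplicative update for factor H, flattened to n elements:
    // H <- H * numer / denom, where numer = W^T V and denom = W^T W H.
    __global__ void nmf_multiplicative_update(float* H, const float* numer,
                                              const float* denom, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) H[i] *= numer[i] / (denom[i] + 1e-9f);  // eps avoids div by 0
    }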
NASA Astrophysics Data System (ADS)
Kluber, Alexander; Hayre, Robert; Cox, Daniel
2012-02-01
Motivated by the need to find beta-structure aggregation nuclei for polyQ diseases such as Huntington's, we have undertaken a search for length-dependent structure in model polyglutamine proteins. We use GPU-based AMBER11 molecular dynamics with the Onufriev-Bashford-Case (OBC) generalized Born implicit solvent and the parm96 force field, coupled with a replica-exchange method, to characterize monomeric strands of polyglutamine as a function of chain length and temperature. This force field and solvation method has been shown, among other methods, to accurately reproduce folded metastability in certain small peptides, and to yield accurate de novo folded structures for a protein that folds on the millisecond time scale. Using GPU molecular dynamics we can sample out into the microsecond range. Additionally, explicit solvent runs will be used to verify the results from the implicit solvent runs. We will assess order using measures of secondary structure and hydrogen bond content.
Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units.
Qi, Ruxi; Botello-Smith, Wesley M; Luo, Ray
2017-07-11
Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Poisson-Boltzmann equation (PBE)-based models have emerged as a widely used approach for modeling these important processes. Though great efforts have been put into developing efficient PBE numerical models, challenges still remain due to the high dimensionality of typical biomolecular systems. In this study, we implemented and analyzed commonly used linear PBE solvers on ever-improving graphics processing units (GPUs) for biomolecular simulations, including both standard and preconditioned conjugate gradient (CG) solvers with several alternative preconditioners. Our implementation utilizes the standard NVIDIA CUDA libraries cuSPARSE, cuBLAS, and CUSP. Extensive tests show that good numerical accuracy can be achieved even though single precision is often used for numerical applications on GPU platforms. The optimal GPU performance was observed with the Jacobi-preconditioned CG solver, with a significant speedup over the standard CG solver on CPU in our diversified test cases. Our analysis further shows that different matrix storage formats also considerably affect the efficiency of different linear PBE solvers on the GPU, with the diagonal format best suited for our standard finite-difference linear systems. Further efficiency may be possible with matrix-free operations and an integrated grid stencil setup specifically tailored for the banded matrices in PBE-specific linear systems.
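The diagonal (DIA) storage format mentioned above maps naturally onto banded finite-difference matrices: each stored diagonal becomes a contiguous array, and one thread handles one row. A minimal sketch of a DIA sparse matrix-vector product, the workhorse inside each CG iteration (illustrative names, not the authors' code):

    // y = A*x for A in DIA format: diags holds ndiags diagonals of length n
    // (column-major), offsets[d] is the signed offset of diagonal d.
    __global__ void spmv_dia(const float* diags, const int* offsets, int ndiags,
                             int n, const float* x, float* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        float sum = 0.0f;
        for (int d = 0; d < ndiags; ++d) {
            int col = row + offsets[d];
            if (col >= 0 && col < n)
                sum += diags[d * n + row] * x[col];
        }
        y[row] = sum;
    }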
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.
Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H
2017-01-01
Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, their practical adoption in clinical use is not straightforward due to the high computational cost of solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphics processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such an algorithm can be rationally implemented on the GPU are presented. The real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in the azimuth direction (128 scanlines), up to a scan depth of 5 cm (λ pixel axial spacing) for a slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to those derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
Detection and Tracking of Moving Targets Behind Cluttered Environments Using Compressive Sensing
NASA Astrophysics Data System (ADS)
Dang, Vinh Quang
Detection and tracking of moving targets (a target's motion, vibration, etc.) in cluttered environments have been receiving much attention in numerous applications, such as disaster search-and-rescue, law enforcement, and urban warfare. One popular technique is stepped-frequency continuous-wave radar, due to its low cost and complexity. However, stepped-frequency radar suffers from long data acquisition times. This dissertation focuses on detection and tracking of moving targets, and of the vibration rates of stationary targets, behind a cluttered medium such as a wall, using stepped-frequency radar enhanced by compressive sensing. The application of compressive sensing enables reconstruction of the target space using fewer random frequencies, which decreases the acquisition time. Hardware-accelerated parallelization on the GPU is investigated for the Orthogonal Matching Pursuit reconstruction algorithm. For simulation purposes, two hybrid methods have been developed to calculate the scattered fields from targets through the wall approaching the antenna system, and to convert the incoming fields into voltage signals at the terminals of the receive antenna. The first method is based on the plane-wave spectrum approach for calculating the scattered fields of targets behind the wall. It uses the Fast Multipole Method (FMM) to calculate scattered fields on a particular source plane, decomposes them into plane-wave components, and propagates the plane-wave spectrum through the wall by integrating wall transmission coefficients before reconstructing the fields on a desired observation plane. The second method allows one to calculate the complex output voltage at the terminals of a receiving antenna in a way that fully takes the antenna effects into account; it adopts the concept of the complex antenna factor from the Electromagnetic Compatibility (EMC) community.
Fukuda, Shinichi; Okuda, Kensuke; Kishino, Genichiro; Hoshi, Sujin; Kawano, Itsuki; Fukuda, Masahiro; Yamashita, Toshiharu; Beheregaray, Simone; Nagano, Masumi; Ohneda, Osamu; Nagasawa, Hideko; Oshika, Tetsuro
2016-12-01
Retinal hypoxia plays a crucial role in ocular neovascular diseases, such as diabetic retinopathy, retinopathy of prematurity, and retinal vascular occlusion. Fluorescein angiography is useful for identifying the extent of hypoxia by detecting non-perfusion areas or neovascularization, but its ability to detect early stages of hypoxia is limited. Recently, in vivo fluorescent probes for detecting hypoxia have been developed; however, these have not been extensively applied in ophthalmology. We evaluated whether a novel donor-excited photo-induced electron transfer (d-PeT) system based on an activatable hypoxia-selective near-infrared fluorescent (NIRF) probe (GPU-327) responds to both mild and severe hypoxia in animal models of various ocular ischemic diseases. Ocular fundus examination offers a unique opportunity for direct observation of the retina through the transparent cornea and lens. After injection of GPU-327 in mouse and rabbit models of various ocular hypoxic diseases, NIRF imaging of the ocular fundus can be performed noninvasively and easily using commercially available fundus cameras. To investigate the safety of GPU-327, electroretinograms were also recorded after GPU-327 and PBS injection. Fluorescence of GPU-327 increased under mild hypoxic conditions in vitro. GPU-327 also yielded an excellent signal-to-noise ratio without washout in in vivo experiments. By using the near-infrared region, GPU-327 enables imaging of deeper ischemia, such as the choroidal circulation. Additionally, electroretinograms showed that GPU-327 did not cause neurotoxicity. GPU-327 identified hypoxic areas both in vivo and in vitro.
Shi, Yulin; Veidenbaum, Alexander V; Nicolau, Alex; Xu, Xiangmin
2015-01-15
Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post hoc processing and analysis. Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab with GPU-enabled functions is used to develop programs that run on NVIDIA GPUs to harness their parallel computing power. We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU, with up to a 22× speedup depending on the computational task. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Together, GPU-enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision.
de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.
2014-01-01
Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPUs). Among these algorithms, the Firefly Algorithm (FA) is a recently proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with a multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms from the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms, is a more suitable choice and a relevant contribution to the variable selection problem. Additionally, the results also demonstrated that the FA-MLR executed on a GPU can be five times faster than its sequential implementation.
NASA Astrophysics Data System (ADS)
Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.
2013-03-01
The finite-difference time-domain (FDTD) method allows analysis of the electromagnetic field distribution as a function of time and space. The method is applied to analyze the near-field distribution of holographic volume gratings (HVGs) at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies greater memory and processing-time requirements. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by a plane wave for different angles of incidence, and absorbing boundaries are included as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for CPU and GPU respectively, in order to analyze the improvement from the new NVIDIA Fermi GPU architecture versus a highly tuned multi-core CPU as a function of simulation size. In particular, the optimized CPU implementation takes advantage of the arithmetic and data-transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code, and also of multi-threading by means of OpenMP directives. A good agreement between the results obtained using the FDTD and MM methods is obtained, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wide range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
An SDR-Based Real-Time Testbed for GNSS Adaptive Array Anti-Jamming Algorithms Accelerated by GPU
Xu, Hailong; Cui, Xiaowei; Lu, Mingquan
2016-01-01
Nowadays, software-defined radio (SDR) has become a common approach to evaluating new algorithms. However, in the field of Global Navigation Satellite System (GNSS) adaptive array anti-jamming, previous work has been limited by the high computational power demanded by adaptive algorithms, and often lacks flexibility and configurability. In this paper, the design and implementation of an SDR-based real-time testbed for GNSS adaptive array anti-jamming, accelerated by a Graphics Processing Unit (GPU), are documented. This testbed is a feature-rich and extensible platform with great flexibility and configurability, as well as high computational performance. Both Space-Time Adaptive Processing (STAP) and Space-Frequency Adaptive Processing (SFAP) are implemented with a wide range of parameters. Raw data from as many as eight antenna elements can be processed in real time in either an adaptive nulling or a beamforming mode. To fully exploit the parallelism provided by the GPU, a batched programming method is proposed. Tests and experiments are conducted to evaluate both the computational and the anti-jamming performance. This platform can be used for research and prototyping, as well as a real product in certain applications.
Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic
2015-03-01
Migrating EO/IR sensors to cloud-based infrastructure as service architectures
NASA Astrophysics Data System (ADS)
Berglie, Stephen T.; Webster, Steven; May, Christopher M.
2014-06-01
The Night Vision Image Generator (NVIG), a product of US Army RDECOM CERDEC NVESD, is a visualization tool used widely throughout Army simulation environments to provide fully attributed, synthesized, full-motion video using physics-based sensor and environmental effects. The NVIG relies heavily on contemporary hardware-based acceleration and GPU processing techniques, which push the envelope of both enterprise and commodity-level hypervisor support for providing virtual machines with direct access to hardware resources. The NVIG has successfully been integrated into fully virtual environments whose system architectures leverage cloud-based technologies to various extents in order to streamline infrastructure and service management. This paper details the challenges presented to engineers seeking to migrate GPU-bound processes, such as the NVIG, to virtual machines and, ultimately, cloud-based IaaS architectures. In addition, it presents the path that led to success for the NVIG. A brief overview of cloud-based infrastructure management tool sets is provided, and several virtual desktop solutions are outlined. A distinction is made between general-purpose virtual desktop technologies and technologies that expose GPU-specific capabilities, including direct rendering and hardware-based video encoding. Candidate hypervisor/virtual machine configurations that nominally satisfy the virtualized hardware-level GPU requirements of the NVIG are presented, and each is subsequently reviewed in light of its implications for higher-level cloud management techniques. Implementation details are included from the hardware level, through the operating system, to the 3D graphics APIs required by the NVIG and similar GPU-bound tools.
NASA Astrophysics Data System (ADS)
Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng
2018-02-01
De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, computational efficiency is still the main problem in producing 3D, high-resolution images from real large-scale seismic data. In this paper, we propose a division method for large-scale 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPUs). We then design an imaging-point parallel strategy to achieve optimal parallel computing performance, and adopt an asynchronous double-buffering scheme with multiple streams to overlap GPU/CPU computation and data transfer. Moreover, several key computation and storage optimization strategies based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the key optimization steps, including thread optimization, shared memory optimization, register optimization and use of special function units (SFUs), greatly improved the efficiency. A numerical example employing real large-scale 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.
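The asynchronous double-buffering pattern mentioned above is a standard CUDA idiom: two device buffers and two streams let the copy engine upload chunk c+1 while the SMs process chunk c. A minimal sketch under assumed names (process_chunk is a hypothetical kernel defined elsewhere; the host buffer must be pinned for the copies to overlap; error checks omitted):

    #include <cuda_runtime.h>

    __global__ void process_chunk(float* data, int n);  // hypothetical kernel

    void pipeline(const float* h_in, float* d_buf[2], int nchunks,
                  int chunk_elems, cudaStream_t s[2])
    {
        int block = 256;
        int grid = (chunk_elems + block - 1) / block;
        for (int c = 0; c < nchunks; ++c) {
            int b = c & 1;  // ping-pong between the two buffers/streams
            cudaMemcpyAsync(d_buf[b], h_in + (size_t)c * chunk_elems,
                            chunk_elems * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<grid, block, 0, s[b]>>>(d_buf[b], chunk_elems);
        }
        cudaDeviceSynchronize();  // wait for both streams to drain
    }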
NASA Astrophysics Data System (ADS)
Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.
2015-09-01
We present Montblanc, a GPU implementation of the Radio Interferometer Measurement Equation (RIME) in support of the Bayesian Inference for Radio Observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ² values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ² calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ² values, with modified model parameters transferred to the GPU between iterations. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture, making it easy to extend and to implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: on an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components, it is 7.7 and 12 times faster on the same K40 for single and double precision floating point, respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that it is balanced between compute and memory, as much of the data required by the problem is retained in the L1 and L2 caches.
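Because each visibility contributes an independent term to χ², the likelihood evaluation parallelizes trivially: one thread per visibility computes wᵢ·|V_model,i - V_obs,i|², and a standard reduction sums the terms. A minimal sketch of that per-visibility kernel (illustrative, not Montblanc's actual kernel):

    // One thread per visibility: terms[i] = w_i * |V_model,i - V_obs,i|^2.
    // The final chi-squared is the sum of terms (e.g., via thrust::reduce).
    __global__ void chi2_terms(const float2* model, const float2* obs,
                               const float* w, float* terms, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float dr = model[i].x - obs[i].x;   // real part residual
        float di = model[i].y - obs[i].y;   // imaginary part residual
        terms[i] = w[i] * (dr * dr + di * di);
    }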
Igarashi, Jun; Shouno, Osamu; Fukai, Tomoki; Tsujino, Hiroshi
2011-11-01
Real-time simulation of a biologically realistic spiking neural network is necessary for evaluating its capacity to interact with real environments. However, real-time simulation of such a neural network is difficult due to high computational costs that arise from two factors: (1) vast network size and (2) the complicated dynamics of biologically realistic neurons. In order to address these problems, mainly the latter, we chose general purpose computing on graphics processing units (GPGPU) for the simulation of such a network, taking advantage of the powerful computational capability of a graphics processing unit (GPU). As a target for real-time simulation, we used a model of the basal ganglia that has been developed according to electrophysiological and anatomical knowledge. The model consists of heterogeneous populations of 370 spiking model neurons, including computationally heavy conductance-based models, connected by 11,002 synapses. Simulation of the model had not previously been performed in real time using a general computing server. By parallelizing the model on an NVIDIA GeForce GTX 280 GPU in data-parallel and task-parallel fashion, faster-than-real-time simulation was robustly realized using only one-third of the GPU's total computational resources. Furthermore, we used the GPU's full computational resources to perform faster-than-real-time simulation of three instances of the basal ganglia model; these instances consisted of 1100 neurons and 33,006 synapses in total and were synchronized at each calculation step. Finally, we developed software for simultaneous visualization of the faster-than-real-time simulation output. These results suggest the potential power of GPGPU techniques for real-time simulation of realistic neural networks.
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods
Smith, David S.; Gore, John C.; Yankeelov, Thomas E.; Welch, E. Brian
2012-01-01
Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more time-consuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable real-time CS reconstruction of even acquisition data matrices of dimension 4096² or more, depending on available GPU VRAM. Reconstruction of two-dimensional data matrices of dimension 1024² and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for small-to-moderate size images.
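The inner loop of a split Bregman solver alternates a linear solve with an elementwise shrinkage (soft-thresholding) of the sparse-transform coefficients, and it is this embarrassingly parallel shrinkage step that maps most directly onto the GPU. A minimal sketch (illustrative, not the authors' code):

    // Elementwise shrinkage: x_i <- sign(x_i) * max(|x_i| - tau, 0).
    __global__ void soft_threshold(float* x, float tau, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float v = x[i];
        x[i] = copysignf(fmaxf(fabsf(v) - tau, 0.0f), v);
    }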
Stochastic DT-MRI connectivity mapping on the GPU.
McGraw, Tim; Nadar, Mariappan
2007-01-01
We present a method for stochastic fiber tract mapping from diffusion tensor MRI (DT-MRI) implemented on graphics hardware. From the simulated fibers we compute a connectivity map that gives an indication of the probability that two points in the dataset are connected by a neuronal fiber path. A Bayesian formulation of the fiber model is given and it is shown that the inversion method can be used to construct plausible connectivity. An implementation of this fiber model on the graphics processing unit (GPU) is presented. Since the fiber paths can be stochastically generated independently of one another, the algorithm is highly parallelizable. This allows us to exploit the data-parallel nature of the GPU fragment processors. We also present a framework for the connectivity computation on the GPU. Our implementation allows the user to interactively select regions of interest and observe the evolving connectivity results during computation. Results are presented from the stochastic generation of over 250,000 fiber steps per iteration at interactive frame rates on consumer-grade graphics hardware.
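Since each streamline evolves independently, one thread can advance one fiber. The 2007 implementation ran on fragment processors; the following is a modern CUDA rendering of the same idea, with cuRAND supplying the per-step perturbation (the direction-field lookup is simplified to an illustrative per-thread input, not the paper's actual sampling scheme):

    #include <curand_kernel.h>

    // One thread advances one streamline by a single step: perturb the local
    // principal diffusion direction with Gaussian noise, renormalize, and move.
    __global__ void fiber_step(float3* pos, const float3* principal_dir,
                               curandState* rng, float step, float sigma, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 d = principal_dir[i];  // a real tracker samples the field at pos[i]
        d.x += sigma * curand_normal(&rng[i]);
        d.y += sigma * curand_normal(&rng[i]);
        d.z += sigma * curand_normal(&rng[i]);
        float inv = rsqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
        pos[i].x += step * d.x * inv;
        pos[i].y += step * d.y * inv;
        pos[i].z += step * d.z * inv;
    }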
Methods for compressible fluid simulation on GPUs using high-order finite differences
NASA Astrophysics Data System (ADS)
Pekkilä, Johannes; Väisälä, Miikka S.; Käpylä, Maarit J.; Käpylä, Petri J.; Anjum, Omer
2017-08-01
We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates 343 million grid points per second on a Tesla K40t GPU, achieving a 3.6× speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of 168 million updates per second.
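High-order stencils like these are built from long centered differences; the sixth-order first derivative, for example, uses the standard (45, -9, 1)/60 weights. A direct, unblocked CUDA sketch of that building block along x, before any of the blocking optimizations the paper studies (illustrative, not the authors' kernel):

    // Sixth-order central first derivative along x, interior points only:
    // df/dx = [45(f[+1]-f[-1]) - 9(f[+2]-f[-2]) + (f[+3]-f[-3])] / (60 dx).
    __global__ void ddx6(const float* f, float* df,
                         int nx, int ny, int nz, float inv_dx)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x < 3 || x >= nx - 3 || y >= ny || z >= nz) return;
        int i = (z * ny + y) * nx + x;
        df[i] = inv_dx * (45.0f * (f[i + 1] - f[i - 1])
                        -  9.0f * (f[i + 2] - f[i - 2])
                        +         (f[i + 3] - f[i - 3])) / 60.0f;
    }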
GPU-accelerated simulations of isolated black holes
NASA Astrophysics Data System (ADS)
Lewis, Adam G. M.; Pfeiffer, Harald P.
2018-05-01
We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.
Real time ray tracing based on shader
NASA Astrophysics Data System (ADS)
Gui, JiangHeng; Li, Min
2017-07-01
Ray tracing is a rendering algorithm that generates an image by tracing light paths through an image plane; it can simulate complicated optical phenomena such as refraction, depth of field and motion blur. Compared with rasterization, ray tracing can achieve more realistic rendering results, but at a much greater computational cost: even simple scenes can take a very long time to render. With the improvement of GPU performance and the advent of the programmable rendering pipeline, complicated algorithms can now be implemented directly in shaders. This paper therefore proposes a new method that implements ray tracing directly in the fragment shader, mainly comprising surface intersection, importance sampling and progressive rendering. With the help of the GPU's powerful throughput, it achieves real-time rendering of simple scenes.
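The per-pixel work such a fragment-shader tracer performs begins with primitive intersection tests. A minimal ray-sphere test, written here as CUDA device code to keep this document's examples in one language (a GLSL version is line-for-line analogous):

    // Returns true and the nearest hit distance t if the ray (ro + t*rd) meets
    // the sphere of center c and radius r; rd is assumed normalized.
    __device__ bool hit_sphere(float3 ro, float3 rd, float3 c, float r, float* t)
    {
        float3 oc = make_float3(ro.x - c.x, ro.y - c.y, ro.z - c.z);
        float b = oc.x * rd.x + oc.y * rd.y + oc.z * rd.z;
        float cc = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - r * r;
        float disc = b * b - cc;          // quadratic discriminant (a = 1)
        if (disc < 0.0f) return false;    // ray misses the sphere
        *t = -b - sqrtf(disc);            // nearest of the two roots
        return *t > 0.0f;                 // hit must lie in front of the origin
    }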
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Fiorini, M.; Frezza, O.; Lonardo, A.; Lamanna, G.; Lo Cicero, F.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Tosoratto, L.; Vicini, P.
2016-03-01
A GPU-based low-level (L0) trigger is currently integrated in the experimental setup of the RICH detector of the NA62 experiment to assess the feasibility of building more refined physics-related trigger primitives and thus improving the trigger's discriminating power. To ensure real-time operation of the system, a dedicated data transport mechanism has been implemented: an FPGA-based Network Interface Card (NaNet-10) receives data from the detectors and forwards them with low, predictable latency to the memory of the GPU performing the trigger algorithms. Results of the reconstruction of ring-shaped hit patterns will be reported and discussed.
Kim, Jeongnim; Baczewski, Andrew T.; Beaudet, Todd D.; ...
2018-04-19
QMCPACK is an open source quantum Monte Carlo package for ab-initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real-space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater-Jastrow type trial wave functions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital-space auxiliary field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit (CPU) and graphical processing unit (GPU) systems. We detail the program's capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://www.qmcpack.org.
Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D
2008-08-01
Readily available temporal or time-series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion-corrected image reconstruction. Due to long computation times, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is well suited to highly parallelized computing. However, traditional general purpose computation on the GPU has been limited by the constraints of the available programming platforms. In addition, compared to the CPU, the GPU currently has less dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the CUDA programming language. The GPU performance is compared with single-threaded and multithreaded CPU implementations on an Intel dual-core 2.4 GHz CPU using the C programming language. CUDA provides a C-like programming interface and allows direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation times for 100 iterations in the range of 1.8-13.5 s were observed for the GPU, with image sizes ranging from 2.0 × 10⁶ to 14.2 × 10⁶ pixels. The GPU registration was 55-61 times faster than the CPU for the single-threaded implementation, and 34-39 times faster for the multithreaded implementation. For CPU-based computing, the computational time generally depends linearly on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixel per iteration (TPMI), with units of seconds per megapixel per iteration (spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI: the mean TPMIs were 0.527 spmi and 0.335 spmi for the single-threaded and multithreaded cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI = 0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU-based real-time DIR opens up a host of clinical applications for medical imaging.
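The demons algorithm itself is voxelwise and therefore maps cleanly onto one GPU thread per voxel. A minimal sketch of the classic Thirion force update (illustrative names: s is the static image, m the warped moving image, grad_s the precomputed static-image gradient; the subsequent Gaussian smoothing of the displacement field is omitted):

    // Thirion demons force per voxel:
    // u = (m - s) * grad(s) / (|grad(s)|^2 + (m - s)^2).
    __global__ void demons_update(const float* s, const float* m,
                                  const float3* grad_s, float3* u, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float diff = m[i] - s[i];
        float3 g = grad_s[i];
        float denom = g.x * g.x + g.y * g.y + g.z * g.z + diff * diff;
        float scale = (denom > 1e-9f) ? diff / denom : 0.0f;  // guard flat regions
        u[i].x = scale * g.x;
        u[i].y = scale * g.y;
        u[i].z = scale * g.z;
    }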
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Y; Tian, Z
Purpose: Monte Carlo (MC) simulation is an important tool for solving radiotherapy and medical imaging problems, but low computational efficiency hinders its wide application. Conventionally, MC is performed in a particle-by-particle fashion. The lack of control over particle trajectories is a main cause of low efficiency in some applications: in cone beam CT (CBCT) projection simulation, for example, a significant amount of computation is wasted transporting photons that never reach the detector. To solve this problem, we propose an innovative MC simulation scheme with a path-by-path sampling method. Methods: Consider a photon path starting at the x-ray source. After going through a set of interactions, it ends at the detector. In the proposed scheme, we sampled an entire photon path each time. The Metropolis-Hastings algorithm was employed to accept or reject a sampled path based on a calculated acceptance probability, in order to maintain the correct relative probabilities among different paths, which are governed by photon transport physics. We developed a package, gMMC, on GPU with this new scheme implemented. The performance of gMMC was tested in a sample problem of CBCT projection simulation for a homogeneous object. The results were compared to those obtained using gMCDRR, a GPU-based MC tool with the conventional particle-by-particle simulation scheme. Results: Calculated scattered-photon signals in gMMC agreed with those from gMCDRR to within a relative difference of 3%. It took 3.1 h for gMCDRR to simulate 7.8e11 photons and 246.5 s for gMMC to simulate 1.4e10 paths. Under this setting, both results attained the same ~2% statistical uncertainty; hence, a speed-up factor of ~45.3 was achieved by the new path-by-path simulation scheme, in which all the computation is spent on photons contributing to the detector signal. Conclusion: We proposed a novel path-by-path simulation scheme that enables a significant efficiency enhancement for MC particle transport simulations.
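The accept/reject step that keeps path probabilities in their physically correct ratios is the standard Metropolis-Hastings rule: accept a proposed path with probability min(1, [p(new)·q(old|new)] / [p(old)·q(new|old)]). A host-side sketch under assumed names (the path and proposal probabilities are taken as precomputed inputs; illustrative, not gMMC's code):

    #include <cmath>
    #include <random>

    // p_new, p_old: target (transport-physics) probabilities of the two paths;
    // q_fwd: proposal density of new given old; q_rev: of old given new.
    bool accept_path(double p_new, double p_old, double q_fwd, double q_rev,
                     std::mt19937& gen)
    {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double a = (p_new * q_rev) / (p_old * q_fwd);  // Hastings ratio
        return u(gen) < std::fmin(1.0, a);
    }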
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aliaga, José I.; Alonso, Pedro; Badía, José M.
We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages: the original problem is first reduced into a simpler band-structured form by means of a high-performance, compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousand degrees of freedom, and when the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.
Blakely, Timothy; Ojemann, Jeffrey G.; Rao, Rajesh P.N.
2014-01-01
Background: Electrocorticography (ECoG) signals can provide high spatio-temporal resolution and high signal-to-noise ratio recordings of local neural activity from the surface of the brain. Previous studies have shown that broad-band, spatially focal, high-frequency increases in ECoG signals are highly correlated with movement and other cognitive tasks and can be volitionally modulated. However, significant additional information may be present in inter-electrode interactions, although adding higher-order inter-electrode interactions can be computationally impractical, if not impossible. New method: In this paper we present a new method of calculating high-frequency interactions between electrodes called Short-Time Windowed Covariance (STWC) that builds on mathematical techniques currently used in neural signal analysis, along with an implementation that accelerates the algorithm by orders of magnitude by leveraging commodity, off-the-shelf graphics processing unit (GPU) hardware. Results: Using the hardware-accelerated implementation of STWC, we identify many types of event-related inter-electrode interactions from human ECoG recordings on global and local scales that have not been identified by previous methods. Unique temporal patterns are observed for digit flexion in both low-resolution (10 mm spacing) and high-resolution (3 mm spacing) electrode arrays. Comparison with existing methods: Covariance is a commonly used metric for identifying correlated signals, but standard covariance calculations do not allow for temporally varying covariance; in contrast, STWC identifies event-driven changes in covariance without flagging spurious noise correlations. Conclusions: STWC can be used to identify event-related neural interactions, and its high computational load is well suited to GPU capabilities.
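For a single electrode pair, the short-time windowed covariance is just the sample covariance recomputed at every window offset, which parallelizes naturally with one thread per window position. A direct, unoptimized CUDA sketch for one pair of signals (illustrative names, not the authors' implementation):

    // out[t] = sample covariance of a[t..t+W-1] and b[t..t+W-1].
    __global__ void stwc_pair(const float* a, const float* b, float* out,
                              int nsamples, int W)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t + W > nsamples) return;
        float ma = 0.0f, mb = 0.0f;
        for (int k = 0; k < W; ++k) { ma += a[t + k]; mb += b[t + k]; }
        ma /= W; mb /= W;                       // window means
        float c = 0.0f;
        for (int k = 0; k < W; ++k)
            c += (a[t + k] - ma) * (b[t + k] - mb);
        out[t] = c / (W - 1);                   // unbiased normalization
    }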
Paudel, Moti R; Kim, Anthony; Sarfehnia, Arman; Ahmad, Sayed B; Beachey, David J; Sahgal, Arjun; Keller, Brian M
2016-11-08
A new GPU-based Monte Carlo dose calculation algorithm (GPUMCD), developed by the vendor Elekta for the Monaco treatment planning system (TPS), is capable of modeling dose for both a standard linear accelerator and an Elekta MRI linear accelerator. We have experimentally evaluated this algorithm for a standard Elekta Agility linear accelerator. A beam model was developed in the Monaco TPS (research version 5.09.06) using the commissioned beam data for a 6 MV Agility linac. A heterogeneous phantom representing several scenarios - tumor-in-lung, lung, and bone-in-tissue - was designed and built. Dose calculations in Monaco were done using both the current clinical Monte Carlo algorithm, XVMC, and the new GPUMCD algorithm. Dose calculations in a Pinnacle TPS were also produced using the collapsed cone convolution (CCC) algorithm with heterogeneity correction. Calculations were compared with the measured doses using an ionization chamber (A1SL) and Gafchromic EBT3 films for 2 × 2 cm², 5 × 5 cm², and 10 × 10 cm² field sizes. The percentage depth doses (PDDs) calculated by XVMC and GPUMCD in a homogeneous solid water phantom were within 2%/2 mm of film measurements and within 1% of ion chamber measurements. For the tumor-in-lung phantom, the calculated doses were within 2.5%/2.5 mm of film measurements for GPUMCD. For the lung phantom, doses calculated by all of the algorithms were within 3%/3 mm of film measurements, except for the 2 × 2 cm² field size, where the CCC algorithm underestimated the depth dose by ~5% over a larger extent of the lung region. For the bone phantom, all of the algorithms were equivalent and calculated dose to within 2%/2 mm of film measurements, except at the interfaces. Both GPUMCD and XVMC showed interface effects, which were more pronounced for GPUMCD and were comparable to film measurements, whereas the CCC algorithm captured these effects poorly.
Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters
NASA Astrophysics Data System (ADS)
Esler, Kenneth
2011-03-01
Quantum Monte Carlo (QMC) has proved to be an invaluable tool for predicting the properties of matter from fundamental principles, combining very high accuracy with extreme parallel scalability. By solving the many-body Schrödinger equation through a stochastic projection, it achieves greater accuracy than mean-field methods and better scaling with system size than quantum chemical methods, enabling scientific discovery across a broad spectrum of disciplines. In recent years, graphics processing units (GPUs) have provided a high-performance and low-cost new approach to scientific computing, and GPU-based supercomputers are now among the fastest in the world. The multiple forms of parallelism afforded by QMC algorithms make the method an ideal candidate for acceleration in the many-core paradigm. We present the results of porting the QMCPACK code to run on GPU clusters using the NVIDIA CUDA platform. Using mixed precision on GPUs and MPI for intercommunication, we observe typical full-application speedups of approximately 10x to 15x relative to quad-core CPUs alone, while reproducing the double-precision CPU results within statistical error. We discuss the algorithm modifications necessary to achieve good performance on this heterogeneous architecture and present the results of applying our code to molecules and bulk materials. Supported by the U.S. DOE under Contract No. DOE-DE-FG05-08OR23336 and by the NSF under No. 0904572.
Fast skin dose estimation system for interventional radiology.
Takata, Takeshi; Kotoku, Jun'ichi; Maejima, Hideyuki; Kumagai, Shinobu; Arai, Norikazu; Kobayashi, Takenori; Shiraishi, Kenshiro; Yamamoto, Masayoshi; Kondo, Hiroshi; Furui, Shigeru
2018-03-01
To minimise the radiation dermatitis related to interventional radiology (IR), rapid and accurate dose estimation has been sought for all procedures. We propose a technique for estimating the patient skin dose rapidly and accurately using Monte Carlo (MC) simulation with a graphics processing unit (GPU, GTX 1080; Nvidia Corp.). The skin dose distribution is simulated for fluoroscopic conditions based on an individual patient's computed tomography (CT) dataset, after the dataset has been segmented into air, water and bone based on pixel values. The skin is assumed to be one layer at the outer surface of the body. Fluoroscopic conditions are obtained from the log file of a fluoroscopic examination. Estimating the absorbed skin dose distribution requires calibration of the dose simulated by our system; for this purpose, a linear function was used to approximate the relation between the simulated dose and the dose measured with radiophotoluminescence (RPL) glass dosimeters in a water-equivalent phantom. Differences in maximum skin dose between our system and the Particle and Heavy Ion Transport code System (PHITS) were as high as 6.1%. The relative statistical error (2σ) for the dose simulated by our system was ≤3.5%. Using a GPU, the simulation on a chest CT dataset targeting the heart completed within 3.49 s on average: the GPU is 122 times faster than a CPU (Core i7-7700K; Intel Corp.). Our system (using the GPU, the log file, and the CT dataset) estimated the skin dose more rapidly and more accurately than conventional methods.
An improved non-uniformity correction algorithm and its GPU parallel implementation
NASA Astrophysics Data System (ADS)
Cheng, Kuanhong; Zhou, Huixin; Qin, Hanlin; Zhao, Dong; Qian, Kun; Rong, Shenghui
2018-05-01
The performance of the SLP-THP based non-uniformity correction algorithm is strongly affected by the result of the SLP filter, which can lead to image blurring and ghosting artifacts. To address this problem, an improved SLP-THP based non-uniformity correction method with a curvature constraint was proposed, together with a new way to estimate the spatial low-frequency component. First, the details and contours of the input image were obtained by minimizing the local Gaussian curvature and the mean curvature of the image surface, respectively. Then, a guided filter was used to combine these two parts into an estimate of the spatial low-frequency component. Finally, this SLP component was fed into the SLP-THP method to achieve non-uniformity correction. The performance of the proposed algorithm was verified on several real and simulated infrared image sequences. The experimental results indicate that the proposed algorithm reduces the non-uniformity without losing detail. In addition, a GPU-based parallel implementation that runs 150 times faster than the CPU version was presented, showing that the proposed algorithm has great potential for real-time application.
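Whatever filter produces the spatial low-frequency (SLP) estimate, the correction itself is a per-pixel operation and therefore a natural GPU kernel. The sketch below shows one simplified reading of that final step, removing the SLP component and restoring the frame mean; the residual-plus-mean form, the frame size, and all names are assumptions for illustration, not the authors' exact update.

#include <cuda_runtime.h>

// Final correction step: remove the estimated spatial low-frequency
// (SLP) component, which carries the fixed-pattern non-uniformity, and
// add back the frame mean so overall brightness is preserved.
__global__ void applyNUC(const float* frame, const float* slp,
                         float* corrected, float frameMean, int nPix)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPix)
        corrected[i] = frame[i] - slp[i] + frameMean;
}

int main()
{
    const int nPix = 640 * 512;                 // hypothetical IR frame size
    float *dFrame, *dSlp, *dOut;
    cudaMalloc(&dFrame, nPix * sizeof(float));
    cudaMalloc(&dSlp, nPix * sizeof(float));
    cudaMalloc(&dOut, nPix * sizeof(float));

    applyNUC<<<(nPix + 255) / 256, 256>>>(dFrame, dSlp, dOut, 0.5f, nPix);
    cudaDeviceSynchronize();

    cudaFree(dFrame); cudaFree(dSlp); cudaFree(dOut);
    return 0;
}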
[Accurate 3D free-form registration between fan-beam CT and cone-beam CT].
Liang, Yueqiang; Xu, Hongbing; Li, Baosheng; Li, Hongsheng; Yang, Fujun
2012-06-01
Because of X-ray scatter, the CT numbers in cone-beam CT do not correspond exactly to electron densities. This results in registration error when an intensity-based registration algorithm is used to register planning fan-beam CT and cone-beam CT. In order to reduce this registration error, we have developed an accurate gradient-based registration algorithm. The gradient-based deformable registration problem is formulated as the minimization of an energy functional. Through the calculus of variations and a Gauss-Seidel finite-difference method, we derived the iterative formula of the deformable registration. The algorithm was implemented on the GPU through the OpenCL framework, which greatly reduced the registration time. Our experimental results showed that the proposed gradient-based registration algorithm registers clinical cone-beam CT and fan-beam CT images more accurately than the intensity-based algorithm. The GPU-accelerated algorithm meets the real-time requirement of online adaptive radiotherapy.
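The abstract does not reproduce the energy functional, but a typical gradient-based formulation, written here as an assumption for concreteness, penalizes the mismatch between the image-gradient fields of the two scans plus a smoothness term on the deformation field u:

\[
E(\mathbf{u}) = \int_{\Omega} \left\| \nabla I_{\mathrm{CBCT}}(\mathbf{x} + \mathbf{u}(\mathbf{x})) - \nabla I_{\mathrm{FBCT}}(\mathbf{x}) \right\|^{2} \, \mathrm{d}\mathbf{x} \;+\; \alpha \int_{\Omega} \left\| \nabla \mathbf{u}(\mathbf{x}) \right\|^{2} \, \mathrm{d}\mathbf{x}.
\]

Because the matching term is driven by gradients rather than raw intensities, the scatter-induced offset between cone-beam and fan-beam CT numbers largely cancels; setting the first variation of E to zero yields the Euler-Lagrange equations whose Gauss-Seidel finite-difference solution the authors describe.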
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection on various software platforms. Traditional approaches that involve only central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphics processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we propose a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that for random payload traffic, the matching speed of the proposed algorithm was 3.4 and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
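The division of labour described above hinges on a pre-filter cheap enough for the CPU to run at line rate yet safe, in the sense that it never discards a packet that could match a signature. A minimal sketch of one such filter, a bitmap over the first two bytes of every signature, is given below; the data structure and all names are illustrative assumptions, not the published HPMA pre-filter.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical pre-filter: a 64K-entry bitmap indexed by the first two
// bytes of every signature. A packet whose payload never hits a set
// entry cannot match any signature, so the CPU can drop it without
// paying the cost of a transfer to the GPU.
struct PreFilter {
    std::vector<bool> bitmap = std::vector<bool>(65536, false);

    void addSignature(const std::string& sig) {
        if (sig.size() >= 2)
            bitmap[((uint8_t)sig[0] << 8) | (uint8_t)sig[1]] = true;
    }
    bool maybeMalicious(const uint8_t* payload, size_t len) const {
        for (size_t i = 0; i + 1 < len; ++i)
            if (bitmap[(payload[i] << 8) | payload[i + 1]])
                return true;   // candidate: forward to GPU for full matching
        return false;          // provably clean: skip the GPU entirely
    }
};

int main()
{
    PreFilter pf;
    pf.addSignature("attack");
    const char* pkt = "GET /index.html ... attack payload ...";
    bool toGpu = pf.maybeMalicious((const uint8_t*)pkt, strlen(pkt));
    printf("forward to GPU: %s\n", toGpu ? "yes" : "no");
    return 0;
}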
Data assimilation using a GPU accelerated path integral Monte Carlo approach
NASA Astrophysics Data System (ADS)
Quinn, John C.; Abarbanel, Henry D. I.
2011-09-01
The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the method on an example in which the transmembrane voltage time series of a simulated neuron serves as input to a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300 compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.
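Monte Carlo evaluation of such path integrals parallelizes naturally by running many independent Metropolis chains, one per GPU thread, and averaging their samples. The toy sketch below uses a stand-in quadratic action rather than the Hodgkin-Huxley model-error action of the paper; the chain layout, proposal width, and sizes are all assumptions.

#include <cuda_runtime.h>
#include <curand_kernel.h>

// Stand-in action: a quadratic well, so the target density is a
// Gaussian. The real application would evaluate the model-error and
// measurement-error terms of the path integral here.
__device__ float action(float x) { return 0.5f * x * x; }

// One independent Metropolis chain per thread; each thread accumulates
// the mean of its samples as a toy "expectation over paths".
__global__ void metropolis(float* means, int nSteps, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);

    float x = 0.0f, sum = 0.0f;
    for (int s = 0; s < nSteps; ++s) {
        float prop = x + 0.5f * (2.0f * curand_uniform(&rng) - 1.0f);
        if (curand_uniform(&rng) < expf(action(x) - action(prop)))
            x = prop;                       // Metropolis accept step
        sum += x;
    }
    means[tid] = sum / nSteps;
}

int main()
{
    const int nChains = 4096, nSteps = 10000;   // hypothetical sizes
    float* dMeans;
    cudaMalloc(&dMeans, nChains * sizeof(float));
    metropolis<<<nChains / 256, 256>>>(dMeans, nSteps, 1234ULL);
    cudaDeviceSynchronize();
    cudaFree(dMeans);
    return 0;
}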
Viscoelastic Finite Difference Modeling Using Graphics Processing Units
NASA Astrophysics Data System (ADS)
Fabien-Ouellet, G.; Gloaguen, E.; Giroux, B.
2014-12-01
Full waveform seismic modeling requires a huge amount of computing power that still challenges today's technology. This limits the applicability of powerful processing approaches in seismic exploration such as full-waveform inversion. This paper explores the use of graphics processing units (GPUs) to compute a time-based finite-difference solution to the viscoelastic wave equation. The aim is to investigate whether adopting GPU technology can significantly reduce the computing time of simulations. The code presented herein is based on the freely accessible 2D software of Bohlen (2002), provided under the GNU General Public License (GPL). The implementation uses a second-order centred-difference scheme in time and staggered-grid schemes with centred differences of order 2, 4, 6, 8, and 12 for the spatial derivatives. The code is fully parallel and is written using the Message Passing Interface (MPI); it thus supports simulations of vast seismic models on a cluster of CPUs. To port the code of Bohlen (2002) to GPUs, the OpenCL framework was chosen for its ability to work on both CPUs and GPUs and its adoption by most GPU manufacturers. In our implementation, OpenCL works in conjunction with MPI, which allows computations on a cluster of GPUs for large-scale model simulations. We tested our code for model sizes between 100² and 6000² elements. Comparison shows a decrease in computation time of more than two orders of magnitude between the GPU implementation run on an AMD Radeon HD 7950 and the CPU implementation run on a 2.26 GHz Intel Xeon Quad-Core. The speed-up varies with the order of the finite-difference approximation and generally increases for higher orders. Speed-ups also increase with model size, which can be explained by kernel overheads and the delays introduced by memory transfers to and from the GPU through the PCI-E bus being amortised over more computation. These tests indicate that the GPU memory size and the slow memory transfers are the limiting factors of our GPU implementation. These results show the benefits of using GPUs instead of CPUs for time-based finite-difference seismic simulations. The reductions in computation time and in hardware costs are significant and open the door to new approaches in seismic inversion.
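On a staggered grid with centred differences, each velocity component is updated from spatial derivatives of the neighbouring stress fields, so one work-item per grid node suffices and memory access stays regular. The authors used OpenCL; purely for illustration, the sketch below writes the second-order 2-D vx update as a CUDA kernel, omitting the viscoelastic memory variables of the real code, with sizes and initialisation simplified.

#include <cuda_runtime.h>

// Second-order staggered-grid update of the horizontal particle
// velocity vx from the stresses sxx (normal) and sxy (shear). One
// thread per interior grid node; rho is density, dt the time step,
// dx/dz the grid spacings.
__global__ void updateVx(float* vx, const float* sxx, const float* sxy,
                         const float* rho, float dt, float dx, float dz,
                         int nx, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index
    int k = blockIdx.y * blockDim.y + threadIdx.y;   // z index
    if (i < 1 || i >= nx || k < 1 || k >= nz) return;

    int id = k * nx + i;
    float dsxx_dx = (sxx[id] - sxx[id - 1]) / dx;    // d(sxx)/dx
    float dsxy_dz = (sxy[id] - sxy[id - nx]) / dz;   // d(sxy)/dz
    vx[id] += dt / rho[id] * (dsxx_dx + dsxy_dz);
}

int main()
{
    const int nx = 1000, nz = 1000;                  // hypothetical model
    size_t bytes = (size_t)nx * nz * sizeof(float);
    float *vx, *sxx, *sxy, *rho;
    cudaMalloc(&vx, bytes);  cudaMalloc(&sxx, bytes);
    cudaMalloc(&sxy, bytes); cudaMalloc(&rho, bytes);
    // rho and the stresses must be initialised before a real run;
    // that bookkeeping is omitted in this sketch.
    dim3 block(16, 16), grid((nx + 15) / 16, (nz + 15) / 16);
    updateVx<<<grid, block>>>(vx, sxx, sxy, rho, 1e-4f, 1.f, 1.f, nx, nz);
    cudaDeviceSynchronize();
    cudaFree(vx); cudaFree(sxx); cudaFree(sxy); cudaFree(rho);
    return 0;
}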
Yan, Hao; Wang, Xiaoyu; Shi, Feng; Bai, Ti; Folkerts, Michael; Cervino, Laura; Jiang, Steve B.; Jia, Xun
2014-01-01
Purpose: Compressed sensing (CS)-based iterative reconstruction (IR) techniques are able to reconstruct cone-beam CT (CBCT) images from undersampled noisy data, allowing for imaging dose reduction. However, there are a few practical concerns preventing the clinical implementation of these techniques. On the image quality side, data truncation along the superior–inferior direction under the cone-beam geometry produces severe cone artifacts in the reconstructed images. Ring artifacts are also seen in the half-fan scan mode. On the reconstruction efficiency side, the long computation time hinders clinical use in image-guided radiation therapy (IGRT). Methods: Image quality improvement methods are proposed to mitigate the cone and ring image artifacts in IR. The basic idea is to use weighting factors in the IR data fidelity term to improve projection data consistency with the reconstructed volume. In order to improve the computational efficiency, a multiple graphics processing units (GPUs)-based CS-IR system was developed. The parallelization scheme, detailed analyses of computation time at each step, their relationship with image resolution, and the acceleration factors were studied. The whole system was evaluated in various phantom and patient cases. Results: Ring artifacts can be mitigated by properly designing a weighting factor as a function of the spatial location on the detector. As for the cone artifact, without applying a correction method, it contaminated 13 out of 80 slices in a head-neck case (full-fan). Contamination was even more severe in a pelvis case under half-fan mode, where 36 out of 80 slices were affected, leading to poorer soft tissue delineation and reduced superior–inferior coverage. The proposed method effectively corrects those contaminated slices with mean intensity differences compared to FDK results decreasing from ∼497 and ∼293 HU to ∼39 and ∼27 HU for the full-fan and half-fan cases, respectively. In terms of efficiency boost, an overall 3.1 × speedup factor has been achieved with four GPU cards compared to a single GPU-based reconstruction. The total computation time is ∼30 s for typical clinical cases. Conclusions: The authors have developed a low-dose CBCT IR system for IGRT. By incorporating data consistency-based weighting factors in the IR model, cone/ring artifacts can be mitigated. A boost in computational efficiency is achieved by multi-GPU implementation. PMID:25370645
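The correction enters through the data-fidelity term of the reconstruction model. Written generically (the abstract does not give the exact functional), a weighted CS reconstruction has the form

\[
\min_{x \ge 0} \; \left\| W^{1/2}\,(A x - b) \right\|_{2}^{2} + \lambda\, R(x),
\]

where A is the cone-beam projection operator, b the measured projections, R(x) a sparsity regularizer, and W a diagonal matrix of per-detector-pixel weights. Down-weighting rays that are inconsistent with the reconstructed volume, as a function of position on the detector, is what suppresses the ring and cone artifacts in this scheme.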
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.
Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim
2016-08-01
In this paper, we propose a GPU-based cloud system for high-performance arrhythmia detection. The Pan-Tompkins algorithm is used for QRS detection, and the beat classification algorithm is optimized with k-nearest neighbours (k-NN). To support high-performance beat classification on the system, we parallelized the beat classification algorithm with CUDA to execute it on virtualized GPU devices in the cloud system. The MIT-BIH Arrhythmia database was used for validation of the algorithm. The system achieved a detection rate of about 93.5%, which is comparable to previous studies, while our algorithm shows 2.5 times faster execution than the CPU-only detection algorithm.
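The compute-heavy part of k-NN beat classification is evaluating the distance from a query beat to every training beat, which is embarrassingly parallel: one GPU thread per training beat. A minimal CUDA sketch under that assumption follows; the feature length, array layout, and host-side top-k selection are illustrative choices, not the authors' implementation.

#include <cuda_runtime.h>

// Squared Euclidean distance from one query beat to every training
// beat. train is laid out row-major: nTrain rows of featLen features.
__global__ void knnDistances(const float* train, const float* query,
                             float* dist, int nTrain, int featLen)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTrain) return;

    float d = 0.0f;
    for (int f = 0; f < featLen; ++f) {
        float diff = train[t * featLen + f] - query[f];
        d += diff * diff;
    }
    dist[t] = d;   // the k smallest entries are then selected on the host
}

int main()
{
    const int nTrain = 100000, featLen = 32;   // hypothetical sizes
    float *dTrain, *dQuery, *dDist;
    cudaMalloc(&dTrain, nTrain * featLen * sizeof(float));
    cudaMalloc(&dQuery, featLen * sizeof(float));
    cudaMalloc(&dDist, nTrain * sizeof(float));

    knnDistances<<<(nTrain + 255) / 256, 256>>>(dTrain, dQuery, dDist,
                                                nTrain, featLen);
    cudaDeviceSynchronize();

    cudaFree(dTrain); cudaFree(dQuery); cudaFree(dDist);
    return 0;
}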
Incompressible SPH (ISPH) with fast Poisson solver on a GPU
NASA Astrophysics Data System (ADS)
Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.
2018-05-01
This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU), capable of simulating several million particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed by this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as four separate stages, each with a particle sweep, including an algorithm for populating the PPE matrix that suits the GPU, and mixed-precision storage methods. An accurate and robust ISPH boundary condition well suited to parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dam breaks (2-D and 3-D), which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dam-break simulation, the GPU demonstrates speed-ups of 10-18 times over single-threaded and 1.1-4.5 times over 16-threaded CPU run times.
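The abstract singles out the Jacobi preconditioner as robust for fragmented flows, and it is easy to see why it suits the GPU: applying it is a per-row scaling by the PPE matrix diagonal with no coupling between threads. A minimal sketch, with array names assumed, is given below.

#include <cuda_runtime.h>

// Apply the Jacobi (diagonal) preconditioner z = D^{-1} r, where diag
// holds the diagonal of the pressure Poisson equation (PPE) matrix.
// One thread per particle/row, with no inter-thread communication.
__global__ void jacobiPrecond(const float* diag, const float* r,
                              float* z, int nRows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nRows)
        z[i] = r[i] / diag[i];
}

int main()
{
    const int n = 1 << 20;                 // hypothetical particle count
    float *dDiag, *dR, *dZ;
    cudaMalloc(&dDiag, n * sizeof(float));
    cudaMalloc(&dR, n * sizeof(float));
    cudaMalloc(&dZ, n * sizeof(float));
    // dDiag must hold the (non-zero) PPE diagonal before a real solve.
    jacobiPrecond<<<(n + 255) / 256, 256>>>(dDiag, dR, dZ, n);
    cudaDeviceSynchronize();
    cudaFree(dDiag); cudaFree(dR); cudaFree(dZ);
    return 0;
}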
GPU-based prompt gamma ray imaging from boron neutron capture therapy.
Yoon, Do-Kun; Jung, Joo-Young; Jo Hong, Key; Sil Lee, Keum; Suk Suh, Tae
2015-01-01
The purpose of this research is to perform fast reconstruction of a prompt gamma ray image using graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered-subset expectation maximization reconstruction algorithm using GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). The accuracy of the reconstructed image was also evaluated by receiver operating characteristic (ROC) curve analysis. The image reconstruction using the GPU was 196 times faster than the conventional reconstruction using the CPU. For the four BURs, the area under the curve values from the ROC analysis were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). The tomographic image from the prompt gamma ray events of the BNCT simulation was thus acquired using GPU computation quickly enough to permit reconstruction during treatment. The authors verified the feasibility of prompt gamma ray image reconstruction using GPU computation for BNCT simulations.
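The ordered-subset expectation maximization (OSEM) family that the reconstruction is built on (the authors' modification is not detailed in the abstract) updates every voxel j multiplicatively from one subset S_n of projections per pass:

\[
x_{j}^{(k+1)} = \frac{x_{j}^{(k)}}{\sum_{i \in S_{n}} a_{ij}} \sum_{i \in S_{n}} a_{ij}\, \frac{y_{i}}{\sum_{j'} a_{ij'}\, x_{j'}^{(k)}},
\]

where y_i are the measured prompt gamma projections and a_{ij} is the system matrix. Each voxel's update is independent given the forward projection, which is why the algorithm maps well onto one GPU thread per voxel and tolerates the reduced number of projections.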
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core 2 Quad Q6600 CPU and an NVIDIA GeForce 8800GT GPU, with software support from OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus one core of the CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setup (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setup (a), 16.8 in setup (b), and 20.0 in setup (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies.
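Load-prediction dynamic scheduling splits each step's workload between the CPU cores and the GPU in proportion to their recently measured throughputs, so neither side idles while the other finishes. A minimal host-side sketch of that policy is below; the smoothing weight and all names are assumptions, not the authors' exact scheduler.

#include <cstdio>

// Decide how many of n work units to give the GPU this step, based on
// throughputs (units per second) measured on previous steps; the
// remainder goes to the CPU cores (e.g., via OpenMP). Exponential
// smoothing keeps one noisy measurement from destabilising the split.
struct LoadPredictor {
    double cpuRate = 1.0, gpuRate = 1.0;  // smoothed throughput estimates

    int gpuShare(int n) const {
        return (int)(n * gpuRate / (cpuRate + gpuRate));
    }
    void update(double cpuMeasured, double gpuMeasured) {
        const double alpha = 0.5;         // smoothing weight (assumed)
        cpuRate = alpha * cpuMeasured + (1 - alpha) * cpuRate;
        gpuRate = alpha * gpuMeasured + (1 - alpha) * gpuRate;
    }
};

int main()
{
    LoadPredictor lp;
    lp.update(400.0, 6400.0);             // hypothetical measured rates
    printf("GPU gets %d of 1600 units\n", lp.gpuShare(1600));
    return 0;
}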
Parallel efficient rate control methods for JPEG 2000
NASA Astrophysics Data System (ADS)
Martínez-del-Amor, Miguel Á.; Bruns, Volker; Sparenberg, Heiko
2017-09-01
Since the introduction of JPEG 2000, several rate control methods have been proposed. Among them, post-compression rate-distortion optimization (PCRD-Opt) is the most widely used, and the one recommended by the standard. The approach followed by this method is to first compress the entire image split into code blocks, and subsequently truncate the set of generated bit streams optimally according to the maximum target bit rate constraint. The literature proposes various strategies for estimating ahead of time where a block will get truncated, in order to stop execution prematurely and save time. However, none of them was designed with a parallel implementation in mind. Today, multi-core and many-core architectures are becoming popular for JPEG 2000 codec implementations. Therefore, in this paper, we analyze how some techniques for efficient rate control can be deployed on GPUs. To do so, the design of our GPU-based codec is extended to allow the process to stop at a given point. This extension also harnesses a higher level of parallelism on the GPU, leading to up to 40% speedup with 4K test material on a Titan X. In a second step, three selected rate control methods are adapted and implemented in our parallel encoder. A comparison is then carried out and used to select the best candidate to be deployed in a GPU encoder, which gave an extra 40% speedup in those situations where it was actually employed.
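At the core of PCRD-Opt is the truncation step: for a given rate-distortion slope threshold lambda, each code block keeps the longest prefix of its bit stream whose slopes stay above lambda, and an outer search adjusts lambda until the total rate meets the target. A minimal sketch of the per-block step under those assumptions follows; the data layout is illustrative.

#include <cstddef>
#include <cstdio>
#include <vector>

// Candidate truncation points of one code block: cumulative rate in
// bytes and the rate-distortion slope (distortion decrease per byte),
// assumed sorted by decreasing slope as PCRD-Opt requires.
struct TruncPoint { size_t rate; double slope; };

// Keep the longest prefix whose slopes stay above lambda; return the
// truncated length of the block's bit stream. An outer bisection over
// lambda then drives the summed lengths to the target rate.
size_t truncateAt(const std::vector<TruncPoint>& pts, double lambda)
{
    size_t len = 0;
    for (const auto& p : pts) {
        if (p.slope < lambda) break;
        len = p.rate;
    }
    return len;
}

int main()
{
    std::vector<TruncPoint> block = {{120, 9.0}, {340, 4.5}, {610, 1.2}};
    printf("keep %zu bytes at lambda=2.0\n", truncateAt(block, 2.0));
    return 0;
}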
Parallel Computational Protein Design.
Zhou, Yichao; Donald, Bruce R; Zeng, Jianyang
2017-01-01
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that guarantees finding the global minimum energy conformation (GMEC) is to combine the dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computational bottleneck of large-scale protein design. To address this issue, we extend and add a new module to the OSPREY program previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) that implements a GPU-based massively parallel A* algorithm for improving the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for the A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedup in large protein design cases with a small memory overhead compared with the traditional A* search implementation, while still guaranteeing optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle problems in which the conformation space is too large and the global optimal solution could not previously be computed. Furthermore, the GPU-based A* algorithm implemented in gOSPREY can be combined with state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
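To make the DEE/A* framework concrete, the toy below runs A* over a tree of partial assignments with a purely additive energy and a heuristic that sums each unassigned position's best energy, an admissible bound, so the first complete node popped is optimal, which is the role A* plays in finding the GMEC. Pairwise energies, DEE pruning, and the GPU-evaluated heuristic of gOSPREY are all omitted; the sizes and energies are made up.

#include <cstdio>
#include <queue>
#include <vector>

// Toy A* over a tree of partial assignments: nPos positions, each with
// nRot rotamer choices and a per-choice energy (pairwise terms
// omitted). g is the energy assigned so far; h lower-bounds the best
// possible completion, so f = g + h never overestimates.
struct Node { int depth; double g, h; };
struct Cmp { bool operator()(const Node& a, const Node& b) const {
    return a.g + a.h > b.g + b.h; } };

int main()
{
    const int nPos = 4, nRot = 3;                 // hypothetical sizes
    double e[nPos][nRot] = {{2,1,3},{0,2,1},{1,1,4},{3,0,2}};

    // hTail[p] = sum over positions >= p of that position's minimum
    // energy: an admissible heuristic for any node at depth p.
    double hTail[nPos + 1]; hTail[nPos] = 0.0;
    for (int p = nPos - 1; p >= 0; --p) {
        double best = e[p][0];
        for (int r = 1; r < nRot; ++r) if (e[p][r] < best) best = e[p][r];
        hTail[p] = hTail[p + 1] + best;
    }

    std::priority_queue<Node, std::vector<Node>, Cmp> open;
    open.push({0, 0.0, hTail[0]});
    while (!open.empty()) {
        Node n = open.top(); open.pop();
        if (n.depth == nPos) {                    // complete: optimal by A*
            printf("GMEC energy (toy): %.1f\n", n.g);
            break;
        }
        for (int r = 0; r < nRot; ++r)            // expand one position
            open.push({n.depth + 1, n.g + e[n.depth][r], hTail[n.depth + 1]});
    }
    return 0;
}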