On implementation of EM-type algorithms in the stochastic models for a matrix computing on GPU
Gorshenin, Andrey K.
2015-03-10
The paper discusses the main ideas behind implementing EM-type algorithms on graphics processors and their application to probabilistic models based on Cox processes. An example of GPU-adapted MATLAB source code for finite normal mixtures, using matrix formulas for expectation-maximization, is given. Tests of GPU versus CPU computational efficiency are illustrated for different sample sizes.
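The EM iteration for a finite normal mixture can be sketched as follows. This is a minimal illustration in plain Python, not the paper's MATLAB code: the function names, the one-dimensional data, the starting values and the variance floor are all assumptions made for the example. The E-step fills an n-by-k responsibility matrix, which is the quantity a matrix formulation would compute in one vectorized operation on the GPU.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_normal_mixture(data, weights, means, variances, iterations=100):
    """Run EM for a k-component 1-D normal mixture (hypothetical helper)."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, k = len(data), len(weights)
    for _ in range(iterations):
        # E-step: n-by-k matrix of posterior component probabilities.
        resp = []
        for x in data:
            row = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(row)
            resp.append([r / total for r in row])
        # M-step: weighted averages taken column by column.
        for j in range(k):
            nj = sum(resp[i][j] for i in range(n))
            weights[j] = nj / n
            means[j] = sum(resp[i][j] * data[i] for i in range(n)) / nj
            var = sum(resp[i][j] * (data[i] - means[j]) ** 2 for i in range(n)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid collapse
    return weights, means, variances


random.seed(0)
sample = [random.gauss(-2.0, 1.0) for _ in range(300)]
sample += [random.gauss(3.0, 1.0) for _ in range(300)]
fit_w, fit_m, fit_v = em_normal_mixture(sample, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Each row of the responsibility matrix is independent of the others, which is why the E-step parallelizes naturally across samples.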
gpuPOM: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
GPU COMPUTING FOR PARTICLE TRACKING
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges arising from introducing the GPU are also discussed. General-Purpose computation on Graphics Processing Units (GPGPU) brings massively parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU because it is embarrassingly parallel, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MP) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
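The thread, block and grid layout described above, with one thread per particle, can be mimicked on the CPU as a sketch. The core counts come from the abstract; everything else is an assumption for illustration: the function names, the block size of 256, and especially the tracking map, which is a toy Hénon map standing in for the real lattice model of Tracy++.

```python
import math

MULTIPROCESSORS = 14   # Tesla C2050, from the abstract
CORES_PER_MP = 32      # from the abstract
TOTAL_CORES = MULTIPROCESSORS * CORES_PER_MP  # 448


def blocks_needed(num_particles, threads_per_block):
    """Number of thread blocks so that every particle gets a thread."""
    return math.ceil(num_particles / threads_per_block)


def global_thread_index(block_idx, block_dim, thread_idx):
    """Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA."""
    return block_idx * block_dim + thread_idx


def survives(x0, turns=1000, tune=0.205, limit=1.0e3):
    """Track one particle through a toy Henon map (stand-in, not Tracy++)."""
    x, p = x0, 0.0
    c, s = math.cos(2.0 * math.pi * tune), math.sin(2.0 * math.pi * tune)
    for _ in range(turns):
        kick = p + x * x
        x, p = c * x + s * kick, -s * x + c * kick
        if abs(x) > limit or abs(p) > limit:
            return False  # particle lost
    return True


def launch(amplitudes, threads_per_block=256):
    """Serial emulation of a kernel launch: one thread tracks one particle."""
    results = [None] * len(amplitudes)
    for block_idx in range(blocks_needed(len(amplitudes), threads_per_block)):
        for thread_idx in range(threads_per_block):
            tid = global_thread_index(block_idx, threads_per_block, thread_idx)
            if tid < len(amplitudes):  # guard for the partially filled last block
                results[tid] = survives(amplitudes[tid])
    return results
```

Because no particle depends on another, the two nested loops in `launch` carry no data dependencies, which is the property that makes the calculation embarrassingly parallel.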
GPU programming for biomedical imaging
NASA Astrophysics Data System (ADS)
Caucci, Luca; Furenlid, Lars R.
2015-08-01
Scientific computing is rapidly advancing due to the introduction of powerful new computing hardware, such as graphics processing units (GPUs). Affordable thanks to mass production, GPU processors enable the transition to efficient parallel computing by bringing the performance of a supercomputer to a workstation. We elaborate on some of the capabilities and benefits that GPU technology offers to the field of biomedical imaging. As practical examples, we consider a GPU algorithm for the estimation of position of interaction from photomultiplier (PMT) tube data, as well as a GPU implementation of the MLEM algorithm for iterative image reconstruction.
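For readers unfamiliar with MLEM, the standard multiplicative update can be sketched as below. This is a small CPU illustration of the textbook algorithm, not the authors' GPU implementation; the dense list-of-lists system matrix, the function name and the toy data are assumptions.

```python
def mlem(system, counts, iterations=50):
    """Textbook MLEM: system[i][j] is the probability that an emission in
    pixel j is recorded in detector bin i; counts[i] is the measured data."""
    n_det, n_pix = len(system), len(system[0])
    # Sensitivity of each pixel: column sums of the system matrix.
    sensitivity = [sum(system[i][j] for i in range(n_det)) for j in range(n_pix)]
    image = [1.0] * n_pix  # strictly positive starting estimate
    for _ in range(iterations):
        # Forward projection of the current estimate.
        projection = [sum(system[i][j] * image[j] for j in range(n_pix))
                      for i in range(n_det)]
        # Ratio of measured to predicted counts in each detector bin.
        ratio = [counts[i] / projection[i] if projection[i] > 0 else 0.0
                 for i in range(n_det)]
        # Back-project the ratios and apply the multiplicative update.
        image = [image[j] / sensitivity[j]
                 * sum(system[i][j] * ratio[i] for i in range(n_det))
                 for j in range(n_pix)]
    return image


toy_system = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_counts = [2.0, 4.0, 3.0]  # noise-free data generated from the image [2, 4]
toy_image = mlem(toy_system, toy_counts)
```

The forward projection and the back-projection are both sums over independent rows or columns, so each can be assigned to one GPU thread per detector bin or per pixel.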
GPU accelerated dislocation dynamics
NASA Astrophysics Data System (ADS)
Ferroni, Francesco; Tarleton, Edmund; Fitzgerald, Steven
2014-09-01
In this paper we analyze the computational bottlenecks in discrete dislocation dynamics modeling (associated with segment-segment interactions as well as the treatment of free surfaces), discuss the parallelization and optimization strategies, and demonstrate the effectiveness of Graphical Processing Unit (GPU) computation in accelerating dislocation dynamics simulations and expanding their scope. Individual algorithmic benchmark tests as well as an example large simulation of a thin film are presented.
NASA Astrophysics Data System (ADS)
Chase, Patrick; Vondran, Gary
2011-01-01
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well-known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored, including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune them based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per-multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impact of lookup table density on computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators using SSE vector operations that are faster than any previously published solution.
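As a reference point, 3D tetrahedral interpolation on a regular lookup table can be sketched in a few lines. This is the generic sorted-fraction formulation, not the tuned GPU kernels of the paper; the function name, the unit grid spacing and the single scalar output channel are assumptions (a color conversion would apply the same weights to each output channel).

```python
def tetrahedral_interpolate(lut, x, y, z):
    """Interpolate lut[i][j][k] (an n*n*n grid with unit spacing) at (x, y, z),
    where each coordinate is assumed to lie in [0, n - 1]."""
    n = len(lut)
    corner, frac = [], []
    for v in (x, y, z):
        i = min(int(v), n - 2)  # clamp so the upper edge uses the last cell
        corner.append(i)
        frac.append(v - i)
    # Visit the axes in order of decreasing fractional part; this choice
    # selects one of the six tetrahedra that tile the cube.
    order = sorted(range(3), key=lambda axis: frac[axis], reverse=True)
    previous = lut[corner[0]][corner[1]][corner[2]]
    result = previous
    for axis in order:
        corner[axis] += 1  # step to the next vertex of the tetrahedron
        current = lut[corner[0]][corner[1]][corner[2]]
        result += frac[axis] * (current - previous)
        previous = current
    return result


# A table sampled from a linear function is reproduced exactly.
linear_lut = [[[2.0 * i + 3.0 * j - k + 1.0 for k in range(3)]
               for j in range(3)] for i in range(3)]
```

Only four table entries are read per lookup, compared with eight for trilinear interpolation, which is one reason the memory access pattern matters so much in the GPU version.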
Astronomy for/with underprivileged children in Limeira
NASA Astrophysics Data System (ADS)
Bretones, P. S.; Oliveira, V. C.
2003-08-01
In 2001, the Instituto Superior de Ciências Aplicadas (ISCA Faculdades de Limeira) began a project through which the Observatório do Morro Azul entered into a partnership with the Centro de Promoção Social Municipal (CEPROSOM), an institution maintained by the Limeira city government to serve underprivileged children and adolescents. CEPROSOM ran two projects, the Projeto Centro de Convivência Infantil (CCI) and the Programa Criança e Adolescente (PCA), which served children and adolescents in community centers in various parts of the city. The priority of these projects is to provide enjoyable activities for the children in order to keep them off the streets. The children thus gained one more kind of activity: visits to the observatory. This poster describes the various phases of the project, which involved planning meetings, an astronomy course for the CCI and PCA supervisors, activities related to the children's visits to the Observatory, a proposal to build gnomons and sundials at the various community centers in Limeira, and publicity for the project in the press. The poster includes discussions of how underprivileged children learn, accounts showing the supervisors' views on the relevance of teaching astronomy, accounts from the monitor who received the visitors at the Observatory, and what the number of children served represented for the institution's activities since it began operating and, in particular, in 2001. The results are based on an analysis of accounts from the supervisors and the Observatory monitor, visit records, and local press articles. It concludes with an assessment of what the project meant for the participating institutions. For the Observatory in particular, an analysis was made in relation to the other kinds of visits, which involve school students and the general public. The question of the Observatory's social commitment to the education of
NASA Astrophysics Data System (ADS)
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
GPU-accelerated compressive holography.
Endo, Yutaka; Shimobaba, Tomoyoshi; Kakue, Takashi; Ito, Tomoyoshi
2016-04-18
In this paper, we show fast signal reconstruction for compressive holography using a graphics processing unit (GPU). We implemented a fast iterative shrinkage-thresholding algorithm on a GPU to solve the ℓ_{1} and total variation (TV) regularized problems that are typically used in compressive holography. Since the algorithm is highly parallel, GPUs can compute it efficiently by data-parallel computing. For better performance, our implementation exploits the structure of the measurement matrix to compute the matrix multiplications. The results show that GPU-based implementation is about 20 times faster than CPU-based implementation. PMID:27137282
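The fast iterative shrinkage-thresholding algorithm (FISTA) mentioned above can be sketched for the ℓ_{1} case as follows. This is a generic CPU version for minimizing 0.5·||Ax − b||² + λ·||x||₁ and makes several assumptions: a small dense real matrix, the helper names, and a step size taken from the squared Frobenius norm (a safe upper bound on the Lipschitz constant). It covers neither the TV-regularized problem nor the structured measurement matrix that the paper exploits.

```python
import math


def matvec(a, x):
    """Dense matrix-vector product A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matvec_transposed(a, r):
    """Dense product A^T r."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]


def soft_threshold(v, threshold):
    """Proximal operator of the l1 norm, applied element by element."""
    return [math.copysign(max(abs(x) - threshold, 0.0), x) for x in v]


def fista_l1(a, b, lam, iterations=300):
    """FISTA for 0.5 * ||A x - b||^2 + lam * ||x||_1 (illustrative sketch)."""
    # 1 / ||A||_F^2 never exceeds 1 / L, so the step is always admissible.
    step = 1.0 / sum(v * v for row in a for v in row)
    x = [0.0] * len(a[0])
    y = list(x)
    t = 1.0
    for _ in range(iterations):
        residual = [p - q for p, q in zip(matvec(a, y), b)]
        gradient = matvec_transposed(a, residual)
        x_new = soft_threshold([yj - step * gj for yj, gj in zip(y, gradient)],
                               lam * step)
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum step that distinguishes FISTA from plain ISTA.
        y = [xn + (t - 1.0) / t_new * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sparse_estimate = fista_l1(identity, [3.0, 0.2, -2.0], lam=0.5)
```

With an identity matrix the minimizer is simply the soft-thresholded data, which makes the toy case easy to check. The thresholding and the vector updates are element-wise, so the matrix products are the only steps that need a parallel reduction.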
GPU-Powered Coherent Beamforming
NASA Astrophysics Data System (ADS)
Magro, A.; Adami, K. Zarb; Hickish, J.
2015-03-01
Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
OV-Wav: a new package for multiscale analysis in astronomy
NASA Astrophysics Data System (ADS)
Pereira, D. N. E.; Rabaça, C. R.
2003-08-01
Wavelets and other forms of multiscale analysis have been widely used in many fields of knowledge and are recognized as superior to more traditional techniques, such as Fourier and Gabor analysis, in certain applications. Although wavelet theory began to be developed almost thirty years ago, its impact on the study of astronomical images was small until quite recently. We present a set of programs developed over the last three years at the Observatório do Valongo/UFRJ that make it possible to apply this powerful tool to common problems in astronomy, such as noise removal, hierarchical source detection, and the modeling of objects with arbitrary brightness profiles under non-ideal conditions. This package, developed to run on the IDL platform, had its first version completed recently and is being made openly available to the scientific community. We also show the results of controlled tests to which we subjected the programs, applying them to artificial images, with satisfactory results. Some astrophysical applications were studied with the package on an experimental basis, including the analysis of the diffuse light component in Hickson compact groups of galaxies and the study of substructures of planetary nebulae in multiscale space.
GPU applications for data processing
Vladymyrov, Mykhailo; Aleksandrov, Andrey; Tioukov, Valeri
2015-12-31
Modern experiments that use nuclear photoemulsion rely on fast and efficient data acquisition from the emulsion. New approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us to raise the scanning speed by a factor of nine.
A grid of theoretical profiles for massive stars in transition
NASA Astrophysics Data System (ADS)
Nascimento, C. M. P.; Machado, M. A.
2003-08-01
At the XXVIII Annual Meeting of the Sociedade Astronômica Brasileira (2002) we presented a grid of profiles calculated along the points of the evolutionary track with solar metallicity, Z = 0.02, and standard mass-loss rate, for stars with initial masses of 25, 40, 60, 85 and 120 solar masses. These profiles were calculated with the aid of a numerical code suited to describing the winds of massive objects, assuming spherical symmetry, stationarity and homogeneity. In the present work, we complete the grid with the theoretical profiles for the tracks with Z = 0.02 and a mass-loss rate twice the standard one (2×), and with metallicity Z = 0.008. For each point of the three tracks we obtain the theoretical profiles of Hα, Hβ, Hγ and Hδ, and as expected they appear in pure emission, pure absorption, or as P Cygni profiles. For very low mass-loss rates (~10^-7) no lines form, which is seen at the first points of all tracks. In general, for a given point the emission component decreases and the absorption increases from Hα to Hδ. We find that the tracks with Z = 0.02 and the standard mass-loss rate have fewer loops than those with Z = 0.02 and twice the standard rate, and their profiles are, in general, less intense. For the Z = 0.008 track, there are fewer loops and a larger variation in luminosity, and its profiles are, on some tracks, more intense. We also find that distinct points on the same track show different profiles for similar values of luminosity and effective temperature. A grid of theoretical profiles therefore appears useful for providing preliminary information on the evolutionary stage of a massive star.
Observational constraints on the s-process in barium giant stars
NASA Astrophysics Data System (ADS)
Smiljanic, R. H. S.; Porto de Mello, G. F.; da Silva, L.
2003-08-01
Barium stars are red giants of type GK that show atmospheric excesses of the s-process elements. Such excesses are expected in stars in the thermally pulsing phase of the AGB (TP-AGB). Barium stars are, however, less massive and less luminous than AGB stars, so they could not have enriched themselves. Their enrichment would originate from a companion star, initially more massive, which evolves through the TP-AGB, enriches itself with s-process elements, and transfers contaminated material to the atmosphere of the present barium star. The companion then evolves into a white dwarf and is no longer directly observed. Barium stars are therefore useful as observational tests for theories of s-process nucleosynthesis, convection and mass loss. Detailed abundance analyses with high-quality data for these objects are still scarce in the literature. In this work we construct model atmospheres and, through a differential analysis, determine atmospheric and evolutionary parameters for a sample of ten barium giants and four normal giants. We determine their abundance patterns for Na, Mg, Al, Si, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Sr, Y, Zr, Ba, La, Ce, Nd, Sm, Eu and Gd, concluding that some stars classified in the literature as barium giants are in fact normal giants. We compare two mean abundance patterns, for stars with large excesses and stars with moderate excesses, with theoretical models of s-process enrichment. Both groups of stars are fitted by the same neutron-exposure parameters. This result suggests that the occurrence of the barium phenomenon with different intensities is not due to different neutron exposures. We also discuss nucleosynthetic effects, linked to the s-process, suggested in the literature for the elements Cu, Mn, V and Sc.
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover the principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing volume of datasets from different domains, such as social media, earth observation and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to a traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as devices specifically targeted at graphics, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference : 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
NASA Astrophysics Data System (ADS)
Che, Ming-Chao; Liang, Jie
2010-01-01
JPEG XR (formerly Microsoft Windows Media Photo and HD Photo) is the latest image coding standard. By integrating various advanced technologies such as the integer hierarchical lapped transform, context-adaptive Huffman coding, and high dynamic range coding, it achieves performance competitive with JPEG-2000, but with lower computational complexity and memory requirements. In this paper, the GPU implementation of the JPEG XR codec using NVIDIA CUDA (Compute Unified Device Architecture) technology is investigated. Design considerations for speeding up the algorithm are discussed, taking full advantage of the properties of the CUDA framework and JPEG XR. Experimental results are presented to demonstrate the performance of the GPU implementation.
Randomized selection on the GPU
Monroe, Laura Marie; Wendelberger, Joanne R; Michalak, Sarah E
2011-01-13
We implement here a fast and memory-sparing probabilistic top N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic guess-and-check process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
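The guess-and-check idea can be illustrated with a serial sketch. This is not the authors' GPU algorithm; it is a generic Las Vegas selection written for this listing, and the sample size, the bracket margin, the failure cap and the function name are all assumptions. A small random sample is used to guess a bracket that probably contains the Nth smallest element, a counting pass checks the guess, and only the elements inside a confirmed bracket are kept for further work.

```python
import math
import random


def select_nth_smallest(values, n, sample_size=32, margin=4, max_failures=20,
                        rng=random):
    """Return the n-th smallest value (1-based) of a list of numbers."""
    if not 1 <= n <= len(values):
        raise ValueError("n must be between 1 and len(values)")
    candidates = list(values)
    rank = n  # rank of the target within the current candidates
    failures = 0
    while len(candidates) > sample_size and failures < max_failures:
        sample = sorted(rng.sample(candidates, sample_size))
        # Guess: the target lies near this position of the sorted sample.
        pos = (rank - 1) * sample_size // len(candidates)
        lo = sample[pos - margin] if pos - margin >= 0 else -math.inf
        hi = sample[pos + margin] if pos + margin < sample_size else math.inf
        # Check: count what lies below the bracket and collect what lies inside.
        below = sum(1 for v in candidates if v < lo)
        inside = [v for v in candidates if lo <= v <= hi]
        if below < rank <= below + len(inside):
            if lo == hi:
                return lo  # the bracket holds a single repeated value
            if len(inside) < len(candidates):
                candidates, rank = inside, rank - below
                continue
        failures += 1  # wrong or useless guess: draw a new sample
    # Small remainder, or too many failed guesses: finish deterministically.
    return sorted(candidates)[rank - 1]
```

The result never depends on the random draws, only the running time does, which is what makes this a Las Vegas rather than a Monte Carlo method. The counting and filtering passes are the data-parallel parts.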
BSSDATA - an optimized program for data filtering in solar radio astronomy
NASA Astrophysics Data System (ADS)
Martinon, A. R. F.; Sawant, H. S.; Fernandes, F. C. R.; Stephany, S.; Preto, A. J.; Dobrowolski, K. M.
2003-08-01
The Brazilian Solar Spectroscope (BSS) entered regular operation at INPE, in São José dos Campos, in 1998. The BSS is dedicated to observing decimetric solar bursts with high temporal and spectral resolution, with the main aim of investigating phenomena associated with the energy release of solar flares. Between 1999 and 2002, approximately 340 solar bursts were catalogued and classified into 8 distinct types according to their morphological characteristics. The detailed analysis of each type, or group, of solar bursts must take into account the variation of the quiet-Sun flux (background) with frequency and its variation in time, as well as the complexity of the bursts and fine structures recorded superimposed on the variable background. The BSSData program was developed to carry out this analysis. Written in C++, it consists of several tools that assist in the processing and analysis of the data recorded by the BSS. In this work we address the tools for filtering the background noise. The BSSData noise-filtering routines were tested on the various groups of solar bursts ("dots", "fiber", "lace", "patch", "spikes", "type III" and "zebra"), achieving a good reduction of the background noise and consequently yielding data in which the signal is more homogeneous, highlighting the regions where solar bursts occur and making the determination of each burst's observational parameters more precise. These results will be presented and discussed.
Parallelization of MODFLOW using a GPU library.
Ji, Xiaohui; Li, Dandan; Cheng, Tangpei; Wang, Xu-Sheng; Wang, Qun
2014-01-01
A new method based on a graphics processing unit (GPU) library is proposed in the paper to parallelize MODFLOW. Two programs, GetAb_CG and CG_GPU, have been developed to reorganize the equations in MODFLOW and solve them with the GPU library. Experimental tests using the NVIDIA Tesla C1060 show that a 1.6- to 10.6-fold speedup can be achieved for models with more than 10^5 cells. The efficiency can be further improved by using up-to-date GPU devices. PMID:23937315
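The abstract does not name the solver, but the program names GetAb_CG and CG_GPU suggest a conjugate-gradient (CG) method applied to the assembled system Ax = b. Under that assumption, which is an inference from the names only, the textbook CG iteration for a symmetric positive-definite matrix looks like the sketch below. The dense storage, the helper names and the toy matrix are illustrative; a groundwater model would use a sparse matrix and usually a preconditioner.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(a, x):
    """Dense matrix-vector product A x."""
    return [dot(row, x) for row in a]


def conjugate_gradient(a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (textbook CG)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero starting guess
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small tridiagonal system resembling a 1-D finite-difference stencil.
stencil = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
heads = conjugate_gradient(stencil, [1.0, 0.0, 1.0])
```

Each iteration consists of one matrix-vector product, two inner products and three vector updates, all of which are available as routines in common GPU linear algebra libraries.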
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
GPU Accelerated Vector Median Filter
NASA Technical Reports Server (NTRS)
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in terms of distance. General-purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has, to the best of our knowledge, never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement in the performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
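The pairwise comparison that makes the filter expensive is easy to see in a direct CPU sketch. The version below is a plain reference implementation, not the CUDA kernel of the paper; the Euclidean distance, the edge replication at the image border and the function names are assumptions.

```python
import math


def vector_median(vectors):
    """Return the vector whose summed distance to all the others is smallest."""
    return min(vectors, key=lambda v: sum(math.dist(v, w) for w in vectors))


def vector_median_filter(image, window=3):
    """Apply a window x window vector median filter to a 2-D grid of RGB tuples.
    Border pixels are handled by replicating the nearest edge pixel."""
    height, width = len(image), len(image[0])
    radius = window // 2
    output = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbours = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            output[y][x] = vector_median(neighbours)
    return output


# A single red outlier in a grey image is removed by the filter.
noisy = [[(10, 10, 10)] * 3 for _ in range(3)]
noisy[1][1] = (255, 0, 0)
cleaned = vector_median_filter(noisy)
```

Every output pixel is computed from its own window only, so a GPU version can assign one thread per pixel; the n^2 by n^2 distance evaluations inside `vector_median` are the cost that the abstract refers to.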
SIFT implementation based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Chao; Geng, Ze-xun; Wei, Xiao-feng; Shen, Chen
2013-08-01
Image matching is one of the core research topics of digital photogrammetry and computer vision. The SIFT (Scale-Invariant Feature Transform) algorithm is a feature matching algorithm based on local invariant features, proposed by Lowe in 1999. SIFT features are invariant to image rotation and scaling, and partially invariant to changes in 3D camera viewpoint and illumination. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. The algorithm is therefore widely used in image matching and in 3D reconstruction based on stereo images. Traditional implementations and optimizations of the SIFT algorithm generally target the CPU. Because of the large number of extracted features (even a few objects can yield a large number of SIFT features), the high dimensionality of the feature vector (usually a 128-dimensional SIFT feature vector), and the complexity of the algorithm, SIFT runs slowly on the CPU and struggles to meet real-time requirements. The Programmable Graphics Processing Unit (PGPU) is commonly used in current computer graphics as a dedicated device for image processing. The development experience of recent years shows that a high-performance GPU can achieve 10 times the single-precision floating-point performance of a high-performance desktop CPU of the same period, while the GPU's memory bandwidth is up to five times that of a desktop platform of the same period. For the same computing power, the cost and power consumption of a GPU should be lower than those of a CPU-based system. At the same time, because graphics rendering and image processing are parallel in nature, GPU-accelerated image processing has become an efficient solution for algorithms with real-time requirements. In this paper, we implemented the algorithm in the OpenGL shading language and compared it with the results obtained on the CPU
The experience of GPU calculations at Lunarc
NASA Astrophysics Data System (ADS)
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever increasing demand for computational speed and use of ever larger datasets, multi-GPU installations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually led to the procurement of the GPU resource Timaeus, which is a four-node eight-GPU cluster with two Nvidia M2050 GPU cards per node. The resource is housed within the larger cluster Platon and shares disk, network and system resources with that cluster. The inauguration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised a two-day workshop on GPU computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Luleå in fields ranging from Astrophysics to Molecular Chemistry. We are investigating the use of GPUs with commercial software packages and user-supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU computing. It is clear that the combination of GPUs and CPUs is becoming an important part of high performance computing, and here we describe what has been done at Lunarc regarding GPU computations, how we will continue to investigate the new and upcoming multi-GPU servers, and how they can be utilized in our environment.
Enhancing professionalism at GPU Nuclear
Coe, R.P.
1992-01-01
Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing professionalism at its Oyster Creek and Three Mile Island nuclear generating stations. The program was also to include its corporate headquarters in Parsippany, New Jersey. The overall program was to take several directions, including on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for senior reactor operators, and expanded teamwork and leadership training for control room crew. The largest portion of this initiative was the development and delivery of professionalism training to the nearly 2,000 people at both nuclear generating sites.
GPU Developments for General Circulation Models
NASA Astrophysics Data System (ADS)
Appleyard, Jeremy; Posey, Stan; Ponder, Carl; Eaton, Joe
2014-05-01
Current trends in high performance computing (HPC) are moving towards the use of graphics processing units (GPUs) to achieve speedups through the extraction of fine-grain parallelism of application software. GPUs have been developed exclusively for computational tasks as massively-parallel co-processors to the CPU, and during 2013 an extensive set of new HPC architectural features were developed in a 4th generation of NVIDIA GPUs that provide further opportunities for GPU acceleration of general circulation models used in climate science and numerical weather prediction. Today computational efficiency and simulation turnaround time continue to be important factors behind scientific decisions to develop models at higher resolutions and deploy increased use of ensembles. This presentation will examine the current state of GPU parallel developments for stencil based numerical operations typical of dynamical cores, and introduce new GPU-based implicit iterative schemes with GPU parallel preconditioning and linear solvers based on ILU, Krylov methods, and multigrid. Several GCMs show substantial gain in parallel efficiency from second-level fine-grain parallelism under first-level distributed memory parallel through a hybrid parallel implementation. Examples are provided relevant to science-scale HPC practice of CPU-GPU system configurations based on model resolution requirements of a particular simulation. Performance results compare use of the latest conventional CPUs with and without GPU acceleration. Finally a forward looking discussion is provided on the roadmap of GPU hardware, software, tools, and programmability for GCM development.
GPU-based Multilevel Clustering.
Chiosa, Iurie; Kolb, Andreas
2010-04-01
The processing power of parallel co-processors like the Graphics Processing Unit (GPU) is dramatically increasing. However, up until now only a few approaches have been presented to utilize this kind of hardware for mesh clustering purposes. In this paper we introduce a Multilevel clustering technique designed as a parallel algorithm and solely implemented on the GPU. Our formulation uses the spatial coherence present in the cluster optimization and hierarchical cluster merging to significantly reduce the number of comparisons in both parts. Our approach provides a fast, high quality and complete clustering analysis. Furthermore, based on the original concept we present a generalization of the method to data clustering. All advantages of the mesh-based techniques smoothly carry over to the generalized clustering approach. Additionally, this approach solves the problem of the missing topological information inherent to general data clustering and leads to a Local Neighbors k-means algorithm. We evaluate both techniques by applying them to Centroidal Voronoi Diagram (CVD) based clustering. Compared to classical approaches, our techniques generate results with at least the same clustering quality. Our technique proves to scale very well, currently being limited only by the available amount of graphics memory. PMID:20421676
GPU architecture usage for efficient image scaling
NASA Astrophysics Data System (ADS)
Skakov, P.
2013-05-01
The specifics of graphics processing unit (GPU) architecture are considered. Optimization opportunities for image processing algorithms, such as use of the texture filtering block, are presented. The accuracy of image scaling and driver-dependent usage specifics are noted.
CULA: hybrid GPU accelerated linear algebra routines
NASA Astrophysics Data System (ADS)
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N^3) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
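To make the O(N^3) cost concrete, here is a minimal CPU-only sketch of LU decomposition with partial pivoting in plain Python. It is not CULA's implementation and uses no GPU; the function name `lu_decompose` and the list-of-lists matrix layout are assumptions chosen for illustration.

```python
def lu_decompose(a):
    """Return (perm, L, U) such that L*U equals the rows of `a` taken in order `perm`."""
    n = len(a)
    u = [row[:] for row in a]            # working copy that is reduced to U
    l = [[0.0] * n for _ in range(n)]    # unit lower-triangular factor
    perm = list(range(n))                # perm[i] = original index of row i
    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        for i in range(k + 1, n):
            f = u[i][k] / u[k][k]
            l[i][k] = f
            # Each row update is independent of the others, which is the
            # part that lends itself to wide parallel execution.
            for j in range(k, n):
                u[i][j] -= f * u[k][j]
    return perm, l, u
```

The pivot search is a short, mostly serial step, while the trailing row updates are wide and independent; that contrast is one way to picture the division of labor a hybrid CPU/GPU model can exploit.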
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2009-01-01
Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on the use of atomic instructions that have recently become available in NVIDIA devices.
Cholla: 3D GPU-based hydrodynamics code for astrophysical simulation
NASA Astrophysics Data System (ADS)
Schneider, Evan E.; Robertson, Brant E.
2016-07-01
Cholla (Computational Hydrodynamics On ParaLLel Architectures) models the Euler equations on a static mesh and evolves the fluid properties of thousands of cells simultaneously using GPUs. It can update over ten million cells per GPU-second while using an exact Riemann solver and PPM reconstruction, allowing computation of astrophysical simulations with physically interesting grid resolutions (>256^3) on a single device; calculations can be extended onto multiple devices with nearly ideal scaling beyond 64 GPUs.
GPU-accelerated computation of electron transfer.
Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco
2012-11-01
Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. PMID:22847673
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-01
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using a compute unified device architecture (CUDA) GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with those by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods yield perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Fast quantum Monte Carlo on a GPU
NASA Astrophysics Data System (ADS)
Lutsyshyn, Y.
2015-02-01
We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
Parallel LU Factorization on GPU cluster
D'Azevedo, Ed F; Hill, Judith C
2012-01-01
This paper describes our progress in developing software for performing parallel LU factorization of a large dense matrix on a GPU cluster. Three approaches, with increasing software complexity, are considered: (i) a naive 'thunking' approach that links the existing parallel ScaLAPACK software library with cuBLAS through a software emulation layer; (ii) a more intrusive magmaBLAS implementation integrated into the LU solver in the High-Performance Linpack software; and (iii) a left-looking out-of-core algorithm for solving problems that are larger than the available memory on GPU devices. A comparison of the performance gains versus the current ScaLAPACK PZGETRF is provided.
Colloquium: Large scale simulations on GPU clusters
NASA Astrophysics Data System (ADS)
Bernaschi, Massimo; Bisson, Mauro; Fatica, Massimiliano
2015-06-01
Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve excellent efficiency for applications in statistical mechanics, particle dynamics and network analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may also be applied to other problems like the solution of Partial Differential Equations.
NASA Astrophysics Data System (ADS)
Souza, T. R.; Baptista, R.
2003-08-01
Secondary stars in cataclysmic variables (CVs) and low-mass X-ray binaries (LMXBs) are crucial for understanding the origin, evolution and behavior of these interacting binaries. They are magnetically active stars subjected to extreme environmental conditions [e.g., they are very close to a hot, irradiating source; they rotate extremely rapidly and have a distorted shape; they are losing mass at rates of 10^-8 to 10^-10 M☉/yr] which contribute to making their properties distinct from those of main-sequence stars of the same mass. On the other hand, the irradiation pattern on the face of the secondary provides information on the geometry of the accretion structures around the primary star. Thus, obtaining images of the surface of these stars is of great astrophysical interest. Roche tomography uses the variations in the emission/absorption line profiles of the secondary star as a function of orbital phase to map the brightness distribution on its surface. In this work we present initial results from the development of a program for mapping the surface brightness distribution of secondary stars in CVs and LMXBs using astro-tomography techniques. At present we have a working code that simulates the line-profile variations due to the Doppler effect resulting from the combined rotation and translation of a Roche-lobe-shaped star about the center of mass of the binary, as a function of the brightness distribution on the surface of this star. The code also produces the light curve resulting from the changes in aspect of the star as a function of orbital phase (ellipsoidal variations).
Locality-Driven Dynamic GPU Cache Bypassing
Li, Chao; Song, Shuaiwen; Dai, Hongwen; Sidelnik, A.; Hari, Siva; Zhou, Huiyang
2015-06-07
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. Based on the reuse characteristics of GPU workloads, we propose a design that integrates such efficient locality filtering capability into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
GPU Computing in Space Weather Modeling
NASA Astrophysics Data System (ADS)
Feng, X.; Zhong, D.; Xiang, C.; Zhang, Y.
2013-04-01
Space weather refers to conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and that affect human life or health. In order to make real-time or faster than real-time numerical predictions of adverse space weather events and their influence on the geospace environment, high-performance computational models are required. The main objective of this article is to explore the application of programmable graphics processing units (GPUs) to numerical space weather modeling for the study of the solar wind background, which is a crucial part of numerical space weather modeling. GPU programming is realized for our Solar-Interplanetary-CESE MHD model (SIP-CESE MHD model) by numerically studying the solar corona/interplanetary solar wind. The global solar wind structures are obtained by the established GPU model with the magnetic field synoptic data as input. The simulated global structures for Carrington rotation 2060 agree well with solar observations and solar wind measurements from spacecraft near the Earth. The model's implementation of adaptive mesh refinement (AMR) and the message passing interface (MPI) enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. Our initial tests with available hardware show speedups of roughly 5x compared to the traditional software implementation. This work presents a novel application of GPUs to space weather study.
Hyperspectral image feature extraction accelerated by GPU
NASA Astrophysics Data System (ADS)
Qu, HaiCheng; Zhang, Ye; Lin, Zhouhan; Chen, Hao
2012-10-01
The principal component analysis (PCA) algorithm is the most basic method of dimension reduction for high-dimensional data, and it plays a significant role in hyperspectral data compression, decorrelation, denoising and feature extraction. With the development of imaging technology, the number of spectral bands in a hyperspectral image keeps growing, and the data cube has become larger in recent years. As a consequence, dimension reduction is increasingly time-consuming. Fortunately, GPU-based high-performance computing has opened up a novel approach for hyperspectral data processing. This paper concerns the two main processes in hyperspectral image feature extraction: (1) calculation of the transformation matrix; (2) transformation in the spectral dimension. These two processes are computationally intensive and data-intensive, respectively. Through the introduction of GPU parallel computing technology, an algorithm containing PCA transformation based on eigenvalue decomposition (EVD) and feature matching identification is implemented, with the aim of exploring the characteristics of GPU parallel computing and the prospects of GPU application in hyperspectral image processing by analysing thread invocation and the speedup of the algorithm. Finally, the experimental results show that the algorithm reaches a 12x speedup overall, with certain steps reaching speedups of up to 270 times.
GPU-based fast gamma index calculation
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Jia, Xun; Jiang, Steve B.
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of γ-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computationally intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on one single NVIDIA Tesla C1060 card, achieving 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on CPU is proportional to the summation of γ^n over all voxels, while that on GPU is affected by γ^n distributions and is approximately proportional to the γ^n summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference and distance-to-agreement criteria also have an impact on γ-index calculation time.
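The exhaustive search described in this abstract can be sketched in one dimension as follows. This is a brute-force CPU illustration of the usual γ definition (the minimum, over evaluated points, of the combined distance and dose-difference terms scaled by their criteria); it uses neither the geometric method nor the radial pre-sorting the authors combine, and the function name, parameter names and uniform grid spacing are assumptions.

```python
import math

def gamma_index_1d(ref, evl, spacing, dta, dose_tol):
    """Brute-force gamma index of each reference point against an evaluated profile.

    ref, evl  -- dose samples on the same uniform 1D grid
    spacing   -- grid spacing (same length unit as dta)
    dta       -- distance-to-agreement criterion
    dose_tol  -- dose-difference criterion (same unit as the doses)
    """
    gammas = []
    for i, d_ref in enumerate(ref):
        best = math.inf
        # Exhaustive search over every evaluated point.
        for j, d_evl in enumerate(evl):
            dist = (j - i) * spacing
            diff = d_evl - d_ref
            best = min(best, (dist / dta) ** 2 + (diff / dose_tol) ** 2)
        gammas.append(math.sqrt(best))
    return gammas
```

A reference point is conventionally counted as passing when its γ value is at most 1. Each reference point is searched independently, which is what makes the calculation a candidate for one-thread-per-voxel execution.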
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
GPU-based fast gamma index calculation.
Gu, Xuejun; Jia, Xun; Jiang, Steve B
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of γ-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computationally intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on one single NVIDIA Tesla C1060 card, achieving 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on CPU is proportional to the summation of γ^n over all voxels, while that on GPU is affected by γ^n distributions and is approximately proportional to the γ^n summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference and distance-to-agreement criteria also have an impact on γ-index calculation time. PMID:21317484
Accelerated GPU based SPECT Monte Carlo simulations
NASA Astrophysics Data System (ADS)
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency
GPU Implementation of Preconditioning Method for Low-Speed Flows
NASA Astrophysics Data System (ADS)
Zhang, Jiale; Chen, Hongquan
2016-06-01
An improved preconditioning method for low-Mach-number flows is implemented on a GPU platform. The improved preconditioning method uses the fluctuations of the fluid variables to reduce the loss of accuracy caused by truncation error. The GPU parallel computing platform is used to accelerate the calculations. Details of both the improved preconditioning method and the GPU implementation technology are described in this paper. A set of typical low-speed flow cases is then simulated for both validation and performance analysis of the resulting GPU solver. Numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that the GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate CFD simulations of low-speed flows.
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison
Ma, Chao; Wang, Lirong; Xie, Xiang-Qun
2012-01-01
Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times faster than the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than an existing GPU-accelerated sparse vector algorithm when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
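For binary fingerprints, the Tanimoto coefficient reduces to two bit counts: the bits set in both fingerprints divided by the bits set in either. The sketch below shows this on the CPU with fingerprints stored as Python integers. The helper names are hypothetical, and the abstract does not say which GPU intrinsics the authors used; a population-count instruction such as CUDA's `__popc` is only one plausible candidate.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    union = popcount(a | b)
    # Two empty fingerprints have no bits to compare; return 0.0 by convention here.
    return popcount(a & b) / union if union else 0.0

def nearest_neighbor(query, library):
    """Index of the library fingerprint most similar to the query."""
    return max(range(len(library)), key=lambda i: tanimoto(query, library[i]))
```

For example, `tanimoto(0b1100, 0b1010)` shares one set bit out of three distinct set bits, giving 1/3. In an all-vs-all comparison every pair is independent, so each coefficient can be assigned to its own thread.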
Blind detection of giant pulses: GPU implementation
NASA Astrophysics Data System (ADS)
Ait-Allal, Dalal; Weber, Rodolphe; Dumez-Viou, Cédric; Cognard, Ismael; Theureau, Gilles
2012-01-01
Radio astronomical pulsar observations require specific instrumentation and dedicated signal processing to cope with the dispersion caused by the interstellar medium. Moreover, the quality of observations can be limited by radio frequency interference (RFI) generated by telecommunications activity. This article presents the innovative pulsar instrumentation based on graphical processing units (GPU) which has been designed at the Nançay Radio Astronomical Observatory. In addition, for giant pulse search, we propose a new approach which combines a hardware-efficient search method and some RFI mitigation capabilities. Although this approach is less sensitive than the classical approach, its advantage is that no a priori information on the pulsar parameters is required. The validation of a GPU implementation is under way.
Solving global optimization problems on GPU cluster
NASA Astrophysics Data System (ADS)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper presents the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows multidimensional problems to be solved by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in the research is the use of several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal test class GKLS on the Lobachevsky supercomputer using tens of thousands of GPU cores.
GPU-based video motion magnification
NASA Astrophysics Data System (ADS)
Domżał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people to see otherwise invisible subtle changes in the surrounding world. With a modified version of the algorithm, VMM is also capable of hiding them. It is possible to magnify motion related to the breathing of patients in a hospital in order to observe it, or to suppress it and extract other information, for example blood flow, from the stabilized image sequence. In both cases we would like to perform the calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In this article we suggest that the VMM algorithm can be parallelized (each thread processes one pixel), and to prove this we implemented the algorithm on a GPU using CUDA technology. The CPU is used only to grab, write and display frames and to schedule work for the GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. In this work we present an approach that achieves a significant speedup over existing methods and allows VMM to process video in real time. This solution can be used as preprocessing for other algorithms in more complex systems or can find application wherever real-time motion magnification would be useful. It is worth mentioning that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer images are traditionally decompressed on the CPU. However, this is too slow when the images become large, for example 2K×2K×16bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's graphics processing units (GPUs) that support the CUDA architecture. The decoding procedure can be divided into three parts: the first is a serial part, the second is a task-parallel part, and the last is a data-parallel part including inverse quantization, the inverse discrete wavelet transform (IDWT) and image post-processing. To reduce the execution time, the task-parallel part is optimized with OpenMP techniques. The data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization and texture memory optimization. In particular, the IDWT can be sped up significantly by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the serial CPU method.
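The idea of expressing a 2D inverse wavelet transform as independent 1D passes can be illustrated with a single-level Haar transform. The abstract does not name the wavelet, so Haar is an assumption chosen for brevity, as are all function names; this is a CPU sketch, not the authors' CUDA code.

```python
def haar_forward_1d(x):
    """One Haar level on an even-length list: averages first, then details."""
    half = len(x) // 2
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    det = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return avg + det

def haar_inverse_1d(c):
    """Invert haar_forward_1d: each (average, detail) pair restores two samples."""
    half = len(c) // 2
    out = []
    for a, d in zip(c[:half], c[half:]):
        out += [a + d, a - d]
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def haar_forward_2d(img):
    """Separable 2D transform: 1D pass over rows, then over columns."""
    rows = [haar_forward_1d(r) for r in img]
    return transpose([haar_forward_1d(c) for c in transpose(rows)])

def haar_inverse_2d(coeffs):
    """Undo the column pass, then the row pass; every 1D pass is independent."""
    cols = transpose([haar_inverse_1d(c) for c in transpose(coeffs)])
    return [haar_inverse_1d(r) for r in cols]
```

Because each row or column is processed by its own 1D call with no shared state, the loops over rows and columns could be distributed across threads, which is the property a parallel 1D formulation relies on.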
Non-rigid multi-modal registration on the GPU
NASA Astrophysics Data System (ADS)
Vetter, Christoph; Guetter, Christoph; Xu, Chenyang; Westermann, Rüdiger
2007-03-01
Non-rigid multi-modal registration of images/volumes is becoming increasingly necessary in many medical settings. While efficient registration algorithms have been published, the speed of the solutions is a problem in clinical applications. Harnessing the computational power of the graphics processing unit (GPU) for general purpose computations has become increasingly popular in order to speed up algorithms further, but the algorithms have to be adapted to the data-parallel, streaming model of the GPU. This paper describes the implementation of a non-rigid, multi-modal registration using mutual information and the Kullback-Leibler divergence between observed and learned joint intensity distributions. The entire registration process is implemented on the GPU, including a GPU-friendly computation of two-dimensional histograms using vertex texture fetches as well as an implementation of recursive Gaussian filtering on the GPU. Since the computation is performed on the GPU, interactive visualization of the registration process can be done without bus transfer between main memory and video memory. This allows the user to observe the registration process and to evaluate the result more easily. Two hybrid approaches distributing the computation between the GPU and CPU are discussed. The first approach uses the CPU for lower resolutions and the GPU for higher resolutions; the second approach uses the GPU to compute a first approximation to the registration that is used as the starting point for registration on the CPU using double precision. The results of the CPU implementation are compared to the different approaches using the GPU regarding speed as well as image quality. The GPU performs up to 5 times faster per iteration than the CPU implementation.
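Mutual information between two images is computed from their two-dimensional joint intensity histogram. The sketch below does this on the CPU for two flattened images whose intensities are already binned; the function name and input layout are assumptions, and it does not reproduce the vertex-texture-fetch histogram technique mentioned in the abstract.

```python
import math
from collections import Counter

def mutual_information(img_a, img_b):
    """Mutual information (in nats) of two equally sized, pre-binned images.

    img_a, img_b -- flat sequences of bin indices, one per pixel.
    """
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))   # 2D joint histogram
    hist_a = Counter(img_a)              # marginal histograms
    hist_b = Counter(img_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions.
        ratio = count * n / (hist_a[a] * hist_b[b])
        mi += p_ab * math.log(ratio)
    return mi
```

Identical images give a mutual information equal to the image entropy, while statistically independent intensities give zero, so a registration loop would adjust the deformation to increase this value.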
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shape optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problems are very complex and need a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment due to its simplicity of programming and good computational performance. We used finite difference discretization of the Navier-Stokes equations, where the fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of local, surrounding cells; therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to efficiently use the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation using the example of matrix multiplication.
GPU-based High-Performance Computing for Radiation Therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-01-01
Recent developments in radiotherapy demand high computation power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid development. A large number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. PMID:24486639
Evaluating the power of GPU acceleration for IDW interpolation algorithm.
Mei, Gang
2014-01-01
We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm, the tiled version that takes advantage of shared memory and the CDP version that is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for the IDW interpolation algorithm by comparing the performance of a CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental results show that the tiled version achieves speedups of 120x and 670x over the CPU version when the power parameter p is set to 2 and 3.0, respectively. In addition, compared to the naive GPU implementation, the tiled version is about two times faster. However, the CDP version is 4.8x ∼ 6.0x slower than the naive GPU version, and therefore does not have any potential advantages in practical applications. PMID:24707195
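For readers unfamiliar with IDW, the serial form of the algorithm being accelerated can be sketched as follows. This is a plain CPU illustration under simple assumptions (2D points, Euclidean distance, every known point contributes); the function name `idw_interpolate` and the sample data are hypothetical, and none of the tiling or CDP optimizations from the paper are reproduced.

```python
import math

def idw_interpolate(known_points, query, power=2.0):
    """Inverse Distance Weighting at one query location.

    known_points: list of (x, y, value) tuples.
    The result is a weighted mean of the known values with
    weights 1 / distance**power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known_points:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            return value  # query coincides with a data point
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two data points; the midpoint is equally far from both.
points = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
midpoint_value = idw_interpolate(points, (1.0, 0.0))
```

Each query point is independent of the others, which is why a naive GPU port can assign one thread per query; the tiled version additionally stages blocks of known points in shared memory.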
GPU-based ultrafast IMRT plan optimization
NASA Astrophysics Data System (ADS)
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B.
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purposes. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm² beamlet size and 2.5 × 2.5 × 2.5 mm³ voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy.
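The solver named in the abstract above, gradient projection with Armijo's line search on a quadratic penalty objective, can be sketched in a few lines. The sketch below makes several assumptions that are not from the paper: the objective is the plain least-squares penalty 0.5·||A x − d||², the only constraint is non-negativity of the beamlet intensities x, and the matrix A, the target d and all function names are hypothetical toy values.

```python
def objective(A, d, x):
    # f(x) = 0.5 * sum_i (A_i . x - d_i)^2, a quadratic penalty on deviation
    return 0.5 * sum((sum(a * xj for a, xj in zip(row, x)) - di) ** 2
                     for row, di in zip(A, d))

def gradient(A, d, x):
    # grad f(x) = A^T (A x - d)
    residual = [sum(a * xj for a, xj in zip(row, x)) - di
                for row, di in zip(A, d)]
    return [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]

def project(x):
    # Projection onto the feasible set x >= 0
    return [max(0.0, xj) for xj in x]

def gradient_projection(A, d, x0, iters=200, beta=0.5, sigma=1e-4):
    x = project(x0)
    for _ in range(iters):
        g = gradient(A, d, x)
        fx = objective(A, d, x)
        step = 1.0
        while True:
            x_new = project([xj - step * gj for xj, gj in zip(x, g)])
            # Armijo rule along the projection arc:
            # f(x_new) <= f(x) - sigma * g . (x - x_new)
            decrease = sum(gj * (xj - xn) for gj, xj, xn in zip(g, x, x_new))
            if objective(A, d, x_new) <= fx - sigma * decrease or step < 1e-12:
                break
            step *= beta  # shrink the step until sufficient decrease holds
        x = x_new
    return x

# Toy problem: the unconstrained minimizer is (1, -1), so the
# non-negativity constraint is active in the second coordinate.
A = [[1.0, 0.0], [0.0, 1.0]]
d = [1.0, -1.0]
solution = gradient_projection(A, d, [0.0, 0.0])
```

The matrix-vector products in `objective` and `gradient` dominate the cost for realistic dose matrices, which is the part a CUDA implementation would parallelize.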
GPU-based ultrafast IMRT plan optimization.
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purposes. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm² beamlet size and 2.5 × 2.5 × 2.5 mm³ voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy. PMID:19826201
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
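The "tape" mentioned in the abstract above can be made concrete with a very small scalar example. The sketch below is not the authors' AAD software; it is a minimal operator-overloading illustration in which every operation records its local partial derivatives on a tape and a reverse sweep accumulates adjoints. All class and function names (`Tape`, `Var`, `backpropagate`) are hypothetical.

```python
import math

class Tape:
    """Records, for each variable, its parents and local partial derivatives."""
    def __init__(self):
        self.entries = []  # entries[i] = list of (parent_index, partial)

    def new_variable(self, value, parents=()):
        self.entries.append(list(parents))
        return Var(self, len(self.entries) - 1, value)

class Var:
    def __init__(self, tape, index, value):
        self.tape, self.index, self.value = tape, index, value

    def __add__(self, other):
        return self.tape.new_variable(
            self.value + other.value,
            [(self.index, 1.0), (other.index, 1.0)])

    def __mul__(self, other):
        return self.tape.new_variable(
            self.value * other.value,
            [(self.index, other.value), (other.index, self.value)])

def sin(x):
    return x.tape.new_variable(math.sin(x.value),
                               [(x.index, math.cos(x.value))])

def backpropagate(tape, output):
    """Reverse sweep over the tape: propagate adjoints from output to inputs."""
    adjoints = [0.0] * len(tape.entries)
    adjoints[output.index] = 1.0
    for i in reversed(range(len(tape.entries))):
        for parent, partial in tape.entries[i]:
            adjoints[parent] += partial * adjoints[i]
    return adjoints

tape = Tape()
x = tape.new_variable(2.0)
y = tape.new_variable(3.0)
cost = x * y + sin(x)          # cost(x, y) = x*y + sin(x)
grad = backpropagate(tape, cost)
dcost_dx, dcost_dy = grad[x.index], grad[y.index]
```

Here every scalar operation adds one tape entry, which shows why memory grows quickly. Treating whole vector and matrix operations as single intrinsic entries, as the abstract describes, keeps the tape short and lets each entry be evaluated by a parallel library call.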
GPU-Accelerated Adjoint Algorithmic Differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2015-01-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography
CFD Computations on Multi-GPU Configurations.
NASA Astrophysics Data System (ADS)
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations, often delivering a speed-up factor of 3 to 5 times over conventional CPUs. In recent times, most PCs are supplied with the option of installing multiple GPUs on a single motherboard, thereby providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a setup which is configured to run two GPUs in parallel, and discuss its performance details.
GPU-completeness: theory and implications
NASA Astrophysics Data System (ADS)
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate this algorithmic space and the imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers. PMID:25202164
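The linear solver described in the abstract above is a preconditioned conjugate gradient method. The sketch below shows the structure of that iteration on a tiny dense system. As a loudly stated simplification, it swaps in a Jacobi (diagonal) preconditioner in place of the geometry-informed algebraic multigrid preconditioner from the paper, and it uses plain Python lists rather than sparse GPU data structures; the function names and the test matrix are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preconditioned_cg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Assumption: Jacobi preconditioner M = diag(A), standing in for AMG.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]   # z = M^-1 r
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# 1D Poisson-like stiffness matrix, as produced by a simple FEM discretization.
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x = preconditioned_cg(A, b)
```

The matrix-vector product and the dot products are the operations that map onto fine-grained GPU parallelism; the preconditioner application is where the multigrid cycle would replace the diagonal scaling used here.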
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of MrBayes (MC)³ on the compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all the previous (MC)³ algorithms and scales well to large GPU clusters. PMID:23493260
Efficient Implementation of MrBayes on Multi-GPU
Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-01-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of MrBayes (MC)³ on the compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new “node-by-node” task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all the previous (MC)³ algorithms and scales well to large GPU clusters. PMID:23493260
Architecting the Finite Element Method Pipeline for the GPU
Fu, Zhisong; Lewis, T. James; Kirby, Robert M.
2014-01-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers. PMID:25202164
Parallel hyperspectral compressive sensing method on GPU
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. The bandwidth of the connection between the satellite/airborne platform and the ground station is known to be limited, so an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, the GeForce GTX 590 and the GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.
Synthetic aperture elastography: a GPU based approach
NASA Astrophysics Data System (ADS)
Verma, Prashant; Doyley, Marvin M.
2014-03-01
Synthetic aperture (SA) ultrasound imaging system produces highly accurate axial and lateral displacement estimates; however, low frame rates and large data volumes can hamper its clinical use. This paper describes a real-time SA imaging based ultrasound elastography system that we have recently developed to overcome this limitation. In this system, we implemented both beamforming and 2D cross-correlation echo tracking on Nvidia GTX 480 graphics processing unit (GPU). We used one thread per pixel for beamforming; whereas, one block per pixel was used for echo tracking. We compared the quality of elastograms computed with our real-time system relative to those computed using our standard single threaded elastographic imaging methodology. In all studies, we used conventional measures of image quality such as elastographic signal to noise ratio (SNRe). Specifically, SNRe of axial and lateral strain elastograms computed with real-time system were 36 dB and 23 dB, respectively, which was numerically equal to those computed with our standard approach. We achieved a frame rate of 6 frames per second using our GPU based approach for 16 transmits and kernel size of 60 × 60 pixels, which is 400 times faster than that achieved using our standard protocol.
A GPU accelerated PDF transparency engine
NASA Astrophysics Data System (ADS)
Recker, John; Lin, I.-Jong; Tastl, Ingeborg
2011-01-01
As commercial printing presses become faster, cheaper and more efficient, so too must the Raster Image Processors (RIP) that prepare data for them to print. Digital press RIPs, however, have been challenged to on the one hand meet the ever increasing print performance of the latest digital presses, and on the other hand process increasingly complex documents with transparent layers and embedded ICC profiles. This paper explores the challenges encountered when implementing a GPU accelerated driver for the open source Ghostscript Adobe PostScript and PDF language interpreter targeted at accelerating PDF transparency for high speed commercial presses. It further describes our solution, including an image memory manager for tiling input and output images and documents, a PDF compatible multiple image layer blending engine, and a GPU accelerated ICC v4 compatible color transformation engine. The result, we believe, is the foundation for a scalable, efficient, distributed RIP system that can meet current and future RIP requirements for a wide range of commercial digital presses.
GPU-Accelerated Denoising in 3D (GD3D)
2013-10-01
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. This software addresses two facets of this promising application: what tuning is necessary to achieve optimal performance on a modern GPU? And what parameters yield the best denoising results in practice? To answer the first question, the software performs an autotuning step to empirically determine optimal memory blocking on the GPU. To answer the second, it performs a sweep of algorithm parameters to determine the combination that best reduces the mean squared error relative to a noiseless reference image.
Parallelization and checkpointing of GPU applications through program transformation
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
NASA Astrophysics Data System (ADS)
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speed up the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of its fragmented data exchange. Our approach makes 3D decomposition practical on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to a common 2D decomposition MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs.
GPU-assisted computation of centroidal Voronoi tessellation.
Rong, Guodong; Liu, Yang; Wang, Wenping; Yin, Xiaotian; Gu, Xianfeng David; Guo, Xiaohu
2011-03-01
Centroidal Voronoi tessellations (CVT) are widely used in computational science and engineering. The most commonly used method for computing them is Lloyd's method, and the L-BFGS method has recently been shown to be faster than Lloyd's method for computing the CVT. However, these methods run on the CPU and are still too slow for many practical applications. We present techniques to implement these methods on the GPU for computing the CVT on 2D planes and on surfaces, and demonstrate significant speedup of these GPU-based methods over their CPU counterparts. For CVT computation on a surface, we use a geometry image stored in the GPU to represent the surface for computing the Voronoi diagram on it. In our implementation a new technique is proposed for parallel regional reduction on the GPU for evaluating integrals over Voronoi cells. PMID:21233516
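Lloyd's method, named in the abstract above, alternates between building the Voronoi cells of the current generators and moving each generator to the centroid of its cell. The sketch below is a deliberately reduced illustration: it works in one dimension on the interval [0, 1] with uniform density, where cells are intervals and centroids are midpoints, so no integration or GPU reduction is needed. The function name `lloyd_1d` and the starting points are hypothetical.

```python
def lloyd_1d(generators, iterations=100, lo=0.0, hi=1.0):
    """Lloyd's method on the interval [lo, hi] with uniform density."""
    gens = sorted(generators)
    for _ in range(iterations):
        # Voronoi cell boundaries in 1D are the midpoints between
        # neighbouring generators, plus the two ends of the domain.
        bounds = ([lo]
                  + [(a + b) / 2.0 for a, b in zip(gens, gens[1:])]
                  + [hi])
        # With uniform density the centroid of an interval is its midpoint.
        gens = [(left + right) / 2.0
                for left, right in zip(bounds, bounds[1:])]
    return gens

# Two generators started close together spread out to 1/4 and 3/4.
result = lloyd_1d([0.1, 0.2])
```

In 2D and on surfaces the centroid step requires integrating over each Voronoi cell, which is the regional reduction the abstract refers to.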
Computing prestack Kirchhoff time migration on general purpose GPU
NASA Astrophysics Data System (ADS)
Shi, Xiaohua; Li, Chuang; Wang, Shihu; Wang, Xu
2011-10-01
This paper introduces how to optimize a practical prestack Kirchhoff time migration program by the Compute Unified Device Architecture (CUDA) on a general purpose GPU (GPGPU). A few useful optimization methods on GPGPU are demonstrated, such as how to increase the kernel thread numbers on GPU cores, and how to utilize the memory streams to overlap GPU kernel execution time, etc. The floating-point errors on CUDA and NVidia's GPUs are discussed in detail. Some effective methods that can be used to reduce the floating-point errors are introduced. The images generated by the practical prestack Kirchhoff time migration programs for the same real-world seismic data inputs on CPU and GPU are demonstrated. The final GPGPU approach on NVidia GTX 260 is more than 17 times faster than its original CPU version on Intel's P4 3.0G.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. Computational performance is an important issue when analyzing such huge data. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. A single GPU is hard to rely on when dealing with big biological data. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method improves on the performance of BLASTP on a single GPU and also achieves high availability and fault tolerance. PMID:24955362
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
NASA Astrophysics Data System (ADS)
Takaishi, Tetsuya
2015-01-01
The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time of the HMC algorithm on a GPU (GTX 760) and a CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
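The core of the HMC algorithm mentioned here is a leapfrog integration followed by a Metropolis accept/reject step. The sketch below shows that structure for a single scalar variable with a standard normal target, which is only a stand-in assumption and not the RSV posterior; `leapfrog`, `hmc_step`, `potential` and `grad_potential` are hypothetical names.

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme."""
    p = p - 0.5 * step * grad_u(q)        # initial half step in momentum
    for i in range(n_steps):
        q = q + step * p                  # full step in position
        if i < n_steps - 1:
            p = p - step * grad_u(q)      # full step in momentum
    p = p - 0.5 * step * grad_u(q)        # final half step in momentum
    return q, p


def hmc_step(q, u, grad_u, step, n_steps, rng):
    """One HMC update: draw a momentum, integrate, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, grad_u, step, n_steps)
    h_old = u(q) + 0.5 * p0 * p0
    h_new = u(q_new) + 0.5 * p_new * p_new
    # Metropolis test on the change in total energy.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


def potential(q):
    """Negative log-density of a standard normal (assumed toy target)."""
    return 0.5 * q * q


def grad_potential(q):
    return q


rng = random.Random(0)
q = 0.0
samples = []
for _ in range(1000):
    q = hmc_step(q, potential, grad_potential, 0.2, 10, rng)
    samples.append(q)
```

In a model with many latent volatility variables, the position and momentum updates act on every component at once, which is presumably the part that maps well to GPU threads.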
Fast CGH computation using S-LUT on GPU.
Pan, Yuechao; Xu, Xuewu; Solanki, Sanjeev; Liang, Xinan; Tanjung, Ridwan Bin Adrian; Tan, Chiwei; Chong, Tow-Chong
2009-10-12
In the computation of full-parallax computer-generated holograms (CGH), the balance between speed and memory usage is always at the core of algorithm development. To solve the speed problem of the coherent ray trace (CRT) algorithm and the memory problem of the look-up table (LUT) algorithm without sacrificing reconstructed object quality, we develop a novel algorithm with split look-up tables (S-LUT) and implement it on a graphics processing unit (GPU). Our results show that S-LUT on GPU is the fastest of all the algorithms investigated in this paper, while still maintaining low memory usage. We also demonstrate high quality objects reconstructed from CGHs computed with S-LUT on GPU. The GPU implementation of our new algorithm may enable real-time and interactive holographic 3D display in the future. PMID:20372585
GPU-based calculations in digital holography
NASA Astrophysics Data System (ADS)
Madrigal, R.; Acebal, P.; Blaya, S.; Carretero, L.; Fimia, A.; Serrano, F.
2013-05-01
In this work we apply GPUs (graphics processing units) with the CUDA environment to scientific calculations, specifically to computationally expensive problems in digital holography. To this end, we have studied three typical problems in digital holography: Fourier transforms, Fresnel reconstruction of the hologram, and the calculation of the vectorial diffraction integral. In all cases the runtime at different image sizes and the corresponding accuracy were compared to those obtained by traditional calculation systems. The programs were run on a computer with a latest-generation graphics card, an Nvidia GTX 680, which is optimized for integer calculations. As a result, a large reduction in runtime has been obtained: 15-fold shorter times for Fresnel approximation calculations and 600-fold for the vectorial diffraction integral. These initial results open the possibility of applying this kind of calculation in real-time digital holography.
GPU-accelerated micromagnetic simulations using cloud computing
NASA Astrophysics Data System (ADS)
Jermain, C. L.; Rowlands, G. E.; Buhrman, R. A.; Ralph, D. C.
2016-03-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
STEM image simulation with hybrid CPU/GPU programming.
Yao, Y; Ge, B H; Shen, X; Wang, Y G; Yu, R C
2016-07-01
STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. PMID:27093687
NASA Astrophysics Data System (ADS)
Tavares, E. T., Jr.; Klafke, J. C.
2003-08-01
This work sets out to recover an experience that took place at the São Paulo Planetarium in the 1960s. In 1962, Mr. Acácio, then 37 years old and visually impaired since the age of 27, began attending the classes given by Prof. Aristóteles Orsini to the Planetarium staff. Mr. Acácio was the only person with a disability in the class and, although he had basic and relatively advanced knowledge of mathematics, he had difficulty understanding and following the lectures, as well as in later study. To help him overcome these problems, Prof. Orsini requested the construction of mechanical models that, through the sense of touch, would allow him to follow the classes and transpose the model into a mental "construct". This practice proved so effective that it greatly facilitated his learning of the subject. Mr. Acácio went on to join the teaching staff of the Planetarium/Municipal School of Astrophysics, where he was responsible for the "Introduction to Astronomy" course for several years. Moreover, the experience was so successful that some of the models had their constituent elements painted distinctively for use in the Planetarium's regular courses, becoming an integral part of the institution's set of teaching resources. It is with this effectiveness in mind, both in its original aim (enabling a visually impaired person to learn) and in its subsidiary one (a systematic teaching resource for the institution), that we decided to recover this experience. Building on it, we believe it would be extremely productive, in educational terms, to improve the original models, now recovered and restored, and to create others that could be used to teach this science to visually impaired people.
gEMfitter: a highly parallel FFT-based 3D density fitting tool with GPU texture memory acceleration.
Hoang, Thai V; Cavin, Xavier; Ritchie, David W
2013-11-01
Fitting high resolution protein structures into low resolution cryo-electron microscopy (cryo-EM) density maps is an important technique for modeling the atomic structures of very large macromolecular assemblies. This article presents "gEMfitter", a highly parallel fast Fourier transform (FFT) EM density fitting program which can exploit the special hardware properties of modern graphics processor units (GPUs) to accelerate both the translational and rotational parts of the correlation search. In particular, by using the GPU's special texture memory hardware to rotate 3D voxel grids, the cost of rotating large 3D density maps is almost completely eliminated. Compared to performing 3D correlations on one core of a contemporary central processor unit (CPU), running gEMfitter on a modern GPU gives up to 26-fold speed-up. Furthermore, using our parallel processing framework, this speed-up increases linearly with the number of CPUs or GPUs used. Thus, it is now possible to use routinely more robust but more expensive 3D correlation techniques. When tested on low resolution experimental cryo-EM data for the GroEL-GroES complex, we demonstrate the satisfactory fitting results that may be achieved by using a locally normalised cross-correlation with a Laplacian pre-filter, while still being up to three orders of magnitude faster than the well-known COLORES program. PMID:24060989
Advantages of GPU technology in DFT calculations of intercalated graphene
NASA Astrophysics Data System (ADS)
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
GPU-based Parallel Application Design for Emerging Mobile Devices
NASA Astrophysics Data System (ADS)
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
Finite Difference Elastic Wave Field Simulation On GPU
NASA Astrophysics Data System (ADS)
Hu, Y.; Zhang, W.
2011-12-01
Numerical modeling of seismic wave propagation is a basic and important part of investigating the Earth's structure and earthquake phenomena. Among the various numerical methods, the finite-difference method is considered one of the most efficient tools for wave field simulation. However, as the scale of computation grows, computing power has become a bottleneck. With the development of hardware in recent years, the GPU has shown powerful computational ability and bright application prospects in scientific computing, as many GPU-based works demonstrate. So far, however, the GPU has not been widely used in wave field simulation. In this work, we present forward finite-difference simulation of acoustic and elastic seismic wave propagation in heterogeneous media on NVIDIA graphics cards with the CUDA programming language. We also implement perfectly matched layers on the graphics cards to efficiently absorb outgoing waves at the fictitious edges of the grid. Comparison with results on a CPU platform shows reliable accuracy and remarkable efficiency. This work shows that the GPU can be an effective platform for wave field simulation, and that it can also be used as a practical tool for real-time strong ground motion simulation.
A survey of CPU-GPU heterogeneous computing techniques
Mittal, Sparsh; Vetter, Jeffrey S.
2015-07-04
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
Performing efficient NURBS modeling operations on the GPU.
Krishnamurthy, Adarsh; Khardekar, Rahul; McMains, Sara; Haller, Kirk; Elber, Gershon
2009-01-01
We present algorithms for evaluating and performing modeling operations on NURBS surfaces using the programmable fragment processor on the Graphics Processing Unit (GPU). We extend our GPU-based NURBS evaluator that evaluates NURBS surfaces to compute exact normals for either standard or rational B-spline surfaces for use in rendering and geometric modeling. We build on these calculations in our new GPU algorithms to perform standard modeling operations such as inverse evaluations, ray intersections, and surface-surface intersections on the GPU. Our modeling algorithms run in real time, enabling the user to sketch on the actual surface to create new features. In addition, the designer can edit the surface by interactively trimming it without the need for retessellation. Our GPU-accelerated algorithm to perform surface-surface intersection operations with NURBS surfaces can output intersection curves in the model space as well as in the parametric spaces of both the intersecting surfaces at interactive rates. We also extend our surface-surface intersection algorithm to evaluate self-intersections in NURBS surfaces. PMID:19423879
Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; Kowalski, Karol; Agrawal, Gagan
2013-03-01
Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. With its O(N^2 log N) time complexity, it provides a faster alternative to spatial-domain volume rendering algorithms, which have O(N^3) computational complexity. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) has become an attractive and capable platform that can deliver far more raw computational power than the central processing unit (CPU) on a per-dollar basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures. PMID:25866499
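The Fourier projection-slice theorem that this abstract relies on can be checked numerically in a lower-dimensional analogue: the 1D transform of a 2D image's projection equals the central line of the image's 2D transform. The sketch below does this with a naive DFT on a small made-up 4x4 image; it is an assumption-laden illustration (2D instead of 3D, direct DFT instead of FFT), and `dft1`, `dft2` and the pixel values are hypothetical.

```python
import cmath


def dft1(signal):
    """Naive 1D discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                for x in range(n))
            for k in range(n)]


def dft2(image):
    """Naive 2D DFT of image[y][x], returned as spectrum[ky][kx]."""
    rows = [dft1(row) for row in image]             # transform along x
    cols = [dft1(list(col)) for col in zip(*rows)]  # transform along y
    return [list(r) for r in zip(*cols)]            # back to [ky][kx]


# Made-up 4x4 "volume slice".
image = [
    [1, 2, 0, 3],
    [0, 1, 4, 1],
    [2, 0, 1, 0],
    [1, 3, 2, 2],
]

# Spatial-domain route: integrate along y, then transform the projection.
projection = [sum(col) for col in zip(*image)]
projection_spectrum = dft1(projection)

# Frequency-domain route: take the ky = 0 line of the 2D spectrum.
central_slice = dft2(image)[0]

max_diff = max(abs(a - b) for a, b in zip(projection_spectrum, central_slice))
print(max_diff)
```

In the 3D case the same identity lets a renderer extract a 2D plane from the volume's 3D spectrum and inverse-transform it to obtain the projection, which is where the O(N^2 log N) cost quoted above comes from.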
GPU Lossless Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO- 42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
GISAXS simulation and analysis on GPU clusters
NASA Astrophysics Data System (ADS)
Chourou, Slim; Sarje, Abhinav; Li, Xiaoye; Chan, Elaine; Hexemer, Alexander
2012-02-01
We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.
IMPAIR: massively parallel deconvolution on the GPU
NASA Astrophysics Data System (ADS)
Sherry, Michael; Shearer, Andy
2013-02-01
The IMPAIR software is a high throughput image deconvolution tool for processing large out-of-core datasets of images, varying from large images with spatially varying PSFs to large numbers of images with spatially invariant PSFs. IMPAIR implements a parallel version of the tried and tested Richardson-Lucy deconvolution algorithm regularised via a custom wavelet thresholding library. It exploits the inherently parallel nature of the convolution operation to achieve quality results on consumer grade hardware: through the NVIDIA Tesla GPU implementation, the multi-core OpenMP implementation, and the cluster computing MPI implementation of the software. IMPAIR aims to address the problem of parallel processing in both top-down and bottom-up approaches: by managing the input data at the image level, and by managing the execution at the instruction level. These combined techniques will lead to a scalable solution with minimal resource consumption and maximal load balancing. IMPAIR is being developed as both a stand-alone tool for image processing, and as a library which can be embedded into non-parallel code to transparently provide parallel high throughput deconvolution.
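For orientation, the Richardson-Lucy update at the heart of this tool can be written in a few lines. The sketch below is a plain 1D version with periodic boundaries and no regularisation, so it omits the wavelet thresholding and all of the parallel machinery the abstract describes; the signal, the point spread function and the function names are made up for the example.

```python
def convolve(signal, psf):
    """Circular convolution of two equal-length sequences."""
    n = len(signal)
    return [sum(signal[j] * psf[(i - j) % n] for j in range(n))
            for i in range(n)]


def correlate(signal, psf):
    """Circular correlation, the adjoint of convolve for a fixed psf."""
    n = len(signal)
    return [sum(signal[i] * psf[(i - j) % n] for i in range(n))
            for j in range(n)]


def richardson_lucy(observed, psf, iterations=50):
    """Unregularised Richardson-Lucy deconvolution with periodic boundaries."""
    n = len(observed)
    estimate = [sum(observed) / n] * n   # flat, positive starting guess
    for _ in range(iterations):
        blurred = convolve(estimate, psf)
        ratio = [o / b if b > 0 else 0.0 for o, b in zip(observed, blurred)]
        correction = correlate(ratio, psf)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


# Hypothetical test data: two point sources and a PSF that sums to one.
truth = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 1.0, 0.0]
psf = [0.6, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2]
observed = convolve(truth, psf)
restored = richardson_lucy(observed, psf)
```

Each iteration is two convolutions and two element-wise operations, which is the inherently parallel structure the abstract refers to. With a normalised PSF the update also preserves the total flux of the data.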
Adaptive mesh fluid simulations on GPU
NASA Astrophysics Data System (ADS)
Wang, Peng; Abel, Tom; Kaehler, Ralf
2010-10-01
We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.
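The combination of piecewise linear reconstruction and second-order TVD Runge-Kutta time stepping can be illustrated on the simplest conservation law, linear advection u_t + u_x = 0 on a periodic grid. The sketch below makes several assumptions that are not in the abstract: a minmod limiter, a unit advection speed (so the upwind flux stands in for the Riemann solver), and a CFL number of 0.4; all names are hypothetical.

```python
def minmod(a, b):
    """Limiter: zero at extrema, otherwise the smaller-magnitude slope."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def rhs(u, cfl):
    """Return dt * L(u) for u_t + u_x = 0 on a periodic grid (cfl = dt/dx)."""
    n = len(u)
    slope = [minmod(u[i] - u[i - 1], u[(i + 1) % n] - u[i]) for i in range(n)]
    # Upwind (left) state at the right face of each cell; u[-1] wraps around.
    face = [u[i] + 0.5 * slope[i] for i in range(n)]
    return [-cfl * (face[i] - face[i - 1]) for i in range(n)]


def tvd_rk2_step(u, cfl):
    """One step of the second-order TVD Runge-Kutta scheme."""
    k1 = rhs(u, cfl)
    u1 = [a + b for a, b in zip(u, k1)]              # forward Euler stage
    k2 = rhs(u1, cfl)
    return [0.5 * a + 0.5 * (b + c) for a, b, c in zip(u, u1, k2)]


def total_variation(u):
    return sum(abs(u[i] - u[i - 1]) for i in range(len(u)))


# Square pulse advected for 100 steps.
u0 = [1.0 if 10 <= i < 20 else 0.0 for i in range(50)]
u = u0
for _ in range(100):
    u = tvd_rk2_step(u, 0.4)
```

Every cell update reads only its immediate neighbours, so one thread per cell is a natural mapping. For a system such as the Euler or MHD equations the face flux would come from a Riemann solver instead of the simple upwind choice used here.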
Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we use a parallel GPU to implement a widely used compressed sensing algorithm, the linear Bregman algorithm. The linear Bregman iterative algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only vector and matrix multiplication and a thresholding operation, and is simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with the traditional CPU implementation of the Bregman algorithm. We also compare the parallel Bregman algorithm with other CS reconstruction algorithms, such as the OMP and TwIST algorithms. The results show that the parallel Bregman algorithm needs less time than these two algorithms and is thus more convenient for real-time object reconstruction, which is important given people's fast-growing demand for information technology.
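The remark that the algorithm needs only matrix-vector products and thresholding can be made concrete with a small CPU sketch of the commonly stated linearized Bregman iteration, v <- v + A^T (b - A u), u <- delta * shrink(v, mu). This is not the authors' CUDA code; the step size delta = 1 / ||A||_F^2 is a conservative assumption made here, and the matrix, data and names are hypothetical.

```python
def shrink(v, mu):
    """Soft thresholding: move every entry toward zero by mu."""
    return [max(abs(x) - mu, 0.0) * (1 if x > 0 else -1) for x in v]


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a, y):
    """Compute A^T y."""
    return [sum(a[i][j] * y[i] for i in range(len(a)))
            for j in range(len(a[0]))]


def linear_bregman(a, b, mu, iterations):
    """Linearized Bregman iteration; returns the iterate and residual norms."""
    # Assumed step size: 1 / (squared Frobenius norm of A), which is no
    # larger than 1 / ||A A^T||.
    delta = 1.0 / sum(aij * aij for row in a for aij in row)
    u = [0.0] * len(a[0])
    v = [0.0] * len(a[0])
    residuals = []
    for _ in range(iterations):
        r = [bi - ai for bi, ai in zip(b, matvec(a, u))]
        residuals.append(sum(x * x for x in r) ** 0.5)
        v = [vi + gi for vi, gi in zip(v, rmatvec(a, r))]   # v += A^T r
        u = [delta * s for s in shrink(v, mu)]              # thresholding
    return u, residuals


# Hypothetical underdetermined system: 2 measurements, 4 unknowns.
a = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 1.0, 1.0]]
b = [2.0, 1.0]
u, residuals = linear_bregman(a, b, mu=0.5, iterations=300)
```

Both matrix-vector products and the element-wise thresholding are data parallel, which is why the method is a convenient candidate for a GPU port.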
GPU's for event reconstruction in the FairRoot framework
NASA Astrophysics Data System (ADS)
Al-Turany, M.; Uhlig, F.; Karabowicz, R.
2010-04-01
FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processor units (GPUs) for event reconstruction in FairRoot will be presented. Because the CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler, GPU code can be mixed with general-purpose code for the host CPU; on this basis, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
Research on GPU Acceleration for Monte Carlo Criticality Calculation
NASA Astrophysics Data System (ADS)
Xu, Qi; Yu, Ganglin; Wang, Kan
2014-06-01
The Monte Carlo neutron transport method can be naturally parallelized on multi-core architectures owing to the independence of particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular form of parallelism in scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The "neutron transport step" is introduced to increase the GPU thread occupancy. To test how sensitive the acceleration is to the complexity of the MC code, a 1D one-group code and a 3D multi-group general-purpose code are each ported to GPUs, and the acceleration effects are compared. The numerical experiments show a considerable acceleration effect from the "neutron transport step" strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs.
gPGA: GPU Accelerated Population Genetics Analyses
Zhou, Chunbao; Lang, Xianyu; Wang, Yangang; Zhu, Chaodong
2015-01-01
Background: The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. However, the computational burden of the IM program has placed limits on its application. Methodology: With its strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on Compute Unified Device Architecture (CUDA), which we call gPGA. Conclusions: Compared with the IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA. PMID:26248314
GPU Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularity in scientific computing over the past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculation, so a recent GPU card, whose peak computing performance and memory bandwidth are much better than those of a contemporary high-end CPU, is preferable. We focus here on the detailed implementation of our GPU-targeted Reynolds-averaged Navier-Stokes equations solver based on the finite-volume method. The solver employs a vertex-centered scheme on unstructured grids so that it can handle complex topologies. Multiple optimizations are carried out to improve memory access performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using an explicit Runge-Kutta scheme. The GPU-accelerated solver in this paper is demonstrated to have competitive advantages over the CPU-targeted one.
Multi-GPU kinetic solvers using MPI and CUDA
NASA Astrophysics Data System (ADS)
Zabelok, Sergey; Arslanbekov, Robert; Kolobov, Vladimir
2014-12-01
This paper describes recent progress towards porting a Unified Flow Solver (UFS) to heterogeneous parallel computing. The main challenge of porting UFS to graphics processing units (GPUs) comes from the dynamically adapted mesh, which causes irregular data access. We describe the implementation of CUDA kernels for three modules in UFS: the direct Boltzmann solver using the discrete velocity method (DVM), the DSMC module, and the Lattice Boltzmann Method (LBM) solver, all using an octree Cartesian mesh with adaptive mesh refinement (AMR). Double-digit speedup on a single GPU and good scaling for multi-GPU have been demonstrated.
Accelerating Pseudo-Random Number Generator for MCNP on GPU
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu
2010-09-01
Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) must have a long period, high quality, a flexible jump capability, and sufficient speed. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 times are achieved compared with 4- to 6-core CPUs, and that more than 679.18 million double-precision random numbers can be generated per second on the GPU.
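A jump capability lets each parallel thread start at a different, known offset in one long random sequence. For a linear congruential generator this can be done in a logarithmic number of steps by composing the update map with itself. The sketch below illustrates that idea only; the multiplier and increment are the well-known constants from Knuth's MMIX generator, used here as an assumption, and are not claimed to be the parameters used in MCNP or in the paper.

```python
MASK64 = (1 << 64) - 1
LCG_A = 6364136223846793005   # multiplier (Knuth's MMIX LCG), illustrative only
LCG_C = 1442695040888963407   # increment (Knuth's MMIX LCG), illustrative only


def lcg_next(x):
    """Advance the generator state by one step modulo 2**64."""
    return (LCG_A * x + LCG_C) & MASK64


def lcg_jump(x, k):
    """Advance state x by k steps using O(log k) multiplications."""
    acc_mult, acc_inc = 1, 0            # identity map: x -> 1*x + 0
    cur_mult, cur_inc = LCG_A, LCG_C    # map for 2**i steps, starting at one step
    while k > 0:
        if k & 1:
            # Compose the accumulated map with the current power-of-two map.
            acc_mult = (acc_mult * cur_mult) & MASK64
            acc_inc = (acc_inc * cur_mult + cur_inc) & MASK64
        # Square the current map: apply it twice.
        cur_inc = ((cur_mult + 1) * cur_inc) & MASK64
        cur_mult = (cur_mult * cur_mult) & MASK64
        k >>= 1
    return (acc_mult * x + acc_inc) & MASK64


def thread_start_state(seed, thread_id, stride):
    """Hypothetical helper: give each thread its own block of the sequence."""
    return lcg_jump(seed, thread_id * stride)
```

With a fixed stride, thread t would draw numbers from positions t*stride onward, so streams do not overlap as long as no thread consumes more than stride values.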
Numerical cosmology on the GPU with Enzo and Ramses
NASA Astrophysics Data System (ADS)
Gheller, C.; Wang, P.; Vazza, F.; Teyssier, R.
2015-09-01
A number of scientific numerical codes can currently exploit GPUs with remarkable performance. In astrophysics, Enzo and Ramses are prime examples of such applications. The two codes have been ported to GPUs adopting different strategies and programming models, Enzo adopting CUDA and Ramses using OpenACC. We describe here the different solutions used for the GPU implementation of both cases. Performance benchmarks will be presented for Ramses. The results of the usage of the more mature GPU version of Enzo, adopted for a scientific project within the CHRONOS programme, will be summarised.
Accelerating DNA analysis applications on GPU clusters
Tumeo, Antonino; Villa, Oreste
2010-06-13
DNA analysis is an emerging application of high performance bioinformatics. Modern sequencing machinery is able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges for current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limits of the underlying hardware, is required to reach the desired high throughputs. Load balancing also plays a crucial role when considering the limited bandwidth among the nodes of these systems. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with GPUs. We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
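For readers unfamiliar with Aho-Corasick, the following is a compact serial sketch of the textbook algorithm: a trie of the patterns, failure links computed breadth-first, and a single pass over the text. It is not the partitioned GPU or MPI implementation described in the abstract, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

One common way to parallelize the scan, stated here as a general idea rather than as the paper's scheme, is to split the text into chunks that overlap by the longest pattern length minus one and run the same automaton on each chunk. The work per chunk then depends on the number of matches, which is one source of the performance variability the abstract mentions.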
Full Stokes glacier model on GPU
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2015-04-01
Two different approaches are commonly used in glacier ice flow modeling: models based on asymptotic approximations of ice physics and full Stokes models. Lower order models are computationally lighter but reach their limits in regions of complex flow, while full Stokes models are more exact but computationally expensive. To overcome this constraint, we investigate the potential of GPU acceleration in glacier modeling. The goal of this preliminary research is to develop a three-dimensional full Stokes numerical model and apply it to glacier flow. We numerically solve the nonlinear Stokes momentum balance equations together with the incompressibility equation. Strong nonlinearities for the ice rheology are also taken into account. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme. We have ported it to C-CUDA to run it on GPUs. Our model is benchmarked against other full Stokes solutions for all diagnostic ISMIP-HOM experiments (Pattyn et al., 2008). The preliminary results show good agreement with the other models. The major advantages of our programming approach are simplicity and a speed-up of order 10-100 times in comparison to the serial CPU version of the code. Future work will include some real world applications and we will implement the free surface evolution capabilities. References: [1] F. Pattyn, L. Perichon, A. Aschwanden, B. Breuer, D.B. Smedt, O. Gagliardini, G.H. Gudmundsson, R.C.A. Hindmarsh, A. Hubbard, J.V. Johnson, T. Kleiner, Y. Konovalov, C. Martin, A.J. Payne, D. Pollard, S. Price, M. Ruckamp, F. Saito, S. Sugiyama, and T. Zwinger, Benchmark experiments for higher-order and full-Stokes ice sheet models (ISMIP-HOM), The Cryosphere, 2 (2008), 95-108.
POM.gpu-v1.0: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.
2015-09-01
Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
GPU-accelerated denoising of 3D magnetic resonance images
Howison, Mark; Wes Bethel, E.
2014-05-29
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
A survey of GPU-based medical image computing techniques.
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming; Wang, Defeng
2012-09-01
Medical imaging currently plays a crucial role throughout the entire clinical applications from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to process in practical clinical applications. With the rapidly enhancing performances of graphics processors, improved programming support, and excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for the starters or researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine. PMID:23256080
GPU-based Volume Rendering for Medical Image Visualization.
Heng, Yang; Gu, Lixu
2005-01-01
With the rapid advancement of medical image visualization and augmented virtual reality applications, the low performance of the volume rendering algorithm is still a bottleneck. To make use of well-developed hardware resources, a novel graphics processing unit (GPU)-based volume ray-casting algorithm is proposed in this paper. Running on a normal PC, it achieves interactive rates while keeping the same image quality as the traditional volume rendering algorithm. Recently, GPU-accelerated direct volume rendering has positioned itself as an efficient tool for the display and visual analysis of volume data. However, for large medical image data, it often shows low efficiency because of its large memory requirements. Furthermore, it has the drawback of writing color buffers multiple times per frame. The proposed algorithm improves the situation by implementing the ray-casting operation completely on the GPU. It needs only one slice plane from the CPU and one 3D texture to store data, while the GPU calculates the two end points of the ray and carries out the color blending operation in its pixel programs. Thus both the rendering speed and the memory consumption are improved, and the algorithm can handle most medical image data on normal PCs at interactive speed. PMID:17281405
QYMSYM: A GPU-accelerated hybrid symplectic integrator
NASA Astrophysics Data System (ADS)
Moore, Alexander; Quillen, Alice C.
2012-10-01
QYMSYM is a GPU accelerated 2nd order hybrid symplectic integrator that identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. This is a parallel code running with CUDA on a video card that puts the many processors on board to work while taking advantage of fast shared memory.
Computing 2D constrained Delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges. PMID:23492377
Optimizing a mobile robot control system using GPU acceleration
NASA Astrophysics Data System (ADS)
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
Geological Visualization System with GPU-Based Interpolation
NASA Astrophysics Data System (ADS)
Huang, L.; Chen, K.; Lai, Y.; Chang, P.; Song, S.
2011-12-01
There has been a large body of research using parallel-processing GPUs to accelerate computation. In Near Surface Geology, efficient interpolation is critical for proper interpretation of measured data. Additionally, the appropriate interpolation method for generating proper results depends on factors such as the density of the measured locations and the estimation model. Therefore, a fast interpolation process is needed to efficiently find a proper interpolation algorithm for a set of collected data. However, a general CPU framework has to process each computation sequentially and is not efficient enough to handle the large number of interpolations generally needed in Near Surface Geology. On careful inspection of the interpolation process, the computation for each grid point is independent of all other computations. Therefore, the GPU parallel framework should be an efficient technology for accelerating the interpolation process, which is critical in Near Surface Geology. Thus, in this paper we design a geological visualization system whose core includes a set of interpolation algorithms, including Nearest Neighbor, Inverse Distance and Kriging. All these interpolation algorithms are implemented using both the CPU framework and the GPU framework. The comparison between the CPU and GPU implementations in terms of precision and processing speed shows that parallel computation can accelerate the interpolation process and also demonstrates the possibility of using a GPU-equipped personal computer to replace an expensive workstation. Immediate updates at the measurement site are the dream of geologists. In the future, the parallel and remote computation ability of the cloud will be explored to make mobile computation at the measurement site possible.
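The independence of grid points noted in the abstract is easy to see in inverse distance weighting, one of the three methods it lists. The sketch below is a serial illustration with hypothetical function names and an assumed power parameter of 2; it is not the system described in the paper.

```python
def idw(sample_points, sample_values, query, power=2.0):
    """Inverse distance weighted estimate at one query location."""
    num = 0.0
    den = 0.0
    for (x, y), value in zip(sample_points, sample_values):
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        if d2 == 0.0:
            return value                  # query coincides with a sample
        w = 1.0 / d2 ** (power / 2.0)     # weight = 1 / distance**power
        num += w * value
        den += w
    return num / den


def interpolate_grid(sample_points, sample_values, grid):
    """Evaluate every grid point; no iteration reads another's result."""
    return [idw(sample_points, sample_values, q) for q in grid]


points = [(0.0, 0.0), (2.0, 0.0)]
values = [1.0, 3.0]
grid_values = interpolate_grid(points, values, [(1.0, 0.0), (0.0, 0.0)])
```

Because each call to the per-point function reads only the shared sample data, the list comprehension could be replaced by one GPU thread per grid point without any synchronization between threads.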
GPU implementations for fast factorizations of STAP covariance matrices
NASA Astrophysics Data System (ADS)
Roeder, Michael; Davis, Nolan; Furtek, Jeremy; Braunreiter, Dennis; Healy, Dennis
2008-08-01
One of the main goals of the STAP-BOY program has been the implementation of a space-time adaptive processing (STAP) algorithm on graphics processing units (GPUs) with the goal of reducing the processing time. Within the context of GPU implementation, we have further developed algorithms that exploit data redundancy inherent in particular STAP applications. Integration of these algorithms with GPU architecture is of primary importance for fast algorithmic processing times. STAP algorithms involve solving a linear system in which the transformation matrix is a covariance matrix. A standard method involves estimating a covariance matrix from a data matrix, computing its Cholesky factors by one of several methods, and then solving the system by substitution. Some STAP applications have redundancy in successive data matrices from which the covariance matrices are formed. For STAP applications in which a data matrix is updated with the addition of a new data row at the bottom and the elimination of the oldest data in the top of the matrix, a sequence of data matrices have multiple rows in common. Two methods have been developed for exploiting this type of data redundancy when computing Cholesky factors. These two methods are referred to as 1) Fast QR factorizations of successive data matrices 2) Fast Cholesky factorizations of successive covariance matrices. We have developed GPU implementations of these two methods. We show that these two algorithms exhibit reduced computational complexity when compared to benchmark algorithms that do not exploit data redundancy. More importantly, we show that when these algorithmic improvements are optimized for the GPU architecture, the processing times of a GPU implementation of these matrix factorization algorithms may be greatly improved.
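The standard method mentioned above, factoring the covariance matrix and then solving by substitution, can be sketched as follows. This is the baseline path only and does not exploit the redundancy between successive data matrices that the paper addresses. It also assumes a real symmetric positive definite matrix for simplicity, whereas STAP covariance matrices are generally complex Hermitian. Function names are hypothetical.

```python
import math


def cholesky(R):
    """Return lower-triangular L with L L^T = R (R symmetric positive definite)."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_spd(R, b):
    """Solve R w = b by Cholesky factorization and two triangular solves."""
    n = len(R)
    L = cholesky(R)
    y = [0.0] * n
    for i in range(n):                     # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    w = [0.0] * n
    for i in reversed(range(n)):           # back substitution: L^T w = y
        w[i] = (y[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))) / L[i][i]
    return w


R = [[4.0, 2.0],
     [2.0, 3.0]]
w = solve_spd(R, [2.0, 1.0])
```

The factorization dominates the cost for large matrices, which is why reusing work across successive, mostly overlapping data matrices is attractive.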
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU)-based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of the 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses the Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) a conventional C language program augmented by GPU CUDA and CULA routines (C GPU), and (2) a MATLAB program supported by the MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on the host CPU and the GPU system is presented for the C and MATLAB implementations. The forward computation uses the finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time achieved for one iteration of the DOT reconstruction for 14610 elements is 0.52 seconds for the C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum reconstruction rate achieved is 2 frames per second. PMID:24891848
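A Broyden-type update replaces a full recomputation of the Jacobian with a rank-one correction built from the latest change in parameters and the corresponding change in predicted measurements. The sketch below shows the standard rank-one Broyden formula; the abstract does not state which variant the authors use, so this is an assumption, and the function name is hypothetical.

```python
def broyden_update(J, dx, dy):
    """Rank-one Broyden update: J + ((dy - J dx) dx^T) / (dx^T dx).

    J  : current Jacobian estimate, list of rows
    dx : change in the parameter vector
    dy : observed change in the model output
    """
    Jdx = [sum(Jij * dxj for Jij, dxj in zip(row, dx)) for row in J]
    denom = sum(d * d for d in dx)
    return [[Jij + (dyi - Jdxi) * dxj / denom for Jij, dxj in zip(row, dx)]
            for row, dyi, Jdxi in zip(J, dy, Jdx)]


J0 = [[1.0, 0.0],
      [0.0, 1.0]]
dx = [1.0, 2.0]
dy = [3.0, 1.0]
J1 = broyden_update(J0, dx, dy)
# The updated matrix reproduces the observed change along dx.
J1dx = [sum(a * b for a, b in zip(row, dx)) for row in J1]
```

The update costs one matrix-vector product and one outer product, both of which are simple data-parallel operations.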
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
Jia, X; Folkerts, M; Shi, F; Yan, H; Yan, Y; Jiang, S; Sivagnanam, S; Majumdar, A
2014-06-01
Purpose: The graphics processing unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, repeated efforts are quite often spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purposes. After the code run is complete, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is found to be fully functional. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary: Program title: SWsolver. Catalogue identifier: AEGY_v1_0.
[Design of a volume-rendering toolkit using GPU-based ray-casting].
Liu, Wen-Qing; Chen, Chun-Xiao; Lu, Li-Na
2009-09-01
This paper presents an approach to GPU-based ray-casting on a shader model 3.0 compatible graphics card. In addition, a software toolkit is designed using the proposed algorithm to take full advantage of the GPU by extending VTK. Experimental results show remarkable speedups with the GPU-based algorithm, and high-quality renderings can be achieved at interactive frame rates above 60 fps. The toolkit provides a high level of usability and extensibility. PMID:20073244
Fast computer simulation of reconstructed image from rainbow hologram based on GPU
NASA Astrophysics Data System (ADS)
Shuming, Jiao; Yoshikawa, Hiroshi
2015-10-01
A fast computer simulation solution for rainbow hologram reconstruction based on GPU is proposed. In the commonly used segment Fourier transform method for rainbow hologram reconstruction, the computation of 2D Fourier transform on each hologram segment is very time consuming. GPU-based parallel computing can be applied to improve the computing speed. Compared with CPU computing, simulation results indicate that our proposed GPU computing can effectively reduce the computation time by as much as eight times.
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
NASA Astrophysics Data System (ADS)
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate the dendritic growth with motion of solids in a dilute binary alloy melt. The GPU accelerated PFLB method has been implemented using CUDA C. The equiaxed dendritic growth in a shear flow and settling condition have been simulated by the developed GPU code. It has been confirmed that the PFLB simulations were efficiently accelerated by introducing the GPU computation. The characteristic dendrite morphologies which depend on the melt flow and the motion of the dendrite could also be confirmed by the simulations.
Rapid Parallel Calculation of shell Element Based On GPU
NASA Astrophysics Data System (ADS)
Wanga, Jian Hua; Lia, Guang Yao; Lib, Sheng; Li, Guang Yao
2010-06-01
Long computing times limit the application of the finite element method (FEM). In this paper, an effective method is put forward to speed up FEM calculations using a modern graphics processing unit and its programmable rendering tools. The method devises a representation of the element information suited to the features of the GPU, converts all element calculations into a rendering process, carries out the internal force calculation for all elements, and overcomes the low level of parallelism previously obtained when running on a single computer. Studies show that this method can improve efficiency and greatly shorten computing time. Simulation results for an elasticity problem with a large number of elements in sheet metal show that the GPU parallel calculation is faster than the CPU calculation. This approach is useful and efficient for solving engineering problems.
GPU-specific reformulations of image compression algorithms
NASA Astrophysics Data System (ADS)
Matela, Jiří; Holub, Petr; Jirman, Martin; Šrom, Martin
2012-10-01
Image compression has a number of applications in various fields, where processing throughput and/or latency is a crucial attribute and the main limitation of state-of-the-art implementations of compression algorithms. At the same time contemporary GPU platforms provide tremendous processing power but they call for specific algorithm design. We discuss key components of successful design of compression algorithms for GPUs and demonstrate this on JPEG and JPEG2000 implementations, each of which contains several types of algorithms requiring different approaches to efficient parallelization for GPUs. Performance evaluation of the optimized JPEG and JPEG2000 chain is used to demonstrate the importance of various aspects of GPU programming, especially with respect to real-time applications.
Interactive brain shift compensation using GPU based programming
NASA Astrophysics Data System (ADS)
van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
2009-02-01
Processing large image files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphics processing units (GPUs) has increased significantly. With the pixel shader model 4.0 the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU-based programming enables real-time processing of high-definition image datasets, and various applications can be developed in medicine, optics and image sciences.
GPU-based 3D lower tree wavelet video encoder
NASA Astrophysics Data System (ADS)
Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Drummond, Leroy Anthony; Migallón, Hector
2013-12-01
The 3D-DWT is a mathematical tool of increasing importance in applications that require efficient processing of huge amounts of volumetric information. Other applications, such as professional video editing, video surveillance, multi-spectral satellite imaging, and HQ video delivery, would rather use 3D-DWT encoders to reconstruct a frame as fast as possible. In this article, we introduce a fast GPU-based encoder which uses the 3D-DWT transform and lower trees. We also present an exhaustive analysis of the use of GPU memory. Our proposal shows a good trade-off between R/D, coding delay (as fast as MPEG-2 for high definition) and memory requirements (up to 6 times less memory than x264).
Implementing the projected spatial rich features on a GPU
NASA Astrophysics Data System (ADS)
Ker, Andrew D.
2014-02-01
The Projected Spatial Rich Model (PSRM) generates powerful steganalysis features, but requires the calculation of tens of thousands of convolutions with image noise residuals. This makes it very slow: the reference implementation takes an impractical 20–30 minutes per 1 megapixel (Mpix) image. We present a case study which first tweaks the definition of the PSRM features, to make them more efficient, and then optimizes an implementation on GPU hardware which exploits their parallelism (whilst avoiding the worst of their sequentiality). Some nonstandard CUDA techniques are used. Even with only a single GPU, the time for feature calculation is reduced by three orders of magnitude, and the detection power is reduced only marginally.
Implementation and Optimization of Image Processing Algorithms on Embedded GPU
NASA Astrophysics Data System (ADS)
Singhal, Nitin; Yoo, Jin Woo; Choi, Ho Yeol; Park, In Kyu
In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on embedded GPU using OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantage when compared to embedded CPU. Additionally, we propose techniques to achieve increased performance with optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of the execution time and speed-up achieved in comparison with the implementation on embedded CPU.
GPU Based Software Correlators - Perspectives for VLBI2010
NASA Technical Reports Server (NTRS)
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
GPU acceleration of time-domain fluorescence lifetime imaging
NASA Astrophysics Data System (ADS)
Wu, Gang; Nowotny, Thomas; Chen, Yu; Li, David Day-Uei
2016-01-01
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a graphic processing unit (GPU) based FLIM analysis tool suitable for high-speed, flexible time-domain FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. We demonstrate how to implement and optimize FLIM algorithms on GPUs for both iterative and noniterative FLIM analysis algorithms. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision, the GPU analysis can be up to 24-fold faster than its CPU-OpenMP counterpart. This means that even for high-precision but time-consuming iterative FLIM algorithms, GPUs enable fast or even real-time analysis.
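The abstract above does not name the noniterative algorithms it uses. As a hedged illustration of why such analysis parallelizes well, the sketch below applies rapid lifetime determination (RLD), a standard two-gate estimator for a mono-exponential decay, independently to every pixel. The choice of RLD and all function names are assumptions made for illustration, not details from the paper; on a GPU each pixel would map to one thread.

```python
import math

def rld_lifetime(d0, d1, gate_width):
    """Two-gate RLD estimate: tau = gate_width / ln(d0 / d1).

    d0 and d1 are photon counts in two contiguous gates of equal width.
    Returns NaN when the counts do not describe a decaying signal.
    """
    if d1 <= 0 or d0 <= d1:
        return float("nan")
    return gate_width / math.log(d0 / d1)

def lifetime_map(gate0, gate1, gate_width):
    """Apply the estimator to each pixel; every pixel is independent."""
    return [[rld_lifetime(a, b, gate_width) for a, b in zip(row0, row1)]
            for row0, row1 in zip(gate0, gate1)]

def synthetic_gates(tau, gate_width, amplitude=1000.0):
    """Noise-free gate counts for the decay amplitude * exp(-t / tau)."""
    decay = math.exp(-gate_width / tau)
    d0 = amplitude * tau * (1.0 - decay)   # integral over [0, w]
    return d0, d0 * decay                  # integral over [w, 2w]

# Hypothetical 2x2 image with known lifetimes (arbitrary time units).
taus = [[1.0, 2.5], [4.0, 0.5]]
gate_width = 1.0
pairs = [[synthetic_gates(t, gate_width) for t in row] for row in taus]
gate0 = [[p[0] for p in row] for row in pairs]
gate1 = [[p[1] for p in row] for row in pairs]
estimated = lifetime_map(gate0, gate1, gate_width)
```

With noise-free synthetic data the estimate recovers the input lifetimes up to rounding. Real data would add photon noise, which is where iterative fitting becomes attractive.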
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
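For readers unfamiliar with the quantity, the usual textbook definition of the FTLE is sketched below. The notation (flow map \Phi, start time t_0, integration time T) is the conventional one and is not taken from the paper itself.

```latex
% Flow map: position at time t_0 + T of the particle that starts at x at time t_0
\Phi_{t_0}^{t_0+T} : x \mapsto x(t_0 + T;\, t_0, x)

% Right Cauchy-Green deformation tensor built from the flow-map gradient
C(x) = \left( \nabla \Phi_{t_0}^{t_0+T}(x) \right)^{\top} \nabla \Phi_{t_0}^{t_0+T}(x)

% FTLE: growth rate of the largest stretching over the interval T
\sigma_{t_0}^{T}(x) = \frac{1}{|T|} \ln \sqrt{ \lambda_{\max}\bigl( C(x) \bigr) }
```

Evaluating \nabla\Phi on a grid means advecting a particle from every grid node, which is the resampling cost the abstract refers to.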
GPU-Accelerated Molecular Modeling Coming Of Age
Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.
2010-01-01
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
Quantifying NUMA and Contention Effects in Multi-GPU Systems
Spafford, Kyle L; Meredith, Jeremy S; Vetter, Jeffrey S
2011-01-01
As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies these effects and presents some guidance on programming strategies to maximize performance in multi-GPU environments.
Harnessing your GPU for interactive immersive oceanographic modeling
NASA Astrophysics Data System (ADS)
Hermann, A. J.; Moore, C. W.
2011-12-01
We report on recent success using GPU for interactive Lagrangian (fish) and Eulerian (tsunami) modeling of marine systems. Lagrangian analyses based on numerical float tracks are a fundamental tool in hydrodynamic and marine biological modeling. In particular, spatially-explicit individual-based models (IBMs) can be used to explore how changes in coastal circulation affect fish recruitment, and 3D viewing of the results leads to new insights regarding the effects of behavior on spatial path. One limit to the usefulness of this modeling approach is the latency between submission of a run and examination of the results, especially when a large (i.e. statistically meaningful) number of individuals are being tracked through finely resolved current and scalar fields. Since float tracking is an inherently parallel problem, the hundreds of cores available in modern graphics cards (GPU) can readily increase the performance of suitably adapted code by two orders of magnitude at low cost. This offers a way forward to achieve interactive submission/examination of IBMs (and float tracks in general), even on a laptop computer. Latency is an even larger issue in tsunami forecasting, where there is a need to run simple deep-ocean shallow water wave models in real time, particularly during an event when tsunamigenic earthquake events occur outside known fault zones. This problem, too, lends itself to dramatic speedup via GPU, given a suitable parallel algorithm for the shallow water solver. Here we demonstrate successful model speedup using GPU-adapted code for: 1) a spatially explicit IBM prototype, based on pre-stored circulation model output for the Bering Sea; 2) real-time runs of tsunami propagation. In both cases, results will be presented using the stereo-immersive capabilities of the graphics card, for 3D animation.
SAR wind retrieval: from Singlecore to Multicore and GPU computing
NASA Astrophysics Data System (ADS)
Myasoedov, Alexander; Monzikova, Anna
The large spatial coverage and high resolution of spaceborne synthetic aperture radars (SAR) offer a unique opportunity to derive mesoscale wind fields over the ocean surface, providing high-resolution wind fields near the shore. On the other hand, because SAR images are large, processing them can be burdensome in operational tasks or long-period statistical analysis. Algorithms for satellite image processing often offer many possibilities for parallelism (e.g., pixel-by-pixel processing), which makes them good candidates for execution on high-performance parallel computing hardware such as multicore CPUs and modern graphics processing units (GPUs). In this study we implement different SAR wind speed retrieval algorithms (e.g., CMOD4, CMOD5) for single-core and multicore systems, including GPUs. For this purpose both serial and parallelized versions of the CMOD algorithms were written in Matlab, Python, CPython and PyOpenCL. We apply these algorithms to an Envisat ASAR image and compare the results obtained with different versions of the algorithms executed on both an Intel CPU and a Tesla GPU. As a result of our experiments we not only show a speedup of up to 400 times for the GPU compared with the CPU, but also try to give some advice on how much time and effort it took to write the same algorithm in different programming languages. We hope that our experience will help other scientists get the full benefit of GPU/multicore computing.
GPU-based cone-beam reconstruction using wavelet denoising
NASA Astrophysics Data System (ADS)
Jin, Kyungchan; Park, Jungbyung; Park, Jongchul
2012-03-01
Scattering noise artifacts arising in low-dose projections in repetitive cone-beam CT (CBCT) scans decrease image quality and lessen the accuracy of diagnosis. To improve the image quality of low-dose CT imaging, statistical filtering is more effective for noise reduction. However, applying image filtering and enhancement throughout the entire reconstruction process may be challenging because of the high computational cost. The general reconstruction algorithm for CBCT data is filtered back-projection, which for a volume of 512×512×512 takes up to a few minutes on a standard system. To speed up reconstruction, the massively parallel architecture of current graphics processing units (GPUs) is a suitable platform for accelerating the mathematical calculations. In this paper, we focus on accelerating wavelet denoising and Feldkamp-Davis-Kress (FDK) back-projection using parallel processing on a GPU, utilizing the compute unified device architecture (CUDA) platform to implement CBCT reconstruction. Finally, we evaluate our implementation on clinical tooth data sets. The resulting implementation of wavelet denoising is able to process a 1024×1024 image within 2 ms, excluding the data loading process, and our GPU-based CBCT implementation reconstructs a 512×512×512 volume from 400 projections in less than 1 minute.
Fast, parallel implementation of particle filtering on the GPU architecture
NASA Astrophysics Data System (ADS)
Gelencsér-Horváth, Anna; Tornai, Gábor János; Horváth, András; Cserey, György
2013-12-01
In this paper, we introduce a modified cellular particle filter (CPF) which we mapped on a graphics processing unit (GPU) architecture. We developed this filter adaptation using a state-of-the-art CPF technique. Mapping this filter realization on a highly parallel architecture entailed a shift in the logical representation of the particles. In this process, the original two-dimensional organization is reordered as a one-dimensional ring topology. We proposed a proof-of-concept measurement on two models with an NVIDIA Fermi architecture GPU. This design achieved a 411-μs kernel time per state and a 77-ms global running time for all states for 16,384 particles with a 256 neighbourhood size on a sequence of 24 states for a bearing-only tracking model. For a commonly used benchmark model at the same configuration, we achieved a 266-μs kernel time per state and a 124-ms global running time for all 100 states. Kernel time includes random number generation on the GPU with curand. These results attest to the effective and fast use of the particle filter in high-dimensional, real-time applications.
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU
Arefan, D.; Talebpour, A.; Ahmadinejhad, N.; Kamali Asl, A.
2015-01-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, tomosynthesis systems can be used to perform tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for tomosynthesis mammography systems using a Graphics Processing Unit (GPU). First, projections of tomosynthesis mammography were simulated. To produce the tomosynthesis projections, a 3D breast phantom was designed from empirical data; it is based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was calculated for both versions (CPU and GPU). PMID:26171373
GPU Lossless Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
A GPU-computing Approach to Solar Stokes Profile Inversion
NASA Astrophysics Data System (ADS)
Harker, Brian J.; Mighell, Kenneth J.
2012-09-01
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS, employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disk maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel GA with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disk vector magnetograms derived by this method are shown using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT.
Bin recycling strategy for improving the histogram precision on GPU
NASA Astrophysics Data System (ADS)
Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.
2016-07-01
A histogram is an easily comprehensible way to present data and analyses. In the current scientific context, with access to large volumes of data, the processing time for building histograms has increased dramatically. For this reason, parallel construction is necessary to alleviate the impact of the processing time on analysis activities. In this scenario, GPU computing is becoming widely used to reduce the processing time of histogram construction to affordable levels. Along with the increase in processing time, implementations come under stress with respect to bin-count accuracy. Accuracy aspects due to the particularities of the implementations are not usually taken into consideration when building histograms from very large data sets. In this work, a bin recycling strategy for creating an accuracy-aware implementation for building histograms on GPU is presented. To evaluate the approach, the strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in cosmology for the study of the large-scale structure of the Universe. As a consequence of the study, a high-accuracy implementation for histogram construction on GPU is proposed.
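The abstract does not say which implementation detail limits bin-count accuracy. One well-known candidate, shown here purely as an assumption, is a bin stored as a single-precision float: once the count reaches 2^24, adding 1 no longer changes the stored value. The sketch emulates float32 rounding with the standard `struct` module; the helper names are hypothetical.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def increment_float32(counter, times):
    """Add 1.0 to a float32-rounded counter the given number of times."""
    for _ in range(times):
        counter = to_float32(counter + 1.0)
    return counter

limit = float(2 ** 24)                         # 16777216.0
below_limit = increment_float32(limit - 10.0, 10)
stuck = increment_float32(limit, 10)           # ten further increments are lost
exact = 2 ** 24 + 10                           # what an integer counter would hold
```

Here `below_limit` reaches 2^24 exactly, while `stuck` stays at 2^24 instead of reaching `exact`. Whether this is the effect the authors' bin recycling addresses cannot be confirmed from the abstract alone.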
Implementation of GPU-Accelerated Back Projection for EPR imaging
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2016-01-01
Electron paramagnetic resonance (EPR) Imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration, however it is not flexible (requires equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a GPU (Graphics Processing Units) having 1152 cores to accelerate the process. The GPU implementation results in acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time sensitive applications. PMID:26410654
GPU accelerated processing of astronomical high frame-rate videosequences
NASA Astrophysics Data System (ADS)
Vítek, Stanislav; Švihlík, Jan; Krasula, Lukáš; Fliegel, Karel; Páta, Petr
2015-09-01
Astronomical instruments located around the world are producing an incredibly large amount of possibly interesting scientific data. Astronomical research is expanding into large and highly sensitive telescopes. The total volume of data per night of operation also increases with the quality and resolution of state-of-the-art CCD/CMOS detectors. Since many ground-based astronomical experiments are placed in remote locations with limited access to the Internet, it is necessary to solve the problem of data storage. This mostly means that current data acquisition, processing and analysis algorithms require review. A decision about the importance of the data has to be made in a very short time. This work deals with GPU-accelerated processing of high frame-rate astronomical video sequences, mostly originating from the MAIA experiment (Meteor Automatic Imager and Analyser), an instrument primarily focused on observing faint meteoric events with high time resolution. The instrument, priced below 2000 euro, consists of an image intensifier and a gigabit Ethernet camera running at 61 fps. With resolution better than VGA, the system produces up to 2 TB of scientifically valuable video data per night. The main goal of the paper is not to optimize any GPU algorithm, but to propose and evaluate parallel GPU algorithms able to process a huge number of video sequences in order to delete all uninteresting data.
A GPU-COMPUTING APPROACH TO SOLAR STOKES PROFILE INVERSION
Harker, Brian J.; Mighell, Kenneth J.
2012-09-20
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS, employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disk maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel GA with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disk vector magnetograms derived by this method are shown using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT.
GMH: A Message Passing Toolkit for GPU Clusters
Jie Chen, W. Watson, Weizhen Mao
2011-01-01
Driven by the market demand for high-definition 3D graphics, commodity graphics processing units (GPUs) have evolved into highly parallel, multi-threaded, many-core processors, which are ideal for data parallel computing. Many applications have been ported to run on a single GPU with tremendous speedups using general C-style programming languages such as CUDA. However, large applications require multiple GPUs and demand explicit message passing. This paper presents a message passing toolkit, called GMH (GPU Message Handler), on NVIDIA GPUs. This toolkit utilizes a data-parallel thread group as a way to map multiple GPUs on a single host to an MPI rank, and introduces a notion of virtual GPUs as a way to bind a thread to a GPU automatically. This toolkit provides high performance MPI style point-to-point and collective communication, but more importantly, facilitates event-driven APIs to allow an application to be managed and executed by the toolkit at runtime.
Dynamic Load Balancing on Single- and Multi-GPU Systems
Chen, Long; Villa, Oreste; Krishnamoorthy, Sriram; Gao, Guang R.
2010-04-19
The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques supported and employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in existing APIs such as NVIDIA’s CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
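The general idea of task-based dynamic load balancing can be sketched on the CPU as follows. This is only an analogy using Python threads and a shared queue, not the authors' GPU implementation; the function name and the stand-in workload are hypothetical. Each worker pulls the next task when it becomes free, so a few expensive tasks do not pin down a fixed share of the work.

```python
import queue
import threading

def run_dynamic(task_costs, n_workers):
    """Process tasks of uneven cost with workers pulling from a shared queue."""
    tasks = queue.Queue()
    for task_id, cost in enumerate(task_costs):
        tasks.put((task_id, cost))

    results = {}                  # task_id -> computed value
    work_done = [0] * n_workers   # total cost handled by each worker
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task_id, cost = tasks.get_nowait()
            except queue.Empty:
                return            # no tasks left: this worker is finished
            value = sum(range(cost))   # stand-in for real computation
            with lock:
                results[task_id] = value
                work_done[worker_id] += cost

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, work_done

# Irregular workload: a few heavy tasks mixed with many light ones.
task_costs = [50000, 10, 10, 10, 40000, 10, 10, 10, 10, 30000]
results, work_done = run_dynamic(task_costs, n_workers=3)
```

How the work ends up split between workers depends on scheduling and varies from run to run; what is guaranteed is that every task is processed exactly once.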
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.
Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A
2015-06-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, tomosynthesis systems can be used to perform tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for tomosynthesis mammography systems using a Graphics Processing Unit (GPU). First, projections of tomosynthesis mammography were simulated. To produce the tomosynthesis projections, a 3D breast phantom was designed from empirical data; it is based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was calculated for both versions (CPU and GPU). PMID:26171373
Algebraic computations in seismology on GPU-clusters
NASA Astrophysics Data System (ADS)
Meskaranian, Mahjoobeh; Sadeghi, Hossein; Mohammadzaheri, Afsaneh; Toutounian, Faezeh; Navazandeh, Mahdi
2013-04-01
Recent advances in high-performance computing have allowed scientists to increase the speed of scientific computations. One of these advances is the Graphics Processing Unit (GPU), a many-core, multithreaded processor for high-performance computing. Algorithms that can be expressed as data-parallel computations, such as matrix processing, in which a single instruction is executed on multiple data (SIMD), are especially suitable for execution on a GPU. We present algorithms for the LSQR (Paige and Saunders, 1982) and LSMR (Fong and Saunders, 2011) methods, executable on GPUs. LSQR and LSMR are iterative methods for solving least squares problems that are usually used for solving inverse problems. These methods are based on Golub and Kahan's bidiagonalization process. LSQR and LSMR give reliable results especially when problems involve large, sparse, ill-conditioned matrices; such matrices can be found in seismic tomography. The most time-consuming operation in these methods is the sparse matrix-vector multiplication (SpMV). For efficient matrix storage as well as SpMV, we use a Compressed Sparse Row (vector) Format (Bell and Garland, 2008), that dedicates one warp (32 threads) to each row. The model resolution matrix illustrates how well estimated model parameters fit the true model parameters. Although some researchers tried to approximate a generalized inverse for the LSQR method, this method does not explicitly compute a generalized inverse. Therefore it cannot be clearly used to calculate the resolution matrix. However, following Yao et al. 2001, it is possible to determine the resolution matrix by running LSQR independently N times. Therefore, we can utilize the Map-Reduce idea in our algorithm for computation of the model resolution matrix on GPU-clusters. The Map-Reduce paradigm was popularized in 2004 by Google's researchers Dean and Ghemawat. Our algorithm is based on the Map-Reduce of Mohammadzaheri, et al. 2012, which consists of two main functions: Map and
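The CSR (vector) scheme mentioned in the abstract above, one 32-thread warp per matrix row, can be emulated serially to show the access pattern. The sketch below is plain Python, not the authors' GPU code; the function name is hypothetical, and the loops marked in comments are the ones that would run in parallel on a GPU.

```python
WARP_SIZE = 32  # threads per warp, as stated in the abstract

def csr_spmv_warp_per_row(row_ptr, col_idx, values, x):
    """Compute y = A x for a CSR matrix, emulating one warp per row."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):                 # on a GPU: one warp per row
        start, end = row_ptr[row], row_ptr[row + 1]
        partial = [0.0] * WARP_SIZE
        for lane in range(WARP_SIZE):         # on a GPU: lanes run together
            # Lane k handles nonzeros k, k + 32, k + 64, ... of this row,
            # so neighbouring lanes read neighbouring memory locations.
            for k in range(start + lane, end, WARP_SIZE):
                partial[lane] += values[k] * x[col_idx[k]]
        stride = WARP_SIZE // 2
        while stride > 0:                     # tree reduction of lane sums
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y[row] = partial[0]
    return y

# Small example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]], x = [1, 2, 3]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv_warp_per_row(row_ptr, col_idx, values, x)
```

For rows with far fewer than 32 nonzeros most lanes sit idle, which is why this layout tends to suit matrices with relatively long rows.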
GPU-based Integration with Application in Sensitivity Analysis
NASA Astrophysics Data System (ADS)
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies), whose aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of Environmental modelling and Environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned, qualitatively or quantitatively, to different sources of variation in the input of a model. One of the important classes of methods for Sensitivity Analysis is Monte Carlo based, first proposed by Sobol, and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 10^9 equations at every time-step, and the number of time-steps can be more than a million), which motivates its grid implementation. The MCSAES grid implementation scheme includes two main tasks: (i) Grid implementation of the DEM, (ii) Grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of the Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but for parallel implementations. Good scrambling is
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates the same streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs so that diffuse reflectance values can be calculated rapidly for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08 ms. The transfer time from CPU to GPU memory is currently a limiting factor in GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach can still lead to processing that is ~3400 times faster than other GPU-based approaches. PMID:24298424
Dynamic shader generation for GPU-based multi-volume ray casting.
Rössler, Friedemann; Botchen, Ralf P; Ertl, Thomas
2008-01-01
Real-time performance for rendering multiple intersecting volumetric objects requires the speed and flexibility of modern GPUs. This requirement has restricted programming of the necessary shaders to GPU experts only. A visualization system that dynamically generates GPU shaders for multi-volume ray casting from a user-definable abstract render graph overcomes this limitation. PMID:18753036
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU-enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU-enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
NASA Astrophysics Data System (ADS)
Mu, Dawei; Chen, Po; Wang, Liqiang
2013-02-01
We have successfully ported an arbitrary high-order discontinuous Galerkin (ADER-DG) method for solving the three-dimensional elastic seismic wave equation on unstructured tetrahedral meshes to an Nvidia Tesla C2075 GPU using the Nvidia CUDA programming model. On average our implementation obtained a speedup factor of about 24.3 for the single-precision version of our GPU code and a speedup factor of about 12.8 for the double-precision version of our GPU code when compared with the double precision serial CPU code running on one Intel Xeon W5880 core. When compared with the parallel CPU code running on two, four and eight cores, the speedup factor of our single-precision GPU code is around 12.9, 6.8 and 3.6, respectively. In this article, we give a brief summary of the ADER-DG method, a short introduction to the CUDA programming model and a description of our CUDA implementation and optimization of the ADER-DG method on the GPU. To our knowledge, this is the first study that explores the potential of accelerating the ADER-DG method for seismic wave-propagation simulations using a GPU.
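The speedup factors quoted in this abstract also imply how well the parallel CPU baseline itself scales. The following is a rough sketch of that arithmetic, assuming the averaged and rounded single-precision figures can simply be divided by one another; the function and variable names are illustrative and do not come from the paper.

```python
# Speedup of the single-precision GPU code over the CPU code running on
# n cores, as quoted in the abstract (averaged, rounded values).
gpu_speedup_vs_cores = {1: 24.3, 2: 12.9, 4: 6.8, 8: 3.6}


def implied_cpu_scaling(speedups):
    """Return {cores: (cpu_speedup, parallel_efficiency)} implied by the GPU figures.

    Assumption: the ratio of two GPU speedups equals the speedup of the
    n-core CPU run over the one-core CPU run.
    """
    base = speedups[1]
    result = {}
    for cores, gpu_speedup in sorted(speedups.items()):
        cpu_speedup = base / gpu_speedup  # n cores relative to 1 core
        result[cores] = (cpu_speedup, cpu_speedup / cores)
    return result


scaling = implied_cpu_scaling(gpu_speedup_vs_cores)
for cores, (speedup, efficiency) in scaling.items():
    print(f"{cores} cores: CPU speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions the eight-core CPU run comes out about 6.75 times faster than the single-core run, which corresponds to a parallel efficiency of roughly 84%.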
GPU Accelerated Numerical Simulation of Viscous Flow Down a Slope
NASA Astrophysics Data System (ADS)
Gygax, Remo; Räss, Ludovic; Omlin, Samuel; Podladchikov, Yuri; Jaboyedoff, Michel
2014-05-01
Numerical simulations are an effective tool in natural risk analysis. They are useful for determining the propagation and the runout distance of gravity-driven movements such as debris flows or landslides. To evaluate these processes, an approach combining analogue laboratory experiments with a GPU-accelerated numerical simulation of the flow of a viscous liquid down an inclined slope is considered. The physical processes underlying large gravity-driven flows share certain aspects with the propagation of debris mass in a rockslide and the spreading of water waves. Several studies have shown that the numerical implementation of the physical processes of viscous flow produces a good fit with laboratory observations, both quantitatively and qualitatively. Because this process is so well explored, we can concentrate on its numerical transcription and the application of the code in a GPU-accelerated environment to obtain a 3D simulation. The objective of providing a high-resolution numerical solution by NVIDIA CUDA GPU parallel processing is to increase the speed of the simulation and the accuracy of the prediction. The main goal is to write an easily adaptable code, as short as possible, on the widely used platform MATLAB, which will be translated to C-CUDA to achieve higher resolution and processing speed while running on an NVIDIA graphics card cluster. The numerical model, based on the finite difference scheme, is compared to analogue laboratory experiments. In this way our numerical model parameters are adjusted to reproduce the effective movements observed by high-speed camera acquisitions during the laboratory experiments.
Font rendering on a GPU-based raster image processor
NASA Astrophysics Data System (ADS)
Recker, John L.; Beretta, Giordano B.; Lin, I.-Jong
2010-01-01
Historically, in the 35 years of digital printing research, raster image processing has always lagged behind marking engine technology, i.e., we have never been able to deliver rendered digital pages as fast as digital print engines can consume them. This trend has resulted in products based on throttled digital printers or expensive raster image processors (RIP) with hardware acceleration. The current trend in computer software architecture is to leverage graphic processing units (GPU) for computing tasks whenever appropriate. We discuss the issues for rendering fonts on such an architecture and present an implementation.
Engineering a fully GPU-accelerated H.264 encoder
NASA Astrophysics Data System (ADS)
Li, Bowei; Deng, Yangdong S.
2013-07-01
H.264/AVC is the most popular video coding standard and plays an essential role in today's Internet-based content-delivery businesses. H.264's encoding process is computationally expensive due to the integration of complex video coding techniques. As a result, transcoding has become a bottleneck of content-hosting services. Recently, general purpose computing on graphics processing units (GPUs) has been rapidly rising as a popular computing model to expedite time-consuming applications. In this paper, we propose a fully GPU-accelerated H.264 encoder. Experimental results show that a 100% speed-up ratio can be achieved.
GPU-accelerated interactive visualization and planning of neurosurgical interventions.
Rincón-Nigro, Mario; Navkar, Nikhil V; Tsekos, Nikolaos V; Zhigang Deng
2014-01-01
Advances in computational methods and hardware platforms provide efficient processing of medical-imaging datasets for surgical planning. For neurosurgical interventions employing a straight access path, planning entails selecting a path from the scalp to the target area that's of minimal risk to the patient. A proposed GPU-accelerated method enables interactive quantitative estimation of the risk for a particular path. It exploits acceleration spatial data structures and efficient implementation of algorithms on GPUs. In evaluations of its computational efficiency and scalability, it achieved interactive rates even for high-resolution meshes. A user study and feedback from neurosurgeons identified this method's potential benefits for preoperative planning and intraoperative replanning. PMID:24808165
The BRUSH algorithm for two-electron integrals on GPU
NASA Astrophysics Data System (ADS)
Rák, Ádám; Cserey, György
2015-02-01
This Letter presents a new algorithmic method developed to evaluate two-electron repulsion integrals based on contracted Gaussian basis functions in a parallel way. This new algorithm scheme provides distinct SIMD (single instruction multiple data) optimized paths which symbolically transform integral parameters into target integral algorithms. Our measurements indicate that the method gives a significant improvement over the CPU-friendly PRISM algorithm. The benchmark tests (evaluation of more than 10^8 integrals using the STO-3G basis set) of our GPU (NVIDIA GTX 780) implementation showed up to 750-fold speedup compared to a single core of an Athlon II X4 635 CPU.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-01-01
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially GPU-based methods. In the classical GPU-based imaging algorithm, the GPU is employed to accelerate image processing by massively parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). The computing capability of the CPU is thus ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPUs/GPUs is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. For CPU parallel imaging, the advanced vector extension (AVX) method is introduced for the first time into the multi-core CPU parallel method for higher efficiency. For GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfer removed, but several optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method improves the efficiency of SAR imaging by 270 times over a single-core CPU and achieves real-time imaging, in that the imaging rate exceeds the raw data generation rate. PMID:27070606
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
2012-01-01
Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between the CPU and the GPU for cooperative parallel execution. Performance differences between the CPU and the GPU are considered in the task-allocation scheme to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is implemented with optimizations specific to the CPU and GPU architectures. Conclusions: The experimental results show a speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is shown to be promising and can be applied to other bioinformatics applications. PMID:22369626
Efficient GPU Acceleration for Integrating Large Thermonuclear Networks in Astrophysics
NASA Astrophysics Data System (ADS)
Guidry, Mike
2016-02-01
We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. We take as representative test cases Type Ia supernova explosions with extremely stiff thermonuclear reaction networks having 150-365 isotopic species and 1600-4400 reactions, assumed coupled to hydrodynamics using operator splitting. In such examples we demonstrate the capability to integrate independent thermonuclear networks from ~250-500 hydro zones (assumed to be deployed on CPU cores) in parallel on a single GPU in the same wall clock time that standard implicit methods can integrate the network for a single zone. This two or more orders of magnitude increase in efficiency for solving systems of realistic thermonuclear networks coupled to fluid dynamics implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications I will discuss our ongoing deployment of these new methods for Type Ia supernova explosions in astrophysics and for simulation of the complex atmospheric chemistry entering into weather and climate problems.
Accelerating sub-pixel marker segmentation using GPU
NASA Astrophysics Data System (ADS)
Handel, Holger
2009-02-01
Sub-pixel accurate marker segmentation is an important task for many computer vision systems. The 3D positions of markers are used in control loops to determine the position of machine tools or robot end-effectors. Accurate segmentation of the marker position in the image plane is crucial for accurate reconstruction. Many sub-pixel segmentation algorithms are computationally intensive, especially when the number of markers increases. Modern graphics hardware with its massively parallel architecture provides a powerful tool for many image segmentation tasks. In particular, the time-consuming sub-pixel refinement steps in marker segmentation can benefit from the recent progress. This article presents an implementation of a sub-pixel marker segmentation framework that uses the GPU to reduce processing time. The image segmentation chain consists of two stages. The first is a pre-processing stage which segments the initial position of the marker with pixel accuracy; the second stage refines the initial marker position to sub-pixel accuracy. Both stages are implemented as shader programs on the GPU. The flexible architecture allows different pre-processing and sub-pixel refinement algorithms to be combined. Experimental results show that significant speed-up can be achieved compared to CPU implementations, especially when the number of markers increases.
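The two-stage chain described in this abstract can be illustrated with a small CPU-side sketch. The abstract does not name the algorithms used in either stage, so the choices here are assumptions made purely for illustration: stage one takes the brightest pixel as the pixel-accurate guess, and stage two refines it with an intensity-weighted centroid. All function names are hypothetical.

```python
def coarse_marker_position(image):
    """Stage 1 (assumed): pixel-accurate guess, here simply the brightest pixel."""
    best = (0, 0)
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value > image[best[0]][best[1]]:
                best = (r, c)
    return best


def refine_subpixel(image, center, radius=1):
    """Stage 2 (assumed): intensity-weighted centroid in a window around the guess."""
    r0, c0 = center
    total = sum_r = sum_c = 0.0
    for r in range(max(0, r0 - radius), min(len(image), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(image[0]), c0 + radius + 1)):
            weight = image[r][c]
            total += weight
            sum_r += weight * r
            sum_c += weight * c
    if total == 0:
        # Empty window: fall back to the pixel-accurate position.
        return float(r0), float(c0)
    return sum_r / total, sum_c / total


# Synthetic 5x5 marker whose intensity is skewed slightly to the right.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 2, 8, 4, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 0, 0, 0],
]
coarse = coarse_marker_position(image)
fine = refine_subpixel(image, coarse)
print(coarse, fine)
```

Each marker is refined independently of the others, which is one reason this kind of work parallelizes well when the number of markers grows.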
Real-time ultrasound simulation using the GPU.
Gjerald, Sjur Urdson; Brekken, Reidar; Hergum, Torbjørn; D'hooge, Jan
2012-05-01
Ultrasound simulators can be used for training ultrasound image acquisition and interpretation. In such simulators, synthetic ultrasound images must be generated in real time. Anatomy can be modeled by computed tomography (CT). Shadows can be calculated by combining reflection coefficients and depth dependent, exponential attenuation. To include speckle, a pre-calculated texture map is typically added. Dynamic objects must be simulated separately. We propose to increase the speckle realism and allow for dynamic objects by using a physical model of the underlying scattering process. The model is based on convolution of the point spread function (PSF) of the ultrasound scanner with a scatterer distribution. The challenge is that the typical field-of-view contains millions of scatterers which must be selected by a virtual probe from an even larger body of scatterers. The main idea of this paper is to select and sample scatterers in parallel on the graphic processing unit (GPU). The method was used to image a cyst phantom and a movable needle. Speckle images were produced in real time (more than 10 frames per second) on a standard GPU. The ultrasound images were visually similar to images calculated by a reference method. PMID:22622973
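The scattering model in this abstract rests on convolving the scanner's point spread function with a scatterer distribution. A minimal one-dimensional sketch of that operation follows; the real method works on millions of scatterers selected by a virtual probe, and the sample values and function name here are assumptions for illustration only.

```python
def convolve_same(scatterers, psf):
    """Discrete convolution with output aligned to the input ('same' size).

    Each output sample depends only on nearby inputs, so samples can be
    computed independently of one another, e.g. one GPU thread per sample.
    """
    half = len(psf) // 2
    out = []
    for i in range(len(scatterers)):
        acc = 0.0
        for k, weight in enumerate(psf):
            j = i + half - k  # input index paired with psf[k]
            if 0 <= j < len(scatterers):
                acc += weight * scatterers[j]
        out.append(acc)
    return out


# Hypothetical scatterer amplitudes along one scan line and a small PSF.
scatterers = [0.0, 0.0, 1.0, 0.0, 0.0, -0.5, 0.0]
psf = [0.25, 0.5, 0.25]
image_line = convolve_same(scatterers, psf)
print(image_line)
```

A single isolated scatterer reproduces the PSF centred on its position, which is a quick sanity check for any implementation of this step.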
Fast-coding robust motion estimation model in a GPU
NASA Astrophysics Data System (ADS)
García, Carlos; Botella, Guillermo; de Sande, Francisco; Prieto-Matias, Manuel
2015-02-01
Nowadays vision systems are used for countless purposes. Motion estimation is a discipline that allows relevant information to be extracted, such as pattern segmentation, 3D structure or object tracking. However, the real-time requirements of most applications have limited its consolidation, since high performance systems must be adopted to meet response times. With the emergence of highly parallel devices known as accelerators, this gap has narrowed. Two extreme endpoints in the spectrum of the most common accelerators are Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), which usually offer higher performance rates than general-purpose processors. Moreover, the use of GPUs as accelerators involves the efficient exploitation of any parallelism in the target application. This task is not easy because performance rates are affected by many aspects that programmers must overcome. In this paper, we evaluate the OpenACC standard, a directive-based programming model that eases porting code to a GPU, in the context of a motion estimation application. The results confirm that this programming paradigm is suitable for such image processing applications, achieving a very satisfactory acceleration in convolution-based problems such as the well-known Lucas & Kanade method.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the most recent techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, and therefore their execution has been a big problem. Despite the huge computational cost, molecular dynamics has been implemented using traditional computers with a central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed with the goal of improving video games, by rapidly creating images in a frame buffer for display on devices such as screens. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology for performing a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations including MD simulations. Herein, we describe the new methodologies developed initially for video games and how they are now applied in MD simulations. PMID:27525251
Large Data Visualization on Distributed Memory Multi-GPU Clusters
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
GPU-accelerated visualization of protein dynamics in ribbon mode
NASA Astrophysics Data System (ADS)
Wahle, Manuel; Birmanns, Stefan
2011-01-01
Proteins are biomolecules present in living organisms and essential for carrying out vital functions. Inherent to their functioning is folding into different spatial conformations, and to understand these processes, it is crucial to visually explore the structural changes. In recent years, significant advancements in experimental techniques and novel algorithms for post-processing of protein data have routinely revealed static and dynamic structures of increasing sizes. In turn, interactive visualization of the systems and their transitions became more challenging. Therefore, much research on the efficient display of protein dynamics has been done, with the focus being on space-filling models, but for the important class of abstract ribbon or cartoon representations, only a few methods for efficient rendering exist. Yet, these models are of high interest to scientists, as they provide a compact and concise description of the structure elements along the protein main chain. In this work, a method was developed to speed up ribbon and cartoon visualizations. Separating the calculation of geometry into two phases allows computational work to be offloaded from the CPU to the GPU. The first phase consists of computing a smooth curve along the protein's main chain on the CPU. In the second phase, conducted independently by the GPU, vertices along that curve are moved to set up the final geometrical representation of the molecule.
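The first phase described in this abstract, computing a smooth curve along the main chain on the CPU, can be sketched as follows. The abstract does not say which curve type the authors use; a uniform Catmull-Rom spline through hypothetical backbone control points is assumed here only for illustration, and the second (GPU) phase is not shown.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom segment between p1 (t=0) and p2 (t=1)."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def smooth_backbone(points, samples_per_segment=4):
    """Sample a curve through points[1] .. points[-2].

    The first and last control points only shape the end tangents.
    """
    curve = []
    for i in range(1, len(points) - 2):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            curve.append(catmull_rom_point(points[i - 1], points[i],
                                           points[i + 1], points[i + 2], t))
    curve.append(tuple(float(x) for x in points[-2]))
    return curve


# Hypothetical control points standing in for backbone atom positions.
backbone = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 1, 1), (4, 0, 1)]
curve = smooth_backbone(backbone)
print(len(curve), curve[0], curve[-1])
```

The sampled curve passes through the interior control points, so a later per-vertex stage can displace points along it without revisiting the spline evaluation.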
Accelerating universal Kriging interpolation algorithm using CUDA-enabled GPU
NASA Astrophysics Data System (ADS)
Cheng, Tangpei
2013-04-01
Kriging algorithms are a group of important interpolation methods, which are very useful in many geological applications. However, the algorithm based on traditional general purpose processors can be computationally expensive, especially when the problem scale expands. Inspired by the current trend in graphics processing technology, we proposed an efficient parallel scheme to accelerate the universal Kriging algorithm on the NVIDIA CUDA platform. Some high-performance mathematical functions have been introduced to calculate the compute-intensive steps in the Kriging algorithm, such as matrix-vector multiplication and matrix-matrix multiplication. To further optimize performance, we reduced the memory transfer overhead by restructuring the time-consuming loops specifically for execution on the GPU. In the numerical experiment, we compared the performance of different multi-core CPU and GPU implementations in interpolating a geological site. The improved CUDA implementation shows a nearly 18× speedup with respect to the sequential program, is 6.32 times faster than the OpenMP-based version running on an Intel Xeon E5320 quad-core CPU, and scales well with the size of the system.
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of an unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at low level, using shared memory and coalesced accesses to memory. The results presented herein indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
Electromagnetic metamaterial simulations using a GPU-accelerated FDTD method
NASA Astrophysics Data System (ADS)
Seok, Myung-Su; Lee, Min-Gon; Yoo, SeokJae; Park, Q.-Han
2015-12-01
Metamaterials composed of artificial subwavelength structures exhibit extraordinary properties that cannot be found in nature. Designing artificial structures having exceptional properties plays a pivotal role in current metamaterial research. We present a new numerical simulation scheme for metamaterial research. The scheme is based on a graphic processing unit (GPU)-accelerated finite-difference time-domain (FDTD) method. The FDTD computation can be significantly accelerated when GPUs are used instead of only central processing units (CPUs). We explain how the fast FDTD simulation of large-scale metamaterials can be achieved through communication optimization in a heterogeneous CPU/GPU-based computer cluster. Our method also includes various advanced FDTD techniques: the non-uniform grid technique, the total-field/scattered-field (TFSF) technique, the auxiliary field technique for dispersive materials, the running discrete Fourier transform, and the complex structure setting. We demonstrate the power of our new FDTD simulation scheme by simulating the negative refraction of light in a coaxial waveguide metamaterial.
Application of GPU processing for Brownian particle simulation
NASA Astrophysics Data System (ADS)
Cheng, Way Lee; Sheharyar, Ali; Sadr, Reza; Bouhali, Othmane
2015-01-01
Reports on the anomalous thermal-fluid properties of nanofluids (dilute suspension of nano-particles in a base fluid) have been the subject of attention for 15 years. The underlying physics that govern nanofluid behavior, however, is not fully understood and is a subject of much dispute. The interactions between the suspended particles and the base fluid have been cited as a major contributor to the improvement in heat transfer reported in the literature. Numerical simulations are instrumental in studying the behavior of nanofluids. However, such simulations can be computationally intensive due to the small dimensions and complexity of these problems. In this study, a simplified computational approach for isothermal nanofluid simulations was applied, and simulations were conducted using both conventional CPU and parallel GPU implementations. The GPU implementations significantly improved the computational performance, in terms of the simulation time, by a factor of 1000-2500. The results of this investigation show that, as the computational load increases, the simulation efficiency approaches a constant. At a very high computational load, the amount of improvement may even decrease due to limited system memory.
High-throughput GPU-based LDPC decoding
NASA Astrophysics Data System (ADS)
Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin
2010-08-01
Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, Wi-Fi and 10GBASE-T. The need for reliable and flexible communication links across a wide variety of communication standards and configurations has driven the demand for high-performance and flexible computing. Accordingly, finding a fast and reconfigurable development platform for designing high-throughput LDPC decoders has become important, especially for rapidly changing communication standards and configurations. In this paper, a new graphics processing unit (GPU) LDPC decoding platform with asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved a 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
Commodity CPU-GPU System for Low-Cost, High-Performance Computing
NASA Astrophysics Data System (ADS)
Wang, S.; Zhang, S.; Weiss, R. M.; Barnett, G. A.; Yuen, D. A.
2009-12-01
We have put together a desktop computer system for under 2.5 K dollars from commodity components that consist of one quad-core CPU (Intel Core 2 Quad Q6600 Kentsfield 2.4GHz) and two high end GPUs (nVidia's GeForce GTX 295 and Tesla C1060). A 1200 watt power supply is required. On this commodity system, we have constructed an easy-to-use hybrid computing environment, in which Message Passing Interface (MPI) is used for managing the working loads, for transferring the data among different GPU devices, and for minimizing the need of CPU’s memory. The test runs using the MAGMA (Matrix Algebra on GPU and Multicore Architectures) library show that the speed ups for double precision calculations can be greater than 10 (GPU vs. CPU) and they are bigger (> 20) for single precision calculations. In addition we have enabled the combination of Matlab with CUDA for interactive visualization through MPI, i.e., two GPU devices are used for simulation and one GPU device is used for visualizing the computing results as the simulation goes. Our experience with this commodity system has shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on CPUs. With NVIDIA CEO Jen-Hsun Huang's claim that over the next 6 years GPU processing power will increase by 570x compared to the 3x for CPUs, future low-cost commodity computers such as ours may be a remedy for the long wait queues of the world's supercomputers, especially for small- and mid-scale computation. Our goal here is to explore the limits and capabilities of this emerging technology and to get ourselves ready to run large-scale simulations on the next generation of computing environment, which we believe will hybridize CPU and GPU architectures.
Papadopoulos, Agathoklis; Kostoglou, Kyriaki; Mitsis, Georgios D; Theocharides, Theocharis
2015-01-01
The use of a GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in biomedical engineering and biology-related applications has shown promising results. GPU acceleration can be used to speed up computation-intensive models, such as the mathematical modeling of biological systems, which often requires the use of nonlinear modeling approaches with a large number of free parameters. In this context, we developed a CUDA-enabled version of a model which implements a nonlinear identification approach that combines basis expansions and polynomial-type networks, termed Laguerre-Volterra networks, and can be used in diverse biological applications. The proposed software implementation uses the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the aforementioned modeling approach to execute the calculations on the GPU card of the host computer system. The initial results of the GPU-based model presented in this work show performance improvements over the original MATLAB model. PMID:26736993
Targeting Atmospheric Simulation Algorithms for Large Distributed Memory GPU Accelerated Computers
Norman, Matthew R
2013-01-01
Computing platforms are increasingly moving to accelerated architectures, and here we deal particularly with GPUs. In [15], a method was developed for atmospheric simulation to improve efficiency on large distributed memory machines by reducing communication demand and increasing the time step. Here, we improve upon this method to further target GPU accelerated platforms by reducing GPU memory accesses, removing a synchronization point, and better clustering computations. The modification ran over two times faster in some cases even though more computations were required, demonstrating the merit of improving memory handling on the GPU. Furthermore, we discover that the modification also has a near 100% hit rate in fast on-chip L1 cache and discuss the reasons for this. In concluding, we remark on further potential improvements to GPU efficiency.
Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines
Jia, Weile; University of Chinese Academy of Sciences, Beijing ; Fu, Jiyun; University of Chinese Academy of Sciences, Beijing ; Cao, Zongyan; Wang, Long; Chi, Xuebin; Gao, Weiguo; MOE Key Laboratory of Computational Physical Sciences, Fudan University, Shanghai ; Wang, Lin-Wang
2013-10-15
Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most widely used method for material simulations, but its absolute speed stagnated due to the inability to use large scale CPU based computers. By a drastic redesign of the algorithm, and moving all the major computation parts onto the GPU, we have reached a speed of 12 s per molecular dynamics (MD) step for a 512 atom system using 256 GPU cards. This is about 20 times faster than the CPU version of the code regardless of the number of CPU cores used. Our tests and analysis on different GPU platforms and configurations shed light on the optimal GPU deployments for PWP-DFT calculations. An 1800 step MD simulation is used to study the liquid phase properties of GaInP.
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.
2007-01-08
The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on an accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
Numerically Tracking Contact Discontinuities with an Introduction for GPU Programming
Davis, Sean L
2012-08-17
We review some of the classic numerical techniques used to analyze contact discontinuities and compare their effectiveness. Several finite difference methods (the Lax-Wendroff method, a Multidimensional Positive Definite Advection Transport Algorithm (MPDATA) method and a Monotone Upstream Scheme for Conservation Laws (MUSCL) scheme with an Artificial Compression Method (ACM)) as well as the finite element Streamlined Upwind Petrov-Galerkin (SUPG) method were considered. These methods were applied to solve the 2D advection equation. Based on our results we concluded that the MUSCL scheme produces the sharpest interfaces but can inappropriately steepen the solution. The SUPG method seems to represent a good balance between stability and interface sharpness without any inappropriate steepening. However, for solutions with discontinuities, the MUSCL scheme is superior. In addition, a preliminary implementation in a GPU program is discussed.
Singular value decomposition for collaborative filtering on a GPU
NASA Astrophysics Data System (ADS)
Kato, Kimikazu; Hosino, Tikara
2010-06-01
Collaborative filtering predicts customers' unknown preferences from known preferences. In collaborative filtering computations, a singular value decomposition (SVD) is needed to reduce the size of a large scale matrix so that the burden of the next computation phase is decreased. In this application, SVD means a roughly approximated factorization of a given matrix into smaller matrices. Webb (a.k.a. Simon Funk) showed an effective algorithm to compute the SVD as a solution for an open competition called the "Netflix Prize". The algorithm uses an iterative method so that the approximation error improves in each step of the iteration. We give a GPU version of Webb's algorithm. Our algorithm is implemented in CUDA and is shown to be efficient by an experiment.
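The iterative factorization described in this abstract can be illustrated with a small CPU sketch. This is not the authors' CUDA code: the function names, the learning rate, the regularization weight, and the choice to update all factors jointly (rather than one feature at a time) are assumptions made for illustration only.

```python
import numpy as np

def funk_svd(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Approximate a sparse rating matrix as P @ Q.T by stochastic gradient descent.

    ratings: list of (user, item, value) triples for the known preferences.
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    P = rng.normal(0.0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # error on this known rating
            pu = P[u].copy()               # keep the old user row for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error over the known ratings."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

known = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P0, Q0 = funk_svd(known, n_epochs=0)     # untrained factors
P1, Q1 = funk_svd(known, n_epochs=500)   # trained factors
print(rmse(known, P0, Q0), rmse(known, P1, Q1))
```

Each rating touches only one row of P and one row of Q, which is the kind of fine-grained work a GPU version could distribute across threads, subject to handling of conflicting updates.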
GPU-computing in econophysics and statistical physics
NASA Astrophysics Data System (ADS)
Preis, T.
2011-03-01
A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples. In particular, computationally expensive analyses employed in a financial market context are coded on a graphics card architecture, which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
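As an illustration of why the Ising model maps well onto parallel hardware, the following sketch performs Metropolis updates on a 2D lattice using a checkerboard decomposition, in which all sites of one colour can be updated at the same time. The abstract does not state which update scheme the article uses, so the checkerboard choice, the coupling J = 1, periodic boundaries, and the function name are assumptions.

```python
import numpy as np

def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of a 2D Ising lattice (J = 1, periodic boundaries).

    Sites are split into two checkerboard colours. Sites of one colour have
    no neighbours of the same colour, so they can be updated simultaneously.
    The lattice side length must be even for the colouring to be consistent.
    """
    ii, jj = np.indices(spins.shape)
    for parity in (0, 1):
        mask = (ii + jj) % 2 == parity
        # Sum of the four nearest neighbours of every site.
        nb = (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
              + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))
        dE = 2.0 * spins * nb                       # energy change if the spin flips
        accept = rng.random(spins.shape) < np.exp(-beta * dE)
        spins = np.where(mask & accept, -spins, spins)
    return spins

rng = np.random.default_rng(0)
cold = np.ones((8, 8), dtype=int)
for _ in range(5):
    cold = checkerboard_sweep(cold, beta=10.0, rng=rng)  # very low temperature
hot = checkerboard_sweep(np.ones((8, 8), dtype=int), beta=0.0, rng=rng)
print(cold.mean(), hot.mean())
```

At beta = 0 every proposed flip is accepted, so a single sweep inverts the whole lattice; at large beta an ordered lattice stays ordered. On a GPU, each site of the active colour would typically be handled by its own thread.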
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike W.
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
Explicit integration with GPU acceleration for large kinetic networks
NASA Astrophysics Data System (ADS)
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike
2015-12-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
GPU-accelerated Monte Carlo convolution∕superposition implementation for dose calculation
Zhou, Bo; Yu, Cedric X.; Chen, Danny Z.; Hu, X. Sharon
2010-01-01
Purpose: Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution∕superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution∕superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Methods: Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors’ GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. Results: A speedup in the range of 6.7–11.4× is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors’ GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. Conclusions: This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article. PMID:21158271
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing
NASA Astrophysics Data System (ADS)
Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.
2014-12-01
After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous as high-accuracy observation networks have progressed, and many parameters (e.g., positional information, origin time, magnitude, etc.) must be handled to display the seismic information efficiently. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, the GPU (Graphics Processing Unit) has been used as an acceleration tool for data processing and calculation in various fields of study. This movement is called GPGPU (General Purpose computing on GPUs). In the last few years, GPU performance has kept improving rapidly. GPU computing provides a high-performance computing environment at a lower cost than before. Moreover, the GPU has an advantage for visualizing processed data, because it was originally an architecture for graphics processing. In GPU computing, the processed data is always stored in the video memory. Therefore, we can write drawing information directly to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize a full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer, which enables high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and the feasibility of a high-speed visualization system for hypocenter data.
Efficient simulation of diffusion-based choice RT models on CPU and GPU.
Verdonck, Stijn; Meers, Kristof; Tuerlinckx, Francis
2016-03-01
In this paper, we present software for the efficient simulation of a broad class of linear and nonlinear diffusion models for choice RT, using either CPU or graphical processing unit (GPU) technology. The software is readily accessible from the popular scripting languages MATLAB and R (both 64-bit). The speed obtained on a single high-end GPU is comparable to that of a small CPU cluster, bringing standard statistical inference of complex diffusion models to the desktop platform. PMID:25761391
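To make the kind of computation in this abstract concrete, the sketch below simulates a basic two-boundary drift diffusion model with the Euler-Maruyama method, stepping many trials at once. It does not reproduce the software described above: the function name, parameter names, and the specific model (constant drift, absorbing boundaries at 0 and `boundary`) are assumptions for illustration.

```python
import numpy as np

def simulate_ddm(n_trials, drift, boundary, start, noise=1.0,
                 dt=1e-3, max_t=5.0, seed=0):
    """Simulate choice response times from a simple drift diffusion model.

    Evidence x starts at `start` and evolves as dx = drift*dt + noise*dW
    until it crosses 0 (choice -1) or `boundary` (choice +1).
    Trials that do not finish before max_t keep rt = NaN and choice = 0.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_trials, float(start))
    rt = np.full(n_trials, np.nan)
    choice = np.zeros(n_trials, dtype=int)
    active = np.ones(n_trials, dtype=bool)
    n_steps = int(round(max_t / dt))
    for k in range(1, n_steps + 1):
        n_active = int(active.sum())
        if n_active == 0:
            break
        # Euler-Maruyama step for all trials that are still running.
        x[active] += drift * dt + noise * np.sqrt(dt) * rng.standard_normal(n_active)
        hit_up = active & (x >= boundary)
        hit_lo = active & (x <= 0.0)
        rt[hit_up | hit_lo] = k * dt
        choice[hit_up] = 1
        choice[hit_lo] = -1
        active &= ~(hit_up | hit_lo)
    return rt, choice

rt, choice = simulate_ddm(10000, drift=1.0, boundary=1.0, start=0.5)
rt_det, choice_det = simulate_ddm(3, drift=1.0, boundary=1.0, start=0.5,
                                  noise=0.0, dt=0.01)
print(np.nanmean(rt), (choice == 1).mean())
```

Because every trial is independent, the per-trial loop body is a natural candidate for one GPU thread per trial; the vectorized NumPy update above plays the same role on the CPU.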
2010-01-01
Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a starting point for modelers to
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, the resolution of hyperspectral images has increased greatly, and their data size is growing accordingly. In order to reduce storage and transmission costs, lossless compression of hyperspectral images has become an important research topic. In recent years, a large number of algorithms have been proposed to reduce the redundancy between different spectra. Among them, the most classical and extensible is the Clustered Differential Pulse Code Modulation (C-DPCM) algorithm. This algorithm contains three parts: first, cluster all spectral lines and train linear predictors for each band; second, use these predictors to predict pixels and obtain the residual image by subtracting the predicted image from the original image; finally, encode the residual image. However, calculating the predictors is time-consuming. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has developed considerably. The capacity of GPUs improves rapidly through increases in the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to calculate the predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally obtain a decent speedup.
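The predictor training and residual steps listed in this abstract can be sketched for a single cluster as follows. This is a simplified illustration, not the authors' implementation: the clustering and entropy coding stages are omitted, and the function names, the least-squares fit, and the predictor order are assumptions.

```python
import numpy as np

def band_residual(cube, band, order=2):
    """Predict one band from the `order` previous bands and return the residual.

    cube: integer array of shape (bands, pixels) holding the spectral lines of
    ONE cluster (the clustering step itself is not shown here).
    """
    X = cube[band - order:band].T.astype(np.float64)  # (pixels, order)
    y = cube[band].astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # train the linear predictor
    predicted = np.rint(X @ coeffs).astype(np.int64)
    residual = cube[band].astype(np.int64) - predicted
    return coeffs, residual

def reconstruct_band(cube, band, coeffs, residual):
    """Decoder side: rebuild the band from previous bands, coefficients, residual."""
    order = len(coeffs)
    X = cube[band - order:band].T.astype(np.float64)
    predicted = np.rint(X @ coeffs).astype(np.int64)
    return predicted + residual

rng = np.random.default_rng(0)
base = rng.integers(100, 1000, size=50)
# Four synthetic, strongly correlated bands for one cluster of 50 pixels.
cube = np.stack([base + rng.integers(-5, 6, size=50) + 10 * b for b in range(4)])
coeffs, residual = band_residual(cube, band=3, order=2)
restored = reconstruct_band(cube, 3, coeffs, residual)
print(np.abs(residual).max(), np.array_equal(restored, cube[3]))
```

The least-squares fits for different bands and clusters are independent of one another, which is what makes the predictor calculation a candidate for parallel execution.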
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. The developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. PMID:25805449
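The sliding-window idea mentioned in this abstract can be illustrated with a short CPU sketch that computes a correlation for each window of two time-courses. Using Pearson correlation as the connectivity measure, as well as the function name and parameters, is an assumption for illustration; the abstract does not specify the measure.

```python
import numpy as np

def sliding_window_correlation(x, y, window, step=1):
    """Pearson correlation of two time-courses inside each sliding window.

    Returns one value per window position. Each window is independent of the
    others, so windows (and region pairs) can be processed in parallel.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    starts = range(0, len(x) - window + 1, step)
    return np.array([np.corrcoef(x[s:s + window], y[s:s + window])[0, 1]
                     for s in starts])

t = np.arange(20, dtype=float)
signal = np.sin(0.7 * t) + 0.1 * t
same = sliding_window_correlation(signal, 2.0 * signal + 1.0, window=5)
opposite = sliding_window_correlation(signal, -signal, window=5)
print(same.shape, same.min(), opposite.max())
```

In a multi-region analysis the same function would be applied to every pair of regions, which multiplies the number of independent windows available for parallel execution.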
Three Dimensional TEM Forward Modeling Using FDTD Accelerated by GPU
NASA Astrophysics Data System (ADS)
Li, Z.; Huang, Q.
2015-12-01
Three dimensional inversion of transient electromagnetic (TEM) data is still challenging. The inversion speed mostly depends on the forward modeling. The finite-difference time-domain (FDTD) method is one of the popular forward modeling schemes. In its explicit form, which is based on the Du Fort-Frankel scheme, the time step is constrained by the quasi-static approximation. Often an upward-continuation boundary condition (UCBC) is applied at the earth-air surface to avoid time stepping in the model air. However, UCBC is not suitable for models with topography and has a low parallel efficiency. Modeling without UCBC may require a much smaller time step because of the resistive nature of the air and the quasi-static constraint, which may also lower the efficiency greatly. Our recent research shows that the time step in the model air does not need to be constrained by the quasi-static approximation, which brings the time step without UCBC much closer to that with UCBC. The parallel performance of FDTD is then largely unlocked. On a computer with a 4-core CPU, this newly developed method is clearly faster than the method using UCBC. Moreover, without UCBC, this method can be easily accelerated by a Graphics Processing Unit (GPU). On a computer with a 4790K CPU at 4.4 GHz and a GTX 970 GPU, the CUDA-accelerated version is almost 10 times faster than the CPU-only version. For a model with a grid size of 140×140×130, if the conductivity of the model earth is 0.02 S/m and the minimum spatial interval is 15 m, it takes only 80 seconds to evolve the field from excitation to 0.032 s.
High quality GPU rendering with displaced pixel shading
NASA Astrophysics Data System (ADS)
Zhang, Hui; Choi, Jae
2006-03-01
Direct volume rendering via consumer PC hardware has become an efficient tool for volume visualization. In particular, volumetric ray casting, the most frequently used volume rendering technique, can be implemented in the shading language integrated with graphical processing units (GPU). However, to produce high-quality images with GPU-based volume rendering, a higher sampling rate is usually required. In this paper, we present an algorithm to generate high quality images with a small number of slices by utilizing a displaced pixel shading technique. Instead of sampling points along a ray at a regular interval, the actual surface location is calculated by linear interpolation between the outer and inner points, and this location is used as the displaced pixel for the iso-surface illumination. Multi-pass and early Z-culling techniques are applied to improve the rendering speed. The first pass simply locates and stores the exact surface depth of each ray using a few pixel instructions; then, the second pass uses instructions to shade the surface at the previous position. A new 3D edge detector from our previous research is integrated to provide more realistic rendering results compared with the widely used gradient normal estimator. To implement our algorithm, we have made a program named DirectView based on DirectX 9.0c and Microsoft High Level Shading Language (HLSL) for volume rendering. We tested two data sets and discovered that our algorithm can generate smoother and more accurate shading images with a small number of intermediate slices.
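The interpolation step described in this abstract, locating the surface between an outer and an inner sample along a ray, can be written out as a small worked example. The function name, argument names, and the clamping of the interpolation weight are assumptions; the original is implemented in HLSL, not Python.

```python
import numpy as np

def interpolate_surface_point(p_outer, v_outer, p_inner, v_inner, iso):
    """Estimate where a ray crosses the iso-surface between two samples.

    p_outer, p_inner: 3D positions of the last sample outside and the first
    sample inside the surface. v_outer, v_inner: scalar values sampled there.
    Assumes the scalar field varies linearly between the two samples.
    """
    p_outer = np.asarray(p_outer, dtype=float)
    p_inner = np.asarray(p_inner, dtype=float)
    if v_inner == v_outer:
        return 0.5 * (p_outer + p_inner)    # degenerate case: use the midpoint
    t = (iso - v_outer) / (v_inner - v_outer)
    t = min(max(t, 0.0), 1.0)               # stay between the two samples
    return p_outer + t * (p_inner - p_outer)

hit = interpolate_surface_point((0.0, 0.0, 0.0), 0.2, (0.0, 0.0, 1.0), 0.6, iso=0.5)
print(hit)
```

With samples of 0.2 and 0.6 and an iso-value of 0.5, the weight is (0.5 - 0.2) / (0.6 - 0.2) = 0.75, so the estimated surface point lies three quarters of the way from the outer sample to the inner one.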
High energy electromagnetic particle transportation on the GPU
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-01-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of many-core and accelerated processor architectures provides a variety of concurrent programming models applicable not only to high performance parallel computing, but also to conventional compute-intensive applications such as HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL, with the aim of being generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random number generation, data structures, kernel divergence and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
Real-time GPU surface curvature estimation on deforming meshes and volumetric data sets.
Griffin, Wesley; Wang, Yu; Berrios, David; Olano, Marc
2012-10-01
Surface curvature is used in a number of areas in computer graphics, including texture synthesis and shape representation, mesh simplification, surface modeling, and nonphotorealistic line drawing. Most real-time applications must estimate curvature on a triangular mesh. This estimation has been limited to CPU algorithms, forcing object geometry to reside in main memory. However, as more computational work is done directly on the GPU, it is increasingly common for object geometry to exist only in GPU memory. Examples include vertex skinned animations and isosurfaces from GPU-based surface reconstruction algorithms. For static models, curvature can be precomputed and CPU algorithms are a reasonable choice. For deforming models where the geometry only resides on the GPU, transferring the deformed mesh back to the CPU limits performance. We introduce a GPU algorithm for estimating curvature in real time on arbitrary triangular meshes. We demonstrate our algorithm with curvature-based NPR feature lines and a curvature-based approximation for an ambient occlusion. We show curvature computation on volumetric data sets with a GPU isosurface extraction algorithm and vertex-skinned animations. We present a graphics pipeline and CUDA implementation. Our curvature estimation is up to ~18x faster than a multithreaded CPU benchmark. PMID:22508906
Suchard, Marc A.; Wang, Quanli; Chan, Cliburn; Frelinger, Jacob; Cron, Andrew; West, Mike
2010-01-01
This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies generating many, very large datasets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. PMID:20877443
Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)
NASA Astrophysics Data System (ADS)
Hagiwara, K.; Kanzaki, J.; Okamura, N.; Rainwater, D.; Stelzer, T.
2010-11-01
We use a graphics processing unit (GPU) for fast calculations of helicity amplitudes of quark and gluon scattering processes in massless QCD. New HEGET (HELAS Evaluation with GPU Enhanced Technology) codes for gluon self-interactions are introduced, and a C++ program to convert the MadGraph generated FORTRAN codes into HEGET codes in CUDA (a C-platform for general purpose computing on GPU) is created. Because of the proliferation of the number of Feynman diagrams and the number of independent color amplitudes, the maximum number of final state jets we can evaluate on a GPU is limited to 4 for pure gluon processes (gg→4g), or 5 for processes with one or more quark lines such as qq̄→5g and qq→qq+3g. Compared with the usual CPU-based programs, we obtain 60-100 times better performance on the GPU, except for 5-jet production processes and the gg→4g processes for which the GPU gain over the CPU is about 20.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum
2015-01-15
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Compilation of atomic and molecular data from the UV to the near-IR for use in spectral synthesis
NASA Astrophysics Data System (ADS)
Coelho, P.; Barbuy, B.; Melendez, J.; Allen, D. M.; Castilho, B.
2003-08-01
Synthetic spectra are useful in a wide variety of applications, from abundance analysis in high-resolution stellar spectra to the study of stellar populations in integrated spectra. The reliability of a synthetic spectrum depends on the adopted model atmosphere, the line-formation code, and the quality of the atomic and molecular data, which are decisive in computing the photospheric opacities. Our group in the Department of Astronomy at IAG has used synthetic spectra for more than 15 years, in applications aimed mainly at the abundance analysis of G, K and M stars and of old stellar populations. Over this time, the line lists have been built and updated continuously, and some recent additions can be cited: Castilho (1999, atoms and molecules in the UV), Schiavon (1998, TiO molecular bands) and Melendez (2001, atoms and molecules in the near-IR). In order to compute a grid of spectra from the UV to the near-IR for use in the study of old stellar populations, it became necessary to compile and homogenize the various lists into a single atomic list and a single molecular list. In this process, the new compiled list was cross-matched with other databases (NIST, Kurucz Database, O'Brian et al. 1991) to update the parameters that characterize the atomic transition (wavelength, log gf and excitation potential). Additionally, the C6 interaction constants were calculated according to the theory of Anstee & O'Mara (1995) and later papers. The molecular bands of CH and CN were recalculated with the LIFBASE program (Luque & Crosley 1999). This poster details the procedures mentioned above, the comparisons between spectra computed with the new lists and high-resolution observed spectra of the Sun and Arcturus, and an analysis of the impact of using different model atmospheres on the synthetic spectrum. Ao
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei (E-mail: Xun.Jia@UTSouthwestern.edu)
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
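The storage change mentioned in the Methods (coordinate list on the CPU, compressed sparse row on the GPUs) can be sketched as follows. This is a minimal pure-Python illustration and not the authors' code; the function names are hypothetical, and the per-beam-angle split into four submatrices is not shown.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert a COO matrix (parallel lists) to CSR (row_ptr, col_idx, data)."""
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1
    # row_ptr[r] is the offset of the first stored entry of row r.
    row_ptr = [0] * (n_rows + 1)
    for r in range(n_rows):
        row_ptr[r + 1] = row_ptr[r] + counts[r]
    next_slot = row_ptr[:-1]          # slicing copies the list
    col_idx = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k] = c
        data[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, data


def csr_matvec(row_ptr, col_idx, data, x):
    """Multiply a CSR matrix by a dense vector, one row at a time."""
    return [
        sum(data[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


if __name__ == "__main__":
    row_ptr, col_idx, data = coo_to_csr(3, [2, 0, 1, 0], [1, 0, 2, 2],
                                        [5.0, 1.0, 3.0, 2.0])
    print(row_ptr, col_idx, data)
    print(csr_matvec(row_ptr, col_idx, data, [1.0, 1.0, 1.0]))
```

CSR keeps each row's entries contiguous, which suits a row-per-thread (or row-per-warp) sparse matrix-vector product.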
NASA Astrophysics Data System (ADS)
Su, Lin; Du, Xining; Liu, Tianyu; Xu, X. George
2014-06-01
An electron-photon coupled Monte Carlo code ARCHER -
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
Xiao, K; Chen, D. Z; Hu, X. S; Zhou, B
2014-06-01
Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF
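The three-step split of dose deposition described above can be mimicked on the CPU alone. The sketch below is only a stand-in: lists replace GPU memory, the transfer in step (2) is implicit, and all function names are hypothetical rather than taken from the MCCS program.

```python
def gpu_stage_write_buffer(ray_hits):
    """Step (1) stand-in: every ray ("thread") emits (voxel, dose) records.

    Records are appended in order instead of being added to the voxels,
    so this stage needs no atomic updates.
    """
    return [(voxel, dose) for hits in ray_hits for (voxel, dose) in hits]


def cpu_stage_accumulate(buffer, n_voxels):
    """Step (3) stand-in: build the dose volume from the transferred buffer."""
    volume = [0.0] * n_voxels
    for voxel, dose in buffer:
        volume[voxel] += dose
    return volume


if __name__ == "__main__":
    # Three rays; two of them deposit into voxel 2.
    hits = [[(0, 1.0), (2, 0.5)], [(2, 0.25)], []]
    buf = gpu_stage_write_buffer(hits)
    print(cpu_stage_accumulate(buf, n_voxels=4))
```

The scattered, conflicting writes are moved to the final stage, where a single sequential pass resolves them.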
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
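The mid-point insertion strategy can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: the function name is hypothetical, one temperature is inserted per call, and the measured exchange rates are assumed to be supplied by the caller.

```python
def insert_midpoint_at_worst_gap(temps, exchange_rates):
    """Insert one temperature in the gap with the lowest exchange rate.

    `temps` is a sorted list of replica temperatures and
    `exchange_rates[i]` is the measured exchange rate between
    temps[i] and temps[i + 1].
    """
    if len(exchange_rates) != len(temps) - 1:
        raise ValueError("need exactly one exchange rate per temperature gap")
    worst = min(range(len(exchange_rates)), key=exchange_rates.__getitem__)
    mid = 0.5 * (temps[worst] + temps[worst + 1])
    return temps[:worst + 1] + [mid] + temps[worst + 1:]


if __name__ == "__main__":
    print(insert_midpoint_at_worst_gap([1.0, 2.0, 4.0], [0.5, 0.1]))
```

Repeating the call after re-measuring the exchange rates would refine the temperature set gap by gap.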
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R; Dongarra, Jack J
2015-01-01
As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
3D Laplace-domain full waveform inversion using a single GPU card
NASA Astrophysics Data System (ADS)
Shin, Jungkyun; Ha, Wansoo; Jun, Hyunggu; Min, Dong-Joo; Shin, Changsoo
2014-06-01
The Laplace-domain full waveform inversion is an efficient long-wavelength velocity estimation method for seismic datasets lacking low-frequency components. However, to invert a 3D velocity model, a large cluster of CPU cores has commonly been required to overcome the extremely long computing time caused by a large impedance matrix and a number of source positions. In this study, a workstation with a single GPU card (NVIDIA GTX 580) is successfully used for the 3D Laplace-domain full waveform inversion rather than a large cluster of CPU cores. To exploit a GPU for our inversion algorithm, the routine for the iterative matrix solver is ported to the CUDA programming language for forward and backward modeling parts with minimized modification of the remaining parts, which were originally written in Fortran 90. Using a uniformly structured grid set, nonzero values in the sparse impedance matrix can be arranged according to certain rules, which efficiently parallelize the preconditioned conjugate gradient method for a number of threads contained in the GPU card. We perform a numerical experiment to verify the accuracy of floating point operations performed by a GPU to calculate the Laplace-domain wavefield. We also measure the efficiencies of the original CPU and modified GPU programs using a cluster of CPU cores and a workstation with a GPU card, respectively. Through the analysis, the parallelized inversion code for a GPU achieves a speedup of 14.7-24.6x compared to a CPU-based serial code depending on the degrees of freedom of the impedance matrix. Finally, the practicality of the proposed algorithm is examined by inverting a 3D long-wavelength velocity model using wide azimuth real datasets in 3.7 days.
NASA Astrophysics Data System (ADS)
Ammazzalorso, F.; Bednarz, T.; Jelen, U.
2014-03-01
We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization
Ruymgaart, A. Peter; Elber, Ron
2012-01-01
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is a four-fold improvement over the factor of 10 reported in our initial GPU implementation, which did not include water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME). PMID:23264758
Single-pass GPU-raycasting for structured adaptive mesh refinement data
NASA Astrophysics Data System (ADS)
Kaehler, Ralf; Abel, Tom
2013-01-01
Structured Adaptive Mesh Refinement (SAMR) is a popular numerical technique to study processes with high spatial and temporal dynamic range. It reduces computational requirements by adapting the lattice on which the underlying differential equations are solved to most efficiently represent the solution. Particularly in astrophysics and cosmology, such simulations can now capture spatial scales ten orders of magnitude apart and more. The irregular locations and extensions of the refined regions in the SAMR scheme, and the fact that different resolution levels partially overlap, pose a challenge for GPU-based direct volume rendering methods. kD-trees have proven to be advantageous to subdivide the data domain into non-overlapping blocks of equally sized cells, optimal for the texture units of current graphics hardware, but previous GPU-supported raycasting approaches for SAMR data using this data structure required a separate rendering pass for each node, preventing the application of many advanced lighting schemes that require simultaneous access to more than one block of cells. In this paper we present the first single-pass GPU-raycasting algorithm for SAMR data that is based on a kD-tree. The tree is efficiently encoded by a set of 3D-textures, which allows complete rays to be sampled adaptively entirely on the GPU without any CPU interaction. We discuss two different data storage strategies to access the grid data on the GPU and apply them to several datasets to prove the benefits of the proposed method.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Units (GPUs). The presented algorithm improves on our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. AIDW needs to find several nearest neighboring data points for each interpolated point in order to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of two stages: kNN search and weighted interpolation. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm. PMID:27610308
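The two stages named in this abstract, grid-based kNN search followed by weighted interpolation, can be sketched serially as below. This is a simplified 2D illustration under stated assumptions: the function names are hypothetical, the power parameter is passed in as a fixed value (the adaptive rule from the paper is not reproduced), and nothing here runs on a GPU.

```python
import math


def build_grid(points, cell):
    """Bucket data points (x, y, value) into square cells of side `cell`."""
    grid = {}
    for idx, (x, y, _value) in enumerate(points):
        key = (math.floor(x / cell), math.floor(y / cell))
        grid.setdefault(key, []).append(idx)
    return grid


def knn(points, grid, cell, qx, qy, k):
    """Return indices of the k nearest points, scanning cell rings outward."""
    k = min(k, len(points))
    cx, cy = math.floor(qx / cell), math.floor(qy / cell)
    found = []
    ring = 0
    while True:
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue  # visit only the cells on the current ring
                for idx in grid.get((cx + dx, cy + dy), ()):
                    px, py, _ = points[idx]
                    found.append((math.hypot(px - qx, py - qy), idx))
        found.sort()
        # Points in unvisited cells are at least ring * cell away, so the
        # search can stop once the k-th candidate is within that radius.
        if k == 0 or (len(found) >= k and found[k - 1][0] <= ring * cell):
            return [idx for _, idx in found[:k]]
        ring += 1


def idw(points, neighbours, qx, qy, power):
    """Inverse-distance-weighted estimate from the chosen neighbours."""
    num = den = 0.0
    for idx in neighbours:
        px, py, value = points[idx]
        d = math.hypot(px - qx, py - qy)
        if d == 0.0:
            return value  # query coincides with a data point
        w = d ** -power
        num += w * value
        den += w
    return num / den


if __name__ == "__main__":
    pts = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (9.0, 9.0, 99.0)]
    g = build_grid(pts, cell=1.0)
    near = knn(pts, g, 1.0, 1.0, 0.0, k=2)
    print(near, idw(pts, near, 1.0, 0.0, power=2.0))
```

The grid confines each query to a few nearby cells instead of all data points, which is the source of the kNN speedup the abstract reports.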
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
NASA Astrophysics Data System (ADS)
Kargaran, Hamed; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
2016-04-01
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of the developed GPPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of a PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of the developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs, such as those of MATLAB and FORTRAN and the Miller-Park algorithm, by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
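The independent sequence idea, where each thread advances its own generator state, can be sketched with the standard 32-bit Xorshift step (shifts 13, 17, 5). This is not the GPPRNG itself: the middle-square and chaotic-map seeding from the paper is not reproduced, the seeds are simply supplied by the caller, and `parallel_streams` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """Advance a 32-bit Xorshift state by one step (shifts 13, 17, 5)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def parallel_streams(seeds, n):
    """Generate n uniform numbers in [0, 1) for each seed.

    Each seed plays the role of one thread's private state; on a GPU the
    outer loop would be the thread index.
    """
    streams = []
    for seed in seeds:
        state = seed & MASK32
        if state == 0:
            raise ValueError("Xorshift state must be non-zero")
        out = []
        for _ in range(n):
            state = xorshift32(state)
            out.append(state / 2**32)
        streams.append(out)
    return streams


if __name__ == "__main__":
    print(xorshift32(1))
    print(parallel_streams([1, 2, 3], n=2))
```

Because the streams share no state, no synchronization is needed; the quality of the result then depends on how well the seeds decorrelate the sequences, which is the part the paper's seeding scheme addresses.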
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu
2011-07-01
Graphics Processing Units (GPUs), originally developed for real-time, high-definition 3D graphics in computer games, now provide great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with the vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations. In simulations using the spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters on the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To optimize further, we investigate three different algorithms for the SHT using GPUs. The first is to precompute the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. Thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
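The first approach (precompute the polynomials once, reuse them in every time step) trades memory for time. The sketch below shows that idea only. As a simplification it tabulates ordinary Legendre polynomials with Bonnet's recurrence, whereas a real SHT uses associated Legendre functions; the function names are hypothetical and not taken from Calypso.

```python
def legendre_table(max_degree, xs):
    """table[n][j] = P_n(xs[j]), built once with Bonnet's recurrence:
    (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)."""
    table = [[1.0] * len(xs)]
    if max_degree >= 1:
        table.append([float(x) for x in xs])
    for n in range(1, max_degree):
        prev, cur = table[n - 1], table[n]
        table.append([((2 * n + 1) * x * c - n * p) / (n + 1)
                      for x, c, p in zip(xs, cur, prev)])
    return table

def synthesize(coeffs, table):
    """Backward-transform-like sum f(x_j) = sum_n coeffs[n] * P_n(x_j), reusing the table."""
    n_points = len(table[0])
    return [sum(c * table[n][j] for n, c in enumerate(coeffs))
            for j in range(n_points)]
```

Inside a time integration loop only `synthesize` would be called; the table is computed once up front. Computing the polynomials on the fly instead (the second approach) avoids storing the table at the cost of repeating the recurrence.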
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
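The division of labour described above can be sketched as a two-stage pipeline. The abstract does not give the pre-filter itself, so the one below is an assumed, deliberately simple design: a packet is flagged if it contains the first two bytes of any signature. The full matcher is a plain substring search standing in for the Aho-Corasick stage that would run on the GPU. All names are hypothetical.

```python
def build_prefilter(signatures, n=2):
    """Collect the first n bytes of every signature (ASSUMED pre-filter design)."""
    if any(len(sig) < n for sig in signatures):
        raise ValueError("every signature must have at least n bytes")
    return {sig[:n] for sig in signatures}

def prefilter(payload, prefixes, n=2):
    """Cheap CPU-side test: does any n-byte window match a signature prefix?"""
    return any(payload[i:i + n] in prefixes
               for i in range(len(payload) - n + 1))

def full_match(payload, signatures):
    """Stand-in for the full pattern-matching stage (GPU side in the paper)."""
    return [sig for sig in signatures if sig in payload]

def inspect(packets, signatures):
    """Filter all packets, then fully inspect only the suspicious ones."""
    prefixes = build_prefilter(signatures)
    suspicious = [p for p in packets if prefilter(p, prefixes)]
    return [(p, full_match(p, signatures)) for p in suspicious]
```

The pre-filter may produce false positives (a packet is forwarded but nothing matches), but with this design it cannot miss a packet that contains a complete signature, because every signature begins with one of the stored prefixes.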
Comparison of CPU and GPU based coding on low-complexity algorithms for display signals
NASA Astrophysics Data System (ADS)
Richter, Thomas; Simon, Sven
2013-09-01
Graphics Processing Units (GPUs) are freely programmable massively parallel general purpose processing units and thus offer the opportunity to off-load heavy computations from the CPU to the GPU. One application for GPU programming is image compression, where the massively parallel nature of GPUs promises high speed benefits. This article analyzes the predicaments of data-parallel image coding using the example of two high-throughput coding algorithms. The codecs discussed here were designed to answer a call from the Video Electronics Standards Association (VESA), and require only minimal buffering at the encoder and decoder side while avoiding any pixel-based feedback loops that would limit the operating frequency of hardware implementations. Comparing CPU and GPU implementations of the codes shows that GPU based codes are usually not considerably faster, or perform only with less than ideal rate-distortion performance. Analyzing the details of this result provides theoretical evidence that, for any coding engine, either parts of the entropy coding and bit-stream build-up must remain serial, or rate-distortion penalties must be paid when offloading all computations to the GPU.
Raytracing Dynamic Scenes on the GPU Using Grids.
Guntury, S; Narayanan, P J
2012-01-01
Raytracing dynamic scenes at interactive rates has received a lot of attention recently. We present a few strategies for high performance raytracing on a commodity GPU. The construction of grids needs sorting, which is fast on today's GPUs. The grid is thus the acceleration structure of choice for dynamic scenes as per-frame rebuilding is required. We advocate the use of appropriate data structures for each stage of raytracing, resulting in multiple structure building per frame. A perspective grid built for the camera achieves perfect coherence for primary rays. A perspective grid built with respect to each light source provides the best performance for shadow rays. Spherical grids handle lights positioned inside the model space and handle spotlights. Uniform grids are best for reflection and refraction rays with little coherence. We propose an Enforced Coherence method to bring coherence to them by rearranging the ray to voxel mapping using sorting. This gives the best performance on GPUs with only user-managed caches. We also propose a simple, Independent Voxel Walk method, which performs best by taking advantage of the L1 and L2 caches on recent GPUs. We achieve over 10 fps of total rendering on the Conference model with one light source and one reflection bounce, while rebuilding the data structure for each stage. Ideas presented here are likely to give high performance on future GPUs as well as other manycore architectures. PMID:21383409
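The remark that grid construction reduces to sorting can be made concrete with a short serial sketch. It assumes a uniform grid anchored at the origin and, as a simplification, bins each primitive by a single centroid (a real builder would insert a triangle into every cell it overlaps). The function name and layout are hypothetical.

```python
def build_grid_by_sorting(centroids, cell_size, dims):
    """Sort (cell_id, primitive_id) pairs, then record where each cell's run starts.

    ASSUMPTIONS: coordinates are non-negative, and each primitive is
    represented by one centroid only.
    """
    nx, ny, nz = dims
    pairs = []
    for prim, (x, y, z) in enumerate(centroids):
        ix = min(int(x / cell_size), nx - 1)
        iy = min(int(y / cell_size), ny - 1)
        iz = min(int(z / cell_size), nz - 1)
        pairs.append(((iz * ny + iy) * nx + ix, prim))
    pairs.sort()  # on a GPU this would be a parallel key-value sort
    cell_start = {}
    for pos, (cell, _prim) in enumerate(pairs):
        cell_start.setdefault(cell, pos)  # first position of each occupied cell
    return pairs, cell_start
```

After the sort, all primitives of a cell are contiguous, so a ray walking through a cell reads one compact range. Rebuilding per frame costs one sort plus one linear pass.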
Accelerated finite element elastodynamic simulations using the GPU
Huthwaite, Peter
2014-01-15
An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source.
Multi-GPU Accelerated Simulation of Dynamically Evolving Fluid Pathways
NASA Astrophysics Data System (ADS)
Räss, Ludovic; Omlin, Samuel; Moulas, Evangelos; Simon, Nina S. C.; Podladchikov, Yuri
2014-05-01
Fluid flow in porous rocks, both naturally occurring and caused by reservoir operations, mostly takes place along localized high permeability pathways. Pervasive flooding of the rock matrix is rarely observed, in particular for low permeability rocks. The pathways appear to form dynamically in response to the fluid flow itself; the number of pathways, their location and their hydraulic conductivity may change in time. We propose a physically and thermodynamically consistent model that describes the formation and evolution of fluid pathways. The model consists of a system of equations describing poro-elasto-viscous deformation and flow. We have implemented the strongly coupled equations in a numerical model. Nonlinearity of the solid rheology is also taken into account. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme. We have ported it to C-CUDA using MPI to run it on multi-GPU clusters. Numerical tuning of the application based on memory bandwidth throughput allows it to approach hardware peak performance. The high-resolution three-dimensional simulations we conducted predict the formation of dynamically evolving high porosity and permeability pathways as a natural outcome of porous flow coupled with rock deformation.
A GPU Reaction Diffusion Soil-Microbial Model
NASA Astrophysics Data System (ADS)
Falconer, Ruth; Houston, Alasdair; Schmidt, Sonja; Otten, Wilfred
2014-05-01
Parallelised algorithms are frequent in bioinformatics as a consequence of the close link to informatics; however, in the field of soil science and ecology they are less prevalent. A current challenge in soil ecology is to link habitat structure to microbial dynamics. Soil science is therefore entering the 'big data' paradigm as a consequence of integrating data pertinent to the physical soil environment, obtained via imaging, with theoretical models describing growth and development of microbial dynamics, permitting accurate analyses of spatio-temporal properties of different soil microenvironments. The microenvironment is often captured by 3D imaging (CT tomography), which yields large datasets; when these are used in computational studies, the physical sizes of the samples that are amenable to computation are less than 1 cm³. Today's commodity graphics cards are programmable and possess a data parallel architecture that in many cases is capable of out-performing the CPU in terms of computational rates. The programmable aspect is achieved via low-level parallel programming languages (CUDA, OpenCL and DirectX). We ported a Soil-Microbial Model onto the GPU using the DirectX Compute API. We noted a significant computational speed-up as well as an increase in the physical size that can be simulated. The drawbacks of such an approach included limited numerical precision and the steep learning curve associated with GPGPU technologies.
FARGO3D: A New GPU-oriented MHD Code
NASA Astrophysics Data System (ADS)
Benítez-Llambay, Pablo; Masset, Frédéric S.
2016-03-01
We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.
Toward Fast Computation of Dense Image Correspondence on the GPU
Duchaineau, M; Cohen, J; Vaidya, S
2007-08-13
Large-scale video processing systems are needed to support human analysis of massive collections of image streams. Video, from both current small-format and future large-format camera systems, constitutes the single largest data source of the near future, dwarfing the output of all other data sources combined. A critical component to further advances in the processing and analysis of such video streams is the ability to register successive video frames into a common coordinate system at the pixel level. This capability enables further downstream processing, such as background/mover segmentation, 3D model extraction, and compression. We present here our recent work on computing these correspondences. We employ a coarse-to-fine hierarchical approach, matching pixels from the domain of a source image to the domain of a target image at successively higher resolutions. Our diamond-style image hierarchy, with total pixel counts increasing by only a factor of two at each level, improves the prediction quality as we advance from level to level, and reduces potential grid artifacts in the results. We demonstrate the quality of our approach on real aerial city imagery. We find that registration accuracy is generally on the order of one quarter of a pixel. We also benchmark the fundamental processing kernels on the GPU to show the promise of the approach for real-time video processing applications.
Modern Methods of Bundle Adjustment on the GPU
NASA Astrophysics Data System (ADS)
Hänsch, R.; Drude, I.; Hellwich, O.
2016-06-01
The task of computing 3D reconstructions from large amounts of data has become an active field of research in recent years. Based on an initial estimate provided by structure from motion, bundle adjustment seeks to find a solution that is optimal for all cameras and 3D points. The corresponding nonlinear optimization problem is usually solved by the Levenberg-Marquardt algorithm combined with conjugate gradient descent. While many adaptations and extensions to the classical bundle adjustment approach have been proposed, only a few works consider the acceleration potential of GPU systems. This paper elaborates the possibilities for time and space savings when fitting the implementation strategy to the terms and requirements of realizing a bundler on heterogeneous CPU-GPU systems. Instead of focusing on the standard approach of Levenberg-Marquardt optimization alone, nonlinear conjugate gradient descent and alternating resection-intersection are studied as two alternatives. The experiments show that alternating resection-intersection in particular reaches low error rates very fast, but converges to larger error rates than Levenberg-Marquardt. PBA, as one of the current state-of-the-art bundlers, converges slower in 50 % of the test cases and needs 1.5-2 times more memory than the Levenberg-Marquardt implementation.
GPU acceleration of particle-in-cell methods
NASA Astrophysics Data System (ADS)
Cowan, Benjamin; Cary, John; Meiser, Dominic
2015-11-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR).
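The two PIC loop elements named above, particle deposition and field interpolation, can be illustrated in one dimension. The sketch below uses first-order (cloud-in-cell) weights on a periodic grid; it is a serial teaching example with hypothetical function names, not the GPU framework described in the abstract, which supports higher-order particles.

```python
import math

def deposit(positions, charges, n_cells, dx):
    """Scatter each particle's charge to its two nearest grid nodes (periodic 1D grid)."""
    rho = [0.0] * n_cells
    for x, q in zip(positions, charges):
        s = x / dx
        i = math.floor(s)
        frac = s - i  # fractional distance from node i toward node i + 1
        rho[i % n_cells] += q * (1.0 - frac) / dx
        rho[(i + 1) % n_cells] += q * frac / dx
    return rho

def gather(field, x, dx):
    """Interpolate a node-centred field to a particle position with the same weights."""
    n = len(field)
    s = x / dx
    i = math.floor(s)
    frac = s - i
    return field[i % n] * (1.0 - frac) + field[(i + 1) % n] * frac
```

Deposition is the awkward step on a GPU because several particles may add to the same node at once; sorting particles by cell or using atomic adds are common ways to handle this, and both are essentially memory-access design choices.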
GPU-enabled molecular dynamics simulations of ankyrin kinase complex
NASA Astrophysics Data System (ADS)
Gautam, Vertika; Chong, Wei Lim; Wisitponchai, Tanchanok; Nimmanpipug, Piyarat; Zain, Sharifuddin M.; Rahman, Noorsaadah Abd.; Tayapiwatana, Chatchai; Lee, Vannajan Sanghiran
2014-10-01
The ankyrin repeat (AR) protein can be used as a versatile scaffold for protein-protein interactions. It has been found that the heterotrimeric complex between integrin-linked kinase (ILK), PINCH, and parvin is an essential signaling platform, serving as a convergence point for integrin and growth-factor signaling and regulating cell adhesion, spreading, and migration. Using ILK-AR with high affinity for PINCH1 as our model system, we explored a structure-based computational protocol to probe and characterize binding affinity hot spots at protein-protein interfaces. In this study, long time scale GPU-accelerated molecular dynamics (MD) simulations have been performed in AMBER12 to locate the hot spots of protein-protein interaction through Molecular Mechanics-Poisson-Boltzmann Surface Area/Generalized Born Surface Area (MM-PBSA/GBSA) analysis of the MD trajectories. Our calculations suggest good binding affinity of the complex and also identify the residues critical to the binding.
GPU surface extraction using the closest point embedding
NASA Astrophysics Data System (ADS)
Kim, Mark; Hansen, Charles
2015-01-01
Isosurface extraction is a fundamental technique used for both surface reconstruction and mesh generation. One method to extract well-formed isosurfaces is a particle system; unfortunately, particle systems can be slow. In this paper, we introduce an enhanced parallel particle system that uses the closest point embedding as the surface representation to speed up the particle system for isosurface extraction. The closest point embedding is used in the Closest Point Method (CPM), a technique that uses a standard three dimensional numerical PDE solver on two dimensional embedded surfaces. To fully take advantage of the closest point embedding, it is coupled with a Barnes-Hut tree code on the GPU. This new technique produces well-formed, conformal unstructured triangular and tetrahedral meshes from labeled multi-material volume datasets. Further, this new parallel implementation of the particle system is faster than any known method for conformal multi-material mesh extraction. The resulting speed-ups gained in this implementation can reduce the time from labeled data to mesh from hours to minutes and benefit users, such as bioengineers, who employ triangular and tetrahedral meshes.
GPU-accelerated minimum distance and clearance queries.
Krishnamurthy, Adarsh; McMains, Sara; Haller, Kirk
2011-06-01
We present practical algorithms for accelerating distance queries on models made of trimmed NURBS surfaces using programmable Graphics Processing Units (GPUs). We provide a generalized framework for using GPUs as coprocessors in accelerating CAD operations. By supplementing surface data with a surface bounding-box hierarchy on the GPU, we answer distance queries such as finding the closest point on a curved NURBS surface given any point in space and evaluating the clearance between two solid models constructed using multiple NURBS surfaces. We simultaneously output the parameter values corresponding to the solution of these queries along with the model space values. Though our algorithms make use of the programmable fragment processor, the accuracy is based on the model space precision, unlike earlier graphics algorithms that were based only on image space precision. In addition, we provide theoretical bounds for both the computed minimum distance values as well as the location of the closest point. Our algorithms are at least an order of magnitude faster and about two orders of magnitude more accurate than the commercial solid modeling kernel ACIS. PMID:21474862
Immersed boundary method implemented in lattice Boltzmann GPU code
NASA Astrophysics Data System (ADS)
Devincentis, Brian; Smith, Kevin; Thomas, John
2015-11-01
Lattice Boltzmann is well suited to efficiently utilize the rapidly increasing compute power of GPUs to simulate viscous incompressible flows. Fluid-structure interaction with solids of arbitrarily complex geometry can be modeled in this framework with the immersed boundary method (IBM). In IBM a solid is modeled by its surface which applies a force at the neighboring lattice points. The majority of published IBMs require solving a linear system in order to satisfy the no-slip condition. However, the method presented by Wang et al. (2014) is unique in that it produces equally accurate results without solving a linear system. Furthermore, the algorithm can be applied in a parallel manner over the immersed boundary and is, therefore, well suited for GPUs. Here, a 2D and 3D version of their algorithm is implemented in Sailfish CFD, a GPU-based open source lattice Boltzmann code. One issue unaddressed by most published work is how to correct force and torque calculated from IBM for translating and rotating solids. These corrections are necessary because the fluid inside the solid affects its inertia in a non-trivial manner. Therefore, this implementation uses the Lagrangian points approximation correction shown by Suzuki and Inamuro (2011) to be accurate.
Deshmukh, Nishikant P.; Kang, Hyun Jae; Billings, Seth D.; Taylor, Russell H.; Hager, Gregory D.; Boctor, Emad M.
2014-01-01
A system for real-time ultrasound (US) elastography will advance interventions for the diagnosis and treatment of cancer by advancing methods such as thermal monitoring of tissue ablation. A multi-stream graphics processing unit (GPU) based accelerated normalized cross-correlation (NCC) elastography, with a maximum frame rate of 78 frames per second, is presented in this paper. A study of NCC window size is undertaken to determine the effect on frame rate and the quality of output elastography images. This paper also presents a novel system for Online Tracked Ultrasound Elastography (O-TRuE), which extends prior work on an offline method. By tracking the US probe with an electromagnetic (EM) tracker, the system selects in-plane radio frequency (RF) data frames for generating high quality elastograms. A novel method for evaluating the quality of an elastography output stream is presented, suggesting that O-TRuE generates more stable elastograms than generated by untracked, free-hand palpation. Since EM tracking cannot be used in all systems, an integration of real-time elastography and the da Vinci Surgical System is presented and evaluated for elastography stream quality based on our metric. The da Vinci surgical robot is outfitted with a laparoscopic US probe, and palpation motions are autonomously generated by customized software. It is found that a stable output stream can be achieved, which is affected by both the frequency and amplitude of palpation. The GPU framework is validated using data from in-vivo pig liver ablation; the generated elastography images identify the ablated region, outlined more clearly than in the corresponding B-mode US images. PMID:25541954
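The core operation in NCC elastography is estimating, for each window of the pre-compression signal, the shift that best aligns it with the post-compression signal. The sketch below shows that step for one window on a 1D signal. It is a serial illustration with hypothetical names; the window size and search range are free parameters, and the paper's multi-stream GPU implementation is not reproduced.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length sequences, in [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0  # flat windows carry no information

def best_shift(pre, post, start, window, max_shift):
    """Return (lag, score) maximizing NCC between pre[start:start+window]
    and the equally long window of `post` displaced by `lag` samples."""
    template = pre[start:start + window]
    best = (0, -2.0)
    for lag in range(-max_shift, max_shift + 1):
        lo = start + lag
        if lo < 0 or lo + window > len(post):
            continue  # shifted window would leave the signal
        score = ncc(template, post[lo:lo + window])
        if score > best[1]:
            best = (lag, score)
    return best
```

Each window is independent of the others, which is what makes the method map well onto many GPU threads. A larger window gives a more reliable correlation peak but costs more arithmetic per lag, which is consistent with the window-size trade-off the abstract studies.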
NASA Astrophysics Data System (ADS)
Berczik, Peter; Spurzem, Rainer; Wang, Long; Zhong, Shiyan; Huang, Siyi
2013-10-01
We present direct astrophysical N-body simulations with up to a few million bodies using our parallel MPI/CUDA code on large GPU clusters in China, Ukraine and Germany, with different kinds of GPU hardware. These clusters are directly linked under the Chinese Academy of Sciences special GPU cluster program in the cooperation of ICCS (International Center for Computational Science). We reach about half the peak Kepler K20 GPU performance for our φ-GPU code [2] in a real application scenario with individual hierarchical block time-steps, high (4th, 6th and 8th) order Hermite integration schemes and a real core-halo density structure of the modeled stellar systems. The code and hardware are mainly used to simulate star clusters [23, 24] and galactic nuclei with supermassive black holes [20], in which correlations between distant particles cannot be neglected.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Real-time generation of infrared ocean scene based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Zhaoyi; Wang, Xun; Lin, Yun; Jin, Jianqiu
2007-12-01
Infrared (IR) image synthesis for ocean scenes has become increasingly important, especially for remote sensing and military applications. Although a number of works present ready-to-use simulations, those techniques cover only a few of the possible ways in which water interacts with the environment, and the detailed calculation of ocean temperature has rarely been considered by previous investigators. With the advance of programmable features of graphics cards, many algorithms previously limited to offline processing have become feasible for real-time usage. In this paper, we propose an efficient algorithm for real-time rendering of infrared ocean scenes using the newest features of programmable graphics processors (GPUs). It differs from previous works in three aspects: adaptive GPU-based ocean surface tessellation, a sophisticated thermal balance equation for the ocean surface, and GPU-based rendering of the infrared ocean scene. Finally, some resulting infrared images are shown, which are in good accordance with real images.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
2014-01-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
Gallarno, George; Rogers, James H; Maxwell, Don E
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Key Techniques of Flat-Earth Phase Removal by Acceleration on the GPU
NASA Astrophysics Data System (ADS)
Gao, Zeng; Zeng, Qiming; Jiao, Jian; Cui, Xiai; Liang, Cunren
2013-01-01
Because InSAR processing is complex and time-consuming, parallel computing has been drawing more and more attention from researchers. GPUs (Graphics Processing Units) have become an increasingly important parallel platform for image processing in recent years. They are cheap and convenient compared with large-scale, expensive high performance computing clusters, which have a small marketplace presence. In this paper, a parallel implementation on the GPU is introduced. Taking flat-earth phase removal as an example, we introduce two different techniques that can accelerate applications dramatically on a GPU. The experimental results show that the output computed on the GPU is the same as that computed on the CPU, and that both techniques improve performance.
de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.
2014-01-01
Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPUs). Among these algorithms, the Firefly Algorithm (FA) is a recently proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with a multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms in the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms, is a more suitable choice and a relevant contribution to the variable selection problem. Additionally, the results demonstrated that the FA-MLR executed on a GPU can be five times faster than its sequential implementation. PMID:25493625
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
NASA Astrophysics Data System (ADS)
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness describes the random or irregular micro-topography of a surface. The standard deviation of surface height and the surface correlation length describe the statistical variation of the random component of surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problems can be solved effectively using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to the data collected by a surface roughness tester based on the laser triangulation principle during a field experiment in April 2012. The total number of data points was 52,040. The computation took 47 seconds on a Fermi GTX 590 GPU, whereas its serial CPU version took 5422 seconds, a speedup of about 115x.
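The two roughness parameters named in this abstract can be illustrated with a minimal CPU-only sketch. It assumes an evenly sampled 1D height profile and the common convention that the correlation length is the lag at which the normalized autocorrelation first falls to 1/e; the function names are hypothetical and this is not the authors' GPU code.

```python
import math

def rms_height(heights):
    """Standard deviation of surface heights about their mean (RMS height)."""
    n = len(heights)
    mean = sum(heights) / n
    return math.sqrt(sum((h - mean) ** 2 for h in heights) / n)

def correlation_length(heights, spacing):
    """Lag at which the normalized autocorrelation first falls to 1/e.

    Assumes an evenly sampled profile with sample spacing `spacing`.
    Returns None if the profile is flat or the autocorrelation never
    drops to 1/e within the available lags.
    """
    n = len(heights)
    mean = sum(heights) / n
    dev = [h - mean for h in heights]
    var = sum(d * d for d in dev)
    if var == 0:
        return None
    threshold = 1.0 / math.e
    prev = 1.0  # autocorrelation at lag 0
    for lag in range(1, n):
        # Each lag is an independent sum of products, which is why this
        # kind of calculation maps naturally onto many parallel threads.
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / var
        if rho <= threshold:
            # Linear interpolation between the previous lag and this one.
            frac = (prev - threshold) / (prev - rho)
            return (lag - 1 + frac) * spacing
        prev = rho
    return None
```

Every lag in the loop is independent of the others, so a parallel version could assign one lag (or one partial sum) to each thread and combine the results with a reduction.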
High Performance Molecular Dynamic Simulation on Single and Multi-GPU Systems
Villa, Oreste; Chen, Long; Krishnamoorthy, Sriram
2010-05-30
The programming techniques supported and employed on these GPU and multi-GPU systems are not sufficient to address problems exhibiting irregular and unbalanced workloads, such as Molecular Dynamics (MD) simulations of systems with non-uniform densities. In this paper, we propose a task-based dynamic load-balancing solution for MD simulations on single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in existing APIs such as NVIDIA’s CUDA. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
Fast GPU-based calculations in few-body quantum scattering
NASA Astrophysics Data System (ADS)
Pomerantsev, V. N.; Kukulin, V. I.; Rubtsova, O. A.; Sakhiev, S. K.
2016-07-01
A fundamentally new approach to solving few-particle (many-dimensional) quantum scattering problems is described. The approach is based on a complete discretization of the few-particle continuum and the use of massively parallel GPU computations of the integral kernels of the scattering equations. The discretization of the continuous spectrum of the few-particle Hamiltonian is realized by projecting all scattering operators and wave functions onto the stationary wave-packet basis. This projection procedure replaces singular multidimensional integral equations with linear matrix equations having finite matrix elements. Different aspects of the use of multithreaded GPU computing for fast calculation of the matrix kernel of the equation are studied in detail. As a result, the fully realistic three-body scattering problem above the break-up threshold is solved on an ordinary desktop PC with a GPU in a rather small computational time.
A GPU-accelerated toolbox for the solutions of systems of linear equations
NASA Astrophysics Data System (ADS)
Humphrey, John R., Jr.; Paolini, Aaron L.; Price, Daniel K.; Kelmelis, Eric J.
2009-05-01
The modern graphics processing unit (GPU) found in many off-the-shelf personal computers is a very high performance computing engine that often goes unutilized. The tremendous computing power coupled with reasonable pricing has made the GPU a topic of interest in recent research. An application for such power would be the solution of large systems of linear equations. Two popular solution domains are direct solution, via the LU decomposition, and iterative solution, via a solver such as the Generalized Minimal Residual method (GMRES). Our research focuses on the acceleration of such processes, utilizing the latest in GPU technologies. We show performance that exceeds that of a standard computer by an order of magnitude, thus significantly reducing the run time of the numerous applications that depend on the solution of a set of linear equations.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
NASA Astrophysics Data System (ADS)
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design was optimized and the load was shared between the vertex and fragment shaders. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR against a MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the processing using MATLAB (i.e., PSNR<52.51 dB). Comparable CNR results were obtained from both processing methods (i.e., 11.31). With the mobile GPU implementation, a frame rate of 57.6 Hz was achieved. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
GPU-accelerated 3D neutron diffusion code based on finite difference method
Xu, Q.; Yu, G.; Wang, K.
2012-07-01
The finite difference method, a traditional numerical solution to the neutron diffusion equation, is considered simpler and more precise than the coarse mesh nodal methods, but its wide application is bottlenecked by the huge memory and unendurable computation time it requires. In recent years, the concept of General-Purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Pre-conditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 times was obtained, using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was accelerated using the SOR method and the Chebyshev extrapolation technique. (authors)
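To make the finite difference idea concrete, here is a deliberately reduced sketch: a one-group, one-dimensional, fixed-source diffusion problem solved by Jacobi iteration. The abstract describes a multi-group 3D code, so the dimensionality, the fixed source, the zero-flux boundaries, the Jacobi solver and the function name are all simplifying assumptions made for illustration only.

```python
def solve_diffusion_1d(n, h, D, sigma_a, source, max_iters=20000, tol=1e-12):
    """Jacobi iteration for -D*phi'' + sigma_a*phi = source.

    n       : number of interior mesh points
    h       : mesh spacing
    D       : diffusion coefficient (assumed constant)
    sigma_a : absorption cross section (assumed constant, > 0)
    source  : list of n source values
    The flux phi is taken as zero on both boundaries.
    """
    phi = [0.0] * n
    off = D / h ** 2             # coupling to each neighbour
    diag = 2.0 * off + sigma_a   # diagonal of the difference operator
    for _ in range(max_iters):
        new = []
        for i in range(n):
            left = phi[i - 1] if i > 0 else 0.0
            right = phi[i + 1] if i < n - 1 else 0.0
            # Every point uses only values from the previous sweep, so all
            # points could be updated at the same time.
            new.append((source[i] + off * (left + right)) / diag)
        change = max(abs(a - b) for a, b in zip(new, phi))
        phi = new
        if change < tol:
            break
    return phi
```

Because each mesh point is updated from the previous sweep only, the inner loop has no data dependencies between points, which is the property that makes finite difference sweeps a natural fit for one-thread-per-point GPU kernels.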
A 3D front tracking method on a CPU/GPU system
Bo, Wurigen; Grove, John
2011-01-21
We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the two-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been included together with shared memory approaches that take advantage of multi-core platforms. The second implementation, called the multi-GPU code version, is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement obtained with distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it has been demonstrated that the addition of shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case GPU computing is about twice as fast as the fine-tuned CPU version in both cases, one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since the memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version.
New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock.
Asadchev, Andrey; Gordon, Mark S
2012-11-13
In this article, a new multithreaded Hartree-Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code. PMID:26605582
A GPU-based approach to compute the brain shift using a fully nonlinear biomechanical model
NASA Astrophysics Data System (ADS)
Tian, Ye; Shen, Xukun
2015-07-01
We present a brain shift prediction algorithm based on multiple GPUs. Our algorithm can handle more complex mesh models. The proposed algorithm has two main aspects. First, we present a grid-based algorithm for the generation of hexahedral element meshes, which can automatically produce complex, high-quality hexahedral meshes without manual operation. Second, we implement a non-linear brain shift prediction algorithm designed to run on a GPU server. The experimental results demonstrate that the proposed algorithm can quickly compute results on complex meshes.
Real-time high definition H.264 video decode using the Xbox 360 GPU
NASA Astrophysics Data System (ADS)
Arevalo Baeza, Juan Carlos; Chen, William; Christoffersen, Eric; Dinu, Daniel; Friemel, Barry
2007-09-01
The Xbox 360 is powered by three dual pipeline 3.2 GHz IBM PowerPC processors and a 500 MHz ATI graphics processing unit. The Graphics Processing Unit (GPU) is a special-purpose device, intended to create advanced visual effects and to render realistic scenes for the latest Xbox 360 games. In this paper, we report work on using the GPU as a parallel processing unit to accelerate the decoding of H.264/AVC high-definition (1920x1080) video. We report our experiences in developing a real-time, software-only high-definition video decoder for the Xbox 360.
GPU acceleration experience with RRTMG long wave radiation model
NASA Astrophysics Data System (ADS)
Price, Erik; Mielikainen, Jarno; Huang, Bormin; Huang, HungLung A.; Lee, Tsengdar
2013-10-01
in many weather forecast and climate models. RRTMG_LW is in operational use in the ECMWF weather forecast system, the NCEP global forecast system, the ECHAM5 climate model, the Community Earth System Model (CESM) and the Weather Research and Forecasting (WRF) model. RRTMG_LW has also been evaluated for use in the GFDL climate model. In this paper, we examine the feasibility of using graphics processing units (GPUs) to accelerate RRTMG_LW as used by WRF. GPUs can provide a substantial improvement in RRTMG speed by supporting the parallel computation of large numbers of independent radiative calculations. Furthermore, using commodity GPUs to accelerate RRTMG_LW yields much higher computational performance at a lower price point than traditional CPUs, and power and cooling costs are significantly reduced. A GPU-compatible version of RRTMG was implemented and thorough testing was performed to ensure that the original level of accuracy is retained. Our results show that GPUs can provide significant speedup over conventional CPUs. In particular, Nvidia's GTX 680 GPU card can provide a speedup of 69x compared to its single-threaded Fortran counterpart running on an Intel Xeon E5-2603 CPU.
Complex fluid flow modeling with SPH on GPU
NASA Astrophysics Data System (ADS)
Bilotta, Giuseppe; Hérault, Alexis; Del Negro, Ciro; Russo, Giovanni; Vicari, Annamaria
2010-05-01
SPH meshless method. In comparison to other particle methods, SPH also provides additional benefits such as the automatic preservation of mass. The direct computation of most physical quantities (e.g. pressure) without resorting to large, sparse implicit systems makes SPH particularly favorable to implementation on highly parallel computational hardware such as modern video cards. The graphics processing units (GPUs) on modern video cards often surpass the computational power of the CPUs that drive them. The CUDA architecture, introduced by NVIDIA in the spring of 2007, allows generic GPU programming with an extension of the C language, making it easy to write highly parallelized code. Our lava simulation model uses the SPH method with a pure GPU implementation in CUDA to achieve high computational performance, modeling both the dynamic and thermal aspects of a lava flow. The dynamic parts of the SPH algorithms are based on those of the SPHysics simulator, enhanced to include the treatment of non-Newtonian fluids, the integration of thermal effects including temperature-dependent rheological parameters, and an optimal handling of large-scale natural topographies. For the non-Newtonian rheologies priority is given to the power law recently brought to light by physical modeling of lava flows. For the thermal part of the model, the SPH model has been compared with classical finite elements to simulate a lava lake solidification, a problem for which an analytical solution is known. The comparison shows the significantly higher accuracy of the SPH method in proximity of the contact area of two or more solidification fronts.
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
NASA Astrophysics Data System (ADS)
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
GPU-based relative fuzzy connectedness image segmentation
Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-15
Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
GPU-based relative fuzzy connectedness image segmentation
Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-01
Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094
Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-01-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911
Streaming parallel GPU acceleration of large-scale filter-based spiking neural networks.
Slażyński, Leszek; Bohte, Sander
2012-01-01
The arrival of graphics processing (GPU) cards suitable for massively parallel computing promises affordable large-scale neural network simulation previously only available at supercomputing facilities. While the raw numbers suggest that GPUs may outperform CPUs by at least an order of magnitude, the challenge is to develop fine-grained parallel algorithms to fully exploit the particulars of GPUs. Computation in a neural network is inherently parallel and thus a natural match for GPU architectures: given inputs, the internal state for each neuron can be updated in parallel. We show that for filter-based spiking neurons, like the Spike Response Model, the additive nature of membrane potential dynamics enables additional update parallelism. This also reduces the accumulation of numerical errors when using single precision computation, the native precision of GPUs. We further show that optimizing simulation algorithms and data structures to the GPU's architecture has a large pay-off: for example, matching iterative neural updating to the memory architecture of the GPU speeds up this simulation step by a factor of three to five. With such optimizations, we can simulate in better-than-realtime plausible spiking neural networks of up to 50 000 neurons, processing over 35 million spiking events per second. PMID:23098420
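The additive structure mentioned in this abstract can be shown with a small sketch. It assumes a plain exponential post-synaptic filter and hypothetical function names; it illustrates why additive, filter-based potentials allow per-synapse parallelism, and is not the authors' simulator or their streaming algorithm.

```python
import math

def kernel(s, tau=10.0):
    """Assumed exponential post-synaptic filter; zero before the spike arrives."""
    return math.exp(-s / tau) if s >= 0 else 0.0

def synapse_contribution(weight, spike_times, t, tau=10.0):
    """Filtered input from one synapse at time t, independent of all other synapses."""
    return weight * sum(kernel(t - ts, tau) for ts in spike_times)

def membrane_potential(weights, spike_trains, t, tau=10.0):
    """Membrane potential as a plain sum of per-synapse terms.

    Because the terms only add, they can be evaluated in any order, or in
    parallel, and then combined with a reduction.
    """
    return sum(
        synapse_contribution(w, spikes, t, tau)
        for w, spikes in zip(weights, spike_trains)
    )
```

For example, `membrane_potential([1.0, -0.5], [[0.0], [0.0]], 0.0)` evaluates one excitatory and one inhibitory synapse that both spiked at time zero; each term could equally be computed by a separate thread before the final sum.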
NASA Astrophysics Data System (ADS)
Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere, whose character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using the massively parallel graphics processing unit (GPU) computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core when two K40 GPUs are used.
A GPU-accelerated flow solver for incompressible two-phase fluid flows
NASA Astrophysics Data System (ADS)
Codyer, Stephen; Raessi, Mehdi; Khanna, Gaurav
2011-11-01
We present a numerical solver for incompressible, immiscible, two-phase fluid flows that is accelerated by using Graphics Processing Units (GPUs). The Navier-Stokes equations are solved by the projection method, which involves solving a pressure Poisson problem at each time step. A second-order discretization of the Poisson problem leads to a sparse matrix with five and seven diagonals for two- and three-dimensional simulations, respectively. Running a serial linear algebra solver on a single CPU can take 50-99.9% of the total simulation time to solve the above system for pressure. To remove this bottleneck, we utilized the large parallelization capabilities of GPUs; we developed a linear algebra solver based on the conjugate gradient iterative method (CGIM) by using CUDA 4.0 libraries and compared its performance with CUSP, an open-source GPU library for linear algebra. Compared to running the CGIM solver on a single CPU core, for a 2D case, our GPU solver yields speedups of up to 88x in solver time and 81x in overall time on a single GPU card. In 3D cases, the speedups are up to 81x (solver) and 15x (overall). The speedup is greater at higher grid resolutions, and our GPU solver outperforms CUSP. Current work examines the acceleration versus a parallel CGIM CPU solver.
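The five-diagonal system and the conjugate gradient iteration described in this abstract can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions: a uniform grid, zero values outside the domain (homogeneous Dirichlet boundary), the grid spacing folded into the right-hand side, and hypothetical function names. It is not the authors' CUDA solver.

```python
def apply_poisson_2d(p, nx, ny):
    """Matrix-free product A*p for the five-point 2D Poisson matrix.

    Unknowns are stored row by row. A has 4 on the main diagonal and -1 on
    four off-diagonals (the negative discrete Laplacian), which makes it
    symmetric positive definite as conjugate gradient requires.
    """
    out = [0.0] * (nx * ny)
    for j in range(ny):
        for i in range(nx):
            k = j * nx + i
            s = 4.0 * p[k]
            if i > 0:
                s -= p[k - 1]
            if i < nx - 1:
                s -= p[k + 1]
            if j > 0:
                s -= p[k - nx]
            if j < ny - 1:
                s -= p[k + nx]
            out[k] = s
    return out

def dot(a, b):
    """Inner product; on a GPU this would be a parallel reduction."""
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(b, nx, ny, tol=1e-10, max_iter=1000):
    """Solve A*x = b; returns the solution and the iteration count used."""
    x = [0.0] * len(b)
    r = list(b)   # residual b - A*x for the zero initial guess
    d = list(r)   # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return x, it
        Ad = apply_poisson_2d(d, nx, ny)
        alpha = rr / dot(d, Ad)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ad)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x, max_iter
```

Each iteration consists of one stencil application, two inner products and three vector updates, all of which are data-parallel; that is the reason this solver family is a common choice for GPU acceleration.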
GPU-based ultra-fast direct aperture optimization for online adaptive radiation therapy
NASA Astrophysics Data System (ADS)
Men, Chunhua; Jia, Xun; Jiang, Steve B.
2010-08-01
Online adaptive radiation therapy (ART) has great promise to significantly reduce normal tissue toxicity and/or improve tumor control through real-time treatment adaptations based on the current patient anatomy. However, the major technical obstacle for clinical realization of online ART, namely the inability to achieve real-time efficiency in treatment re-planning, has yet to be solved. To overcome this challenge, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) direct aperture optimization (DAO) algorithm on the graphics processing unit (GPU) based on our previous work on the CPU. We formulate the DAO problem as a large-scale convex programming problem, and use an exact method called the column generation approach to deal with its extremely large dimensionality on the GPU. Five 9-field prostate and five 5-field head-and-neck IMRT clinical cases with 5 × 5 mm² beamlet size and 2.5 × 2.5 × 2.5 mm³ voxel size were tested to evaluate our algorithm on the GPU. It takes only 0.7-3.8 s for our implementation to generate high-quality treatment plans on an NVIDIA Tesla C1060 GPU card. Our work has therefore solved a major problem in developing ultra-fast (re-)planning technologies for online ART.
NASA Astrophysics Data System (ADS)
Trigueros-Espinosa, Blas; Vélez-Reyes, Miguel; Santiago-Santiago, Nayda G.; Rosario-Torres, Samuel
2011-06-01
Hyperspectral sensors can collect hundreds of images taken at different narrow and contiguously spaced spectral bands. This high-resolution spectral information can be used to identify materials and objects within the field of view of the sensor by their spectral signature, but this process may be computationally intensive due to the large data sizes generated by the hyperspectral sensors, typically hundreds of megabytes. This can be an important limitation for some applications where the detection process must be performed in real time (surveillance, explosive detection, etc.). In this work, we developed a parallel implementation of three state-of-the-art target detection algorithms (RX algorithm, matched filter and adaptive matched subspace detector) using a graphics processing unit (GPU) based on the NVIDIA® CUDA™ architecture. In addition, a multi-core CPU-based implementation of each algorithm was developed to be used as a baseline for the speedup estimation. We evaluated the performance of the GPU-based implementations using an NVIDIA® Tesla® C1060 GPU card, and the detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We achieved a maximum speedup in the GPU implementations of around 20x over a multicore CPU-based implementation, which suggests that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.
Computational performance of a two-dimensional flood model in single and multiple GPU frameworks
NASA Astrophysics Data System (ADS)
Dullo, Tigstu; Kalyanapu, Alfred; Ghafoor, Sheikh; Anantharaj, Valentine; Marshall, Ryan; Tatarczuk, Joe; Shih-Chieh, Kao
2015-04-01
The objective of this study is to investigate the computational performance and accuracy of multiple implementations of a 2D flood model called Flood2D-GPU: i) on a single GPU and ii) on multiple GPUs. The model is based on the shallow water equations (SWE) and uses an upwind finite-difference numerical formulation to simulate flood events. The GPU-based implementation has been developed using NVIDIA's Compute Unified Device Architecture (CUDA) programming model. The increase in computational performance would permit simulation of larger domain sizes, more refined spatial and temporal resolutions, and more simulations (ensembles). In addition to HPC platforms, all implementations of the model are developed within a geographic information system (GIS) environment for both preprocessing and postprocessing of spatial datasets (e.g. topography, land use/land cover, etc.). For this study, these implementations are being applied to simulate a dam break event at the Taum Sauk pump-storage hydro-electric power plant in Missouri, which occurred on December 14, 2005. A single GPU implementation provides a significant speedup, up to two orders of magnitude compared to a CPU version of the model. We will discuss the computational approaches for multiple GPUs, and the benchmarking results from the set of dam break simulations.
Accelerating DynEarthSol3D on tightly coupled CPU-GPU heterogeneous processors
NASA Astrophysics Data System (ADS)
Ta, Tuan; Choo, Kyoshin; Tan, Eh; Jang, Byunghyun; Choi, Eunseo
2015-06-01
DynEarthSol3D (Dynamic Earth Solver in Three Dimensions) is a flexible, open-source finite element solver that models the momentum balance and the heat transfer of elasto-visco-plastic material in the Lagrangian form using unstructured meshes. It provides a platform for the study of the long-term deformation of earth's lithosphere and various problems in civil and geotechnical engineering. However, the continuous computation and update of a very large mesh poses an intolerably high computational burden to developers and users in practice. For example, simulating a small input mesh containing around 3000 elements in 20 million time steps would take more than 10 days on a high-end desktop CPU. In this paper, we explore tightly coupled CPU-GPU heterogeneous processors to address the computing concern by leveraging their new features and developing hardware-architecture-aware optimizations. Our proposed key optimization techniques are three-fold: memory access pattern improvement, data transfer elimination and kernel launch overhead minimization. Experimental results show that our proposed implementation on a tightly coupled heterogeneous processor outperforms all other alternatives, including traditional discrete GPU, quad-core CPU using OpenMP, and serial implementations, by 67%, 50%, and 154% respectively, even though the embedded GPU in the heterogeneous processor has significantly fewer cores than a high-end discrete GPU.
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
NASA Technical Reports Server (NTRS)
Putnam, Williama
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: the large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to 1-kilometer (km) global resolutions; intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs; and long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher. After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
NASA Astrophysics Data System (ADS)
Norman, Paul; Valentini, Paolo; Schwartzentruber, Thomas
2013-08-01
In this work we outline a Classical Trajectory Calculation Direct Simulation Monte Carlo (CTC-DSMC) implementation that uses the no-time-counter scheme with a cross-section determined by the interatomic potential energy surface (PES). CTC-DSMC solutions for translational and rotational relaxation in one-dimensional shock waves are compared directly to pure Molecular Dynamics simulations employing an identical PES, where exact agreement is demonstrated for all cases. For the flows considered, long-lived collisions occur within the simulations and their implications for multi-body collisions as well as algorithm implications for the CTC-DSMC method are discussed. A parallelization technique for CTC-DSMC simulations using a heterogeneous multicore CPU/GPU system is demonstrated. Our approach shows good scaling as long as a sufficiently large number of collisions are calculated simultaneously per GPU (~100,000) at each DSMC iteration. We achieve a maximum speedup of 140× on a 4 GPU/CPU system vs. the performance on one CPU core in serial for a diatomic nitrogen shock. The parallelization approach presented here significantly reduces the cost of CTC-DSMC simulations and has the potential to scale to large CPU/GPU clusters, which could enable future application to 3D flows in strong thermochemical nonequilibrium.
Fast 2D flood modelling using GPU technology - recent applications and new developments
NASA Astrophysics Data System (ADS)
Crossley, Amanda; Lamb, Rob; Waller, Simon; Dunning, Paul
2010-05-01
In recent years there has been considerable interest amongst scientists and engineers in exploiting the potential of commodity graphics hardware for desktop parallel computing. The Graphics Processing Units (GPUs) that are used in PC graphics cards have now evolved into powerful parallel co-processors that can be used to accelerate the numerical codes used for floodplain inundation modelling. We report in this paper on experience over the past two years in developing and applying two-dimensional (2D) flood inundation models using GPUs to achieve significant practical performance benefits. Starting with a solution scheme for the 2D diffusion wave approximation to the 2D Shallow Water Equations (SWEs), we have demonstrated the capability to reduce model run times in 'real-world' applications using GPU hardware and programming techniques. We then present results from a GPU-based 2D finite volume SWE solver. A series of numerical test cases demonstrate that the model produces outputs that are accurate and consistent with reference results published elsewhere. In comparisons conducted for a real-world test case, the GPU-based SWE model was over 100 times faster than the CPU version. We conclude with some discussion of practical experience in using the GPU technology for flood mapping applications, and for research projects investigating use of Monte Carlo simulation methods for the analysis of uncertainty in 2D flood modelling.
Accelerating Satellite Image Based Large-Scale Settlement Detection with GPU
Patlolla, Dilip Reddy; Cheriyadat, Anil M; Weaver, Jeanette E; Bright, Eddie A
2012-01-01
Computer vision algorithms for image analysis are often computationally demanding. Application of such algorithms on large image databases, such as the high-resolution satellite imagery covering the entire land surface, can easily saturate the computational capabilities of conventional CPUs. There is a great demand for vision algorithms running on high performance computing (HPC) architecture capable of processing petascale image data. We exploit the parallel processing capability of GPUs to present a GPU-friendly algorithm for robust and efficient detection of settlements from large-scale high-resolution satellite imagery. Feature descriptor generation is an expensive but key step in automated scene analysis. To address this challenge, we present GPU implementations for three different feature descriptors: multiscale Histogram of Oriented Gradients (HOG), Gray Level Co-Occurrence Matrix (GLCM) Contrast and local pixel intensity statistics. We perform extensive experimental evaluations of our implementation using diverse and large image datasets. Our GPU implementation of the feature descriptor algorithms results in speedups of 220 times compared to the CPU version. We present a highly efficient settlement detection system running on a multi-GPU architecture capable of extracting human settlement regions from city-scale, sub-meter spatial resolution aerial imagery spanning roughly 1200 sq. kilometers in just 56 seconds with detection accuracy close to 90%. This remarkable speedup gained by our vision algorithm while maintaining high detection accuracy demonstrates that such computational advancements hold the solution for petascale image analysis challenges.
GPU-Based Visualization of 3D Fluid Interfaces using Level Set Methods
NASA Astrophysics Data System (ADS)
Kadlec, B. J.
2009-12-01
We model a simple 3D fluid-interface problem using the level set method and visualize the interface as a dynamic surface. Level set methods allow implicit handling of complex topologies deformed by evolutions where sharp changes and cusps are present without destroying the representation. We present a highly optimized visualization and computation algorithm that is implemented in CUDA to run on the NVIDIA GeForce 295 GTX. CUDA is a general purpose parallel computing architecture that allows the NVIDIA GPU to be treated like a data parallel supercomputer in order to solve many computational problems in a fraction of the time required on a CPU. CUDA is compared to the new OpenCL™ (Open Computing Language), which is designed to run on heterogeneous computing environments but does not take advantage of low-level features in NVIDIA hardware that provide significant speedups. Therefore, our technique is implemented using CUDA and results are compared to a single CPU implementation to show the benefits of using the GPU and CUDA for visualizing fluid-interface problems. We solve a 1024^3 problem and experience significant speedup using the NVIDIA GeForce 295 GTX. Implementation details for mapping the problem to the GPU architecture are described as well as discussion on porting the technique to heterogeneous devices (AMD, Intel, IBM) using OpenCL. The results present a new interactive system for computing and visualizing the evolution of fluid interface problems on the GPU.
Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
2012-09-11
In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials. PMID:26605718
GPU Based Fast Free-Wake Calculations For Multiple Horizontal Axis Wind Turbine Rotors
NASA Astrophysics Data System (ADS)
Türkal, M.; Novikov, Y.; Üşenmez, S.; Sezer-Uzol, N.; Uzol, O.
2014-06-01
Unsteady free-wake solutions of wind turbine flow fields involve computationally intensive interaction calculations, which generally limit the total amount of simulation time or the number of turbines that can be simulated by the method. This problem, however, can be addressed easily using a high level of parallelization. Especially when exploited with a GPU, a Graphics Processing Unit, this property can provide a significant computational speed-up, rendering the most intensive engineering problems realizable in hours of computation time. This paper presents the results of the simulation of the flow field for the NREL Phase VI turbine using a GPU-based in-house free-wake panel method code. Computational parallelism involved in the free-wake methodology is exploited using a GPU, allowing thousands of similar operations to be performed simultaneously. The results are compared to experimental data as well as to those obtained by running a corresponding CPU-based code. Results show that the GPU-based code is capable of producing wake and load predictions similar to the CPU-based code and in a substantially reduced amount of time. This capability could allow free-wake based analysis to be used in the possible design and optimization studies of wind farms as well as prediction of multiple turbine flow fields and the investigation of the effects of using different vortex core models, core expansion and stretching models on the turbine rotor interaction problems in multiple turbine wake flow fields.
PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs
Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.
2014-09-01
We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.
A GPU-accelerated direct-sum boundary integral Poisson-Boltzmann solver
NASA Astrophysics Data System (ADS)
Geng, Weihua; Jacob, Ferosh
2013-06-01
In this paper, we present a GPU-accelerated direct-sum boundary integral method to solve the linear Poisson-Boltzmann (PB) equation. In our method, a well-posed boundary integral formulation is used to ensure the fast convergence of Krylov subspace based linear algebraic solver such as the GMRES. The molecular surfaces are discretized with flat triangles and centroid collocation. To speed up our method, we take advantage of the parallel nature of the boundary integral formulation and parallelize the schemes within CUDA shared memory architecture on GPU. The schemes use only 11N+6Nc size-of-double device memory for a biomolecule with N triangular surface elements and Nc partial charges. Numerical tests of these schemes show well-maintained accuracy and fast convergence. The GPU implementation using one GPU card (Nvidia Tesla M2070) achieves 120-150X speed-up to the implementation using one CPU (Intel L5640 2.27 GHz). With our approach, solving PB equations on well-discretized molecular surfaces with up to 300,000 boundary elements will take less than about 10 min, hence our approach is particularly suitable for fast electrostatics computations on small to medium biomolecules.
Exploring Fine-Grained Task-based Execution on Multi-GPU Systems
Chen, Long; Villa, Oreste; Gao, Guang R.
2011-09-25
Many-core Graphics Processing Units (GPUs) have been utilized as the computation engine in many scientific fields due to the high peak performance, cost effectiveness, and the availability of user friendly programming environments, e.g., NVIDIA CUDA. However, the conventional data parallel GPU programming paradigm cannot satisfactorily address issues such as load balancing and GPU resource utilization due to the irregular and unbalanced workload patterns exhibited in some applications. In this paper, we explore the design space of task-based solutions for multi-GPU systems. By employing finer-grained tasks than what is supported in the current CUDA, and allowing task sharing, our solutions enable dynamic load balancing. We evaluate our solutions with a Molecular Dynamics application with different atom distributions (from uniform distribution to highly non-uniform distribution). Experimental results obtained on a 4-GPU system show that, for non-uniform distributed systems, our solutions achieve excellent speedup, and significant performance improvement over other solutions based on the standard CUDA APIs.
GPU computing with Kaczmarz’s and other iterative algorithms for linear systems
Elble, Joseph M.; Sahinidis, Nikolaos V.; Vouzis, Panagiotis
2009-01-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz’s, Cimmino’s, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. PMID:20526446
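For readers unfamiliar with the row-projection methods named in this abstract, the following is a minimal serial sketch of the classical cyclic Kaczmarz iteration, which projects the current iterate onto the hyperplane of one equation at a time. It is not the paper's GPU implementation; the function name, the dense list-of-lists storage, and the fixed sweep count are assumptions chosen for brevity.

```python
def kaczmarz(A, b, sweeps=200, relax=1.0):
    """Cyclic Kaczmarz iteration for A x = b.

    A is a list of rows, b a list of right-hand sides. `relax` is the
    relaxation parameter (1.0 gives an exact projection per row).
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, b_i in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # skip empty rows, which carry no constraint
            residual = b_i - sum(a * xj for a, xj in zip(row, x))
            step = relax * residual / norm_sq
            # Move x along the row direction onto (or toward) the hyperplane.
            x = [xj + step * a for xj, a in zip(x, row)]
    return x


# Small consistent system whose exact solution is (1/11, 7/11).
solution = kaczmarz([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each row update depends on the result of the previous one, so the basic method is sequential. Variants such as Cimmino's method and component averaging, also listed in the abstract, compute the row projections simultaneously and average them, which is easier to parallelize.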
Modeling of Parachute Dynamics with GPU Enhanced Continuum Fabric Model and Front Tracking Method
NASA Astrophysics Data System (ADS)
Shi, Qiangqiang
An advanced mesoscale spring-mass model is used to mimic fabric surface motion. The fabric surface is represented by a high-quality triangular surface mesh. Both the tensile stiffness and the angular stiffness of each spring are determined by the material's Young's modulus and Poisson ratio, as well as the geometrical characteristics of the surface mesh. The spring-mass system is a nonlinear Ordinary Differential Equation (ODE) system solved by the fourth-order Runge-Kutta method. The model is shown to be numerically convergent under the constraint that the sum of the point masses is constant. Through coupling with an incompressible fluid solver and the front tracking method, the spring-mass model is applied to the simulation of the dynamic phenomenon of parachute inflation. Complex validation simulations conclude the effort via drag force comparisons with experiments. Three applications of Graphics Processing Unit (GPU)-based algorithms for high performance computation of mathematical models were reported. Using one GPU device to solve the spring-mass system, we achieved a 6x speedup. In the second set of simulations, the system of one-dimensional gas dynamics equations is solved by the Weighted Essentially Non-Oscillatory (WENO) scheme; the GPU code is 7-20x faster than the pure CPU code. In the last case, a GPU-enhanced numerical algorithm for American option pricing under the generalized hyperbolic distribution is studied. We achieved a 2x speedup for pricing a single option and a 400x speedup for multiple options.
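The time integration mentioned in this abstract can be illustrated with a classical fourth-order Runge-Kutta step. The sketch below is deliberately simplified: it integrates a single point mass on one linear spring rather than the nonlinear fabric mesh with tensile and angular stiffness, and all names and parameter values are assumptions for the example.

```python
import math


def rk4_step(f, t, y, h):
    """One classical RK4 step for y' = f(t, y), with y a list of floats."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def spring_mass(stiffness, mass):
    """Right-hand side for state y = [position, velocity] (linear spring)."""
    def f(t, y):
        position, velocity = y
        return [velocity, -stiffness / mass * position]
    return f


def integrate(f, y0, t_end, steps):
    """Advance y0 from t = 0 to t_end using a fixed number of RK4 steps."""
    h = t_end / steps
    t, y = 0.0, list(y0)
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


# With stiffness = mass = 1 the period is 2*pi, so after one period the
# state should return close to the initial condition [1, 0].
final_state = integrate(spring_mass(1.0, 1.0), [1.0, 0.0], 2 * math.pi, 1000)
```

In a mesh with many masses, each stage evaluates the forces on every node independently given the current positions, which is the part of such a scheme that one would expect to offload to a GPU.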
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster
Allada, Veerendra; Benjegerdes, Troy; Bode, Brett
2009-08-31
Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subprograms (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
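To make the GEMM operation in this abstract concrete, here is a reference sketch of the standard definition C ← αAB + βC, together with the usual 2mnk floating-point operation count used when reporting GEMM throughput. This is a plain triple loop for illustration only, not CUBLAS or MKL; the function names are hypothetical.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for list-of-lists matrices.

    A is m-by-k, B is k-by-n, C is m-by-n.
    """
    m, k, n = len(A), len(B), len(B[0])
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def gemm_flops(m, n, k):
    """Conventional operation count: one multiply and one add per term."""
    return 2 * m * n * k


result = gemm(2.0,
              [[1.0, 2.0], [3.0, 4.0]],
              [[5.0, 6.0], [7.0, 8.0]],
              3.0,
              [[1.0, 1.0], [1.0, 1.0]])
# result == [[41.0, 47.0], [89.0, 103.0]]
```

Because the arithmetic grows as mnk while the data moved grows only as mk + kn + mn, large GEMM calls can amortize the host-to-device copy cost that the abstract highlights, whereas small ones may not.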
GPU.proton.DOCK: Genuine Protein Ultrafast proton equilibria consistent DOCKing.
Kantardjiev, Alexander A
2011-07-01
GPU.proton.DOCK (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state-of-the-art service for in silico prediction of protein-protein interactions via rigorous and ultrafast docking code. It is unique in providing a stringent account of electrostatic interactions self-consistency and proton equilibria mutual effects of docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms, a step toward more reliable and high-accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging physics of the interactions with analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to the interface, which is intuitive and user-friendly. The input consists of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for addition of non-polypeptide charges, extra ionogenic groups with intrinsic pK(a) values or fixed ions. The output consists of docked complexes in PDB format as well as interactive visualization in a molecular viewer. GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/. PMID:21666258
High performance MRI simulations of motion on multi-GPU systems
2014-01-01
Background: MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Methods: Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Results: Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. Conclusions: MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer
Fast GPU-based computation of spatial multigrid multiframe LMEM for PET.
Nassiri, Moulay Ali; Carrier, Jean-François; Després, Philippe
2015-09-01
Significant efforts were invested during the last decade to accelerate PET list-mode reconstructions, notably with GPU devices. However, the computation time per event is still relatively long, and the list-mode efficiency on the GPU is well below the histogram-mode efficiency. Since list-mode data are not arranged in any regular pattern, costly accesses to the GPU global memory can hardly be optimized and geometrical symmetries cannot be used. To overcome obstacles that limit the acceleration of reconstruction from list-mode on the GPU, a multigrid and multiframe approach of an expectation-maximization algorithm was developed. The reconstruction process is started during data acquisition, and calculations are executed concurrently on the GPU and the CPU, while the system matrix is computed on-the-fly. A new convergence criterion was also introduced, which is computationally more efficient on the GPU. The implementation was tested on a Tesla C2050 GPU device for a Gemini GXL PET system geometry. The results show that the proposed algorithm (multigrid and multiframe list-mode expectation-maximization, MGMF-LMEM) converges to the same solution as the LMEM algorithm more than three times faster. The execution time of the MGMF-LMEM algorithm was 1.1 s per million events on the Tesla C2050 hardware used, for a reconstructed space of 188 x 188 x 57 voxels of 2 x 2 x 3.15 mm³. For 17- and 22-mm simulated hot lesions, the MGMF-LMEM algorithm led on the first iteration to contrast recovery coefficients (CRC) of more than 75% of the maximum CRC while achieving a minimum in the relative mean square error. Therefore, the MGMF-LMEM algorithm can be used as a one-pass method to perform real-time reconstructions for low-count acquisitions, as in list-mode gated studies. The computation time for one iteration and 60 million events was approximately 66 s. PMID:25850980
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
NASA Astrophysics Data System (ADS)
Rattermann, Dale Nicholas
Fast Poisson solvers using the Fast Fourier Transform on uniform grids are especially suited for parallel implementation, making them appropriate for portability on graphical processing unit (GPU) devices. The goal of the following work was to implement, test, and evaluate a fast Poisson solver for periodic boundary conditions for use on a variety of GPU configurations. The solver used in this research was FLASH, an immersed-boundary-based method, which is well suited for complex, time-dependent geometries, has robust adaptive mesh refinement/de-refinement capabilities to capture evolving flow structures, and has been successfully implemented on conventional, parallel supercomputers. However, these solvers are still computationally costly to employ, and the total solver time is dominated by the solution of the pressure Poisson equation using state-of-the-art multigrid methods. FLASH improves the performance of its multigrid solvers by integrating a parallel FFT solver on a uniform grid during a coarse level. This hybrid solver could then be theoretically improved by replacing the highly-parallelizable FFT solver with one that utilizes GPUs, and, thus, was the motivation for my research. In the present work, the CPU-utilizing parallel FFT solver (PFFT) used in the base version of FLASH for solving the Poisson equation on uniform grids has been modified to enable parallel execution on CUDA-enabled GPU devices. New algorithms have been implemented to replace the Poisson solver that decompose the computational domain and send each new block to a GPU for parallel computation. One-dimensional (1-D) decomposition of the computational domain minimizes the amount of network traffic involved in this bandwidth-intensive computation by limiting the amount of all-to-all communication required between processes. Advanced techniques have been incorporated and implemented in a GPU-centric code design, while allowing end users the flexibility of parameter control at runtime in
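The core idea of an FFT-based Poisson solve with periodic boundaries can be sketched on a single CPU as follows. This is an assumption-laden illustration in NumPy, not the FLASH/PFFT code: it handles only a square two-dimensional grid, fixes the free constant by returning a zero-mean solution, and the function name `solve_poisson_periodic` is hypothetical. The domain decomposition and GPU offload described above are not shown.

```python
import numpy as np

def solve_poisson_periodic(rhs, length=2.0 * np.pi):
    """Solve laplacian(u) = rhs on a periodic n-by-n grid; returns zero-mean u."""
    n = rhs.shape[0]
    # Angular wavenumbers for an n-point periodic grid of the given side length.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    rhs_hat = np.fft.fft2(rhs)
    k2[0, 0] = 1.0        # placeholder to avoid dividing by zero at the mean mode
    u_hat = -rhs_hat / k2  # in Fourier space the Laplacian is multiplication by -|k|^2
    u_hat[0, 0] = 0.0      # choose the zero-mean solution
    return np.real(np.fft.ifft2(u_hat))
```

The right-hand side must itself have zero mean for a periodic solution to exist; the sketch silently discards any mean component.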
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-01
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limits the application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated into two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in the GPU's shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce the frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged from 1.75 to 2.03 times that of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
GPU-based fast Monte Carlo simulation for radiotherapy dose calculation.
Jia, Xun; Gu, Xuejun; Graves, Yan Jiang; Folkerts, Michael; Jiang, Steve B
2011-11-21
Monte Carlo (MC) simulation is commonly considered to be the most accurate dose calculation method in radiotherapy. However, its efficiency still requires improvement for many routine clinical applications. In this paper, we present our recent progress toward the development of a graphics processing unit (GPU)-based MC dose calculation package, gDPM v2.0. It utilizes the parallel computation ability of a GPU to achieve high efficiency, while maintaining the same particle transport physics as in the original dose planning method (DPM) code and hence the same level of simulation accuracy. In GPU computing, divergence of execution paths between threads can considerably reduce the efficiency. Since photons and electrons undergo different physics and hence attain different execution paths, we use a simulation scheme where photon transport and electron transport are separated to partially relieve the thread divergence issue. A high-performance random number generator and a hardware linear interpolation are also utilized. We have also developed various components to handle the fluence map and linac geometry, so that gDPM can be used to compute dose distributions for realistic IMRT or VMAT treatment plans. Our gDPM package is tested for its accuracy and efficiency in both phantoms and realistic patient cases. In all cases, the average relative uncertainties are less than 1%. A statistical t-test is performed and the dose difference between the CPU and the GPU results is not found to be statistically significant in over 96% of the high dose region and over 97% of the entire region. Speed-up factors of 69.1 ∼ 87.2 have been observed using an NVIDIA Tesla C2050 GPU card against a 2.27 GHz Intel Xeon CPU processor. For realistic IMRT and VMAT plans, MC dose calculation can be completed with less than 1% standard deviation in 36.1 ∼ 39.6 s using gDPM. PMID:22016026
Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA.
Mrozek, Dariusz; Brożek, Miłosz; Małysiak-Mrozek, Bożena
2014-02-01
Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT ("GPU-CASSERT") parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm. PMID:24481593
Accelerating forward and adjoint simulations of seismic wave propagation on large GPU-clusters
NASA Astrophysics Data System (ADS)
Peter, D. B.; Rietmann, M.; Charles, J.; Messmer, P.; Komatitsch, D.; Schenk, O.; Tromp, J.
2012-12-01
In seismic tomography, waveform inversions require accurate simulations of seismic wave propagation in complex media. The current versions of our spectral-element method (SEM) packages, the local-scale code SPECFEM3D and the global-scale code SPECFEM3D_GLOBE, are widely used open-source community codes which simulate seismic wave propagation for local-, regional- and global-scale applications. These numerical simulations compute highly accurate seismic wavefields, accounting for fully 3D Earth models. However, code performance often governs whether seismic inversions become feasible or remain elusive. We report here on extending these high-order finite-element packages to further exploit graphics processing units (GPUs) and perform numerical simulations of seismic wave propagation on large GPU clusters. These enhanced packages can be readily run either on multi-core CPUs only or together with many-core GPU acceleration devices. One of the challenges in parallelizing finite element codes is the potential for race conditions during the assembly phase. We therefore investigated different methods such as mesh coloring or atomic updates on the GPU. In order to achieve strong scaling, we needed to ensure good overlap of data motion at all levels, including internode and host-accelerator transfers. These new MPI/CUDA solvers exhibit excellent scalability and achieve speedup on a node-to-node basis over the carefully tuned equivalent multi-core MPI solver. We present case studies run on a Cray XK6 GPU architecture up to 896 nodes to demonstrate the performance of both the forward and adjoint functionality of the code packages. Running simulations on such dedicated GPU clusters further reduces computation times and pushes seismic inversions into a new, higher frequency realm.
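The mesh-coloring idea mentioned above can be illustrated with a simple greedy scheme: elements that share a node receive different colors, so all elements of one color can be assembled concurrently without two threads writing to the same node. This is a generic sketch, not the SPECFEM3D implementation; the function name `color_elements` and the tuple-based mesh representation are assumptions made for illustration.

```python
def color_elements(elements):
    """Greedy coloring: elements sharing a node never get the same color.

    elements: list of tuples of global node indices, one tuple per element.
    Returns a list holding one color index per element.
    """
    # Map each node to the elements that touch it.
    node_to_elems = {}
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems.setdefault(n, []).append(e)

    colors = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colors already used by neighbours that share at least one node.
        taken = {colors[o] for n in nodes for o in node_to_elems[n] if colors[o] >= 0}
        c = 0
        while c in taken:
            c += 1
        colors[e] = c
    return colors

# A chain of four 1D elements needs only two colors.
print(color_elements([(0, 1), (1, 2), (2, 3), (3, 4)]))  # [0, 1, 0, 1]
```

Assembly then loops over colors sequentially and over the elements within each color in parallel; atomic updates, the alternative named in the abstract, avoid the coloring pass at the cost of serialized writes to shared nodes.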
SU-E-T-558: Monte Carlo Photon Transport Simulations On GPU with Quadric Geometry
Chi, Y; Tian, Z; Jiang, S; Jia, X
2015-06-15
Purpose: Monte Carlo simulation on GPU has experienced rapid advancements over the past few years and tremendous accelerations have been achieved. Yet existing packages were developed only in voxelized geometry. In some applications, e.g. radioactive seed modeling, simulations in more complicated geometry are needed. This abstract reports our initial efforts towards developing a quadric geometry module aiming at expanding the application scope of GPU-based MC simulations. Methods: We defined the simulation geometry as consisting of a number of homogeneous bodies, each specified by its material composition and limiting surfaces characterized by quadric functions. A tree data structure was utilized to define the geometric relationship between different bodies. We modified our GPU-based photon MC transport package to incorporate this geometry. Specifically, geometry parameters were loaded into GPU’s shared memory for fast access. Geometry functions were rewritten to enable the identification of the body that contains the current particle location via a fast searching algorithm based on the tree data structure. Results: We tested our package in an example problem of HDR-brachytherapy dose calculation for a shielded cylinder. The dose under the quadric geometry and that under the voxelized geometry agreed in 94.2% of total voxels within the 20% isodose line based on a statistical t-test (95% confidence level), where the reference dose was defined to be the one at 0.5 cm away from the cylinder surface. It took 243 s to transport 100 million source photons under this quadric geometry on an NVIDIA Titan GPU card. Compared with the simulation time of 99.6 s in the voxelized geometry, including quadric geometry reduced efficiency due to the complicated geometry-related computations. Conclusion: Our GPU-based MC package has been extended to support photon transport simulation in quadric geometry. Satisfactory accuracy was observed with a reduced efficiency. Developments for charged
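A basic building block of particle navigation in quadric geometry is the distance from a particle to a quadric surface along its flight direction, which reduces to solving a quadratic in the path length. The sketch below is a generic illustration rather than the package's geometry functions; it assumes the surface is written as f(p) = pᵀQp + b·p + c with a symmetric 3 × 3 matrix Q, and the name `distance_to_quadric` is hypothetical.

```python
import math
import numpy as np

def distance_to_quadric(origin, direction, Q, b, c, eps=1e-12):
    """Smallest positive t with f(origin + t*direction) = 0, or math.inf if none.

    The surface is f(p) = p^T Q p + b . p + c, with Q a symmetric 3x3 matrix.
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    Q = np.asarray(Q, dtype=float)
    b = np.asarray(b, dtype=float)
    # Substituting p = o + t*d into f gives A*t^2 + B*t + C = 0.
    A = d @ Q @ d
    B = 2.0 * (o @ Q @ d) + b @ d
    C = o @ Q @ o + b @ o + c
    if abs(A) < eps:
        # The ray sees the surface as linear (or never meets it).
        if abs(B) < eps:
            return math.inf
        t = -C / B
        return t if t > eps else math.inf
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return math.inf
    sq = math.sqrt(disc)
    for t in sorted(((-B - sq) / (2.0 * A), (-B + sq) / (2.0 * A))):
        if t > eps:
            return t
    return math.inf
```

For a unit sphere (Q the identity, b zero, c = -1), a particle at (-3, 0, 0) moving along +x reaches the surface at t = 2. A transport code would take the minimum of this distance over the limiting surfaces of the current body.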
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach
Tian, Z; Shi, F; Jia, X; Jiang, S; Peng, F
2014-06-01
Purpose: GPUs have been employed to speed up VMAT optimizations from hours to minutes. However, their limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose multi-GPU-based VMAT optimization to solve this memory issue and make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred to as the pricing problem, PP) and optimizing aperture intensities (referred to as the master problem, MP). The PP requires access to the large DDC matrix, which is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of beam angles and is only responsible for calculations related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. The MP is a relatively small-scale problem and is implemented on one GPU. One head-and-neck cancer case was used for testing. Three different strategies for VMAT optimization on a single GPU were also implemented for comparison: (S1) truncating the DDC matrix to ignore its small-value entries for optimization; (S2) transferring the DDC matrix part by part to the GPU during optimization whenever needed; (S3) moving DDC-matrix-related calculations onto the CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality is worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as our method. However, the computation time is longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPU-based ultra-fast VMAT planning practical for real clinical use.
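The partitioning described in the Methods can be mimicked on the CPU: split the DDC matrix by beam angle, let each worker multiply only its own sub-matrix, and sum the partial doses in a reduction step. The sketch below is a hedged illustration in NumPy, not the authors' code. It assumes that matrix columns correspond to beamlets ordered by beam angle and rows to voxels, and the name `dose_by_angle_blocks` is hypothetical; a Python loop stands in for the GPUs.

```python
import numpy as np

def dose_by_angle_blocks(ddc, beamlet_weights, n_devices):
    """Compute dose = ddc @ beamlet_weights using column blocks.

    Each block of columns (beamlets grouped by beam angle) plays the role of
    the sub-matrix stored on one GPU; the final sum plays the role of the
    parallel reduction across devices.
    """
    col_blocks = np.array_split(np.arange(ddc.shape[1]), n_devices)
    partial_doses = []
    for cols in col_blocks:
        sub_matrix = ddc[:, cols]            # the only part this worker stores
        partial_doses.append(sub_matrix @ beamlet_weights[cols])
    return np.sum(partial_doses, axis=0)     # reduction over workers
```

Because every worker touches only its own columns, the memory needed per device falls roughly in proportion to the number of devices, while the result is identical to the full matrix-vector product.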
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit
NASA Astrophysics Data System (ADS)
Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.
2013-12-01
Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI, in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a
Ha, S.; Matej, S.; Ispiryan, M.; Mueller, K.
2013-01-01
We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis-aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators is slightly faster than, or approximately comparable to, FFT-based approaches using state-of-the-art FFTW routines. Most importantly, however, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with. PMID:23531763
A real-time GPU implementation of the SIFT algorithm for large-scale video analysis tasks
NASA Astrophysics Data System (ADS)
Fassold, Hannes; Rosner, Jakub
2015-02-01
The SIFT algorithm is one of the most popular feature extraction methods and is therefore widely used in all sorts of video analysis tasks like instance search and duplicate/near-duplicate detection. We present an efficient GPU implementation of the SIFT descriptor extraction algorithm using CUDA. The major steps of the algorithm are presented and for each step we describe how to parallelize it massively and efficiently, how to take advantage of the unique capabilities of the GPU like shared memory and texture memory, and how to avoid or minimize common GPU performance pitfalls. We compare the GPU implementation with the reference CPU implementation in terms of runtime and quality and achieve a speedup factor of approximately 3 - 5 for SD and 5 - 6 for Full HD video with respect to a multi-threaded CPU implementation, allowing us to run the SIFT descriptor extraction algorithm in real-time on SD video. Furthermore, quality tests show that the GPU implementation gives the same quality as the reference CPU implementation from the HessSIFT library. We further describe the benefits of GPU-accelerated SIFT descriptor calculation for video analysis applications such as near-duplicate video detection.
Xu, Daguang; Huang, Yong; Kang, Jin U
2014-06-16
We implemented graphics processing unit (GPU)-accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). The Kaiser-Bessel (KB) function and the Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT-based CS SD OCT, it shows smaller background noise and fewer side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT-based CS SD OCT with frame size 2048 (axial) × 1000 (lateral). PMID:24977582
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories is presented, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J. Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing units (GPUs). We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
HOOMD-blue - scaling up from one desktop GPU to Titan
NASA Astrophysics Data System (ADS)
Glaser, Jens; Anderson, Joshua A.; Glotzer, Sharon C.
2014-03-01
Scaling molecular dynamics simulations from one to many GPUs presents unique challenges. Due to the high parallel efficiency of a single GPU, communication processes become a bottleneck when multiple GPUs are combined in parallel and limit scaling. We show how the fastest general-purpose molecular dynamics code currently available for single GPUs, HOOMD-blue, has been extended using spatial domain decomposition to run efficiently on tens or hundreds of GPUs. A key to parallel efficiency is a highly optimized communication pattern using locally load-balancing algorithms fully implemented on the GPU. We will discuss comparisons to other state-of-the-art codes (LAMMPS) and present preliminary benchmarks on the Titan supercomputer.
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
NASA Technical Reports Server (NTRS)
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024² grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
NASA Astrophysics Data System (ADS)
Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.
2014-11-01
Existing research on patient-specific computational hemodynamics (PSCH) relies heavily on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases the computational burden and introduces inaccuracy during data transformation, thus limiting the medical applications of PSCH. We use the lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments feed seamlessly into the LBM-based CFD computation of PSCH, so explicit mesh construction and extra data management are avoided. The LBM is ideally suited for GPU (graphics processing unit)-based parallel computing. The parallel acceleration on the GPU achieves excellent performance in PSCH computation. An application study will be presented which segments an aortic artery from a chest CT dataset and models PSCH of the segmented artery.
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations.
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B; Jia, Xun
2015-10-21
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be preferable for GPU-based MC dose engines to a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and resided in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum
Rapid and automatic 3D body measurement system based on a GPU-Steger line detector.
Liu, Xingjian; Zhao, Hengshuang; Zhan, Guomin; Zhong, Kai; Li, Zhongwei; Chao, Yuhjin; Shi, Yusheng
2016-07-20
This paper proposes a rapid and automatic measurement system to acquire the 3D shape of a human body. A flexible calibration method was developed to decrease the complexity of system calibration. To reduce the computation cost, a GPU-Steger line detector was proposed to detect the center of the laser pattern more rapidly and at the subpixel level. The processing time of line detection is significantly shortened by the GPU-Steger line detector, which can be over 110 times faster than the CPU version. The key technologies are introduced, and the experimental results are presented in this paper to illustrate the performance of the proposed system. The system can be used to measure human body surfaces with nonuniform reflectance such as hair, skin, and clothes with rich texture. PMID:27463902
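The subpixel step of a Steger-type detector rests on a second-order Taylor expansion of the intensity around the brightest pixel. The sketch below shows a one-dimensional simplification of that idea on a single intensity profile across the laser line. It is an illustration only: the full detector works in 2D with Gaussian-derivative filters and uses the Hessian to find the line normal, none of which is reproduced here, and the name `subpixel_peak` is hypothetical.

```python
import numpy as np

def subpixel_peak(profile):
    """Estimate the line center of a 1D intensity profile with subpixel accuracy.

    Fits a second-order Taylor expansion around the brightest interior sample
    and returns the position where its first derivative vanishes.
    """
    f = np.asarray(profile, dtype=float)
    i = int(np.argmax(f[1:-1])) + 1            # brightest interior sample
    d1 = 0.5 * (f[i + 1] - f[i - 1])           # central first derivative
    d2 = f[i + 1] - 2.0 * f[i] + f[i - 1]      # central second derivative
    if d2 >= 0.0:
        return float(i)                        # not a proper maximum; keep the pixel
    return i - d1 / d2                         # zero of the local quadratic model
```

For a profile sampled from a parabola centered at 3.3, the estimate recovers 3.3 up to rounding error. Each image row or column can be processed independently, which is what makes the computation a natural fit for one GPU thread per profile.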
How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms
NASA Astrophysics Data System (ADS)
Cruz, Felipe A.; Layton, Simon K.; Barba, L. A.
2011-10-01
Computing on graphics processors is perhaps one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. As then, the opportunity comes with challenges. The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU. The end result has been GPU kernels that run at over 500 Gop/s on one NVIDIA Tesla C1060 card, thereby reaching close to practical peak.
A GPU-accelerated adaptive discontinuous Galerkin method for level set equation
NASA Astrophysics Data System (ADS)
Karakus, A.; Warburton, T.; Aksel, M. H.; Sert, C.
2016-01-01
This paper presents a GPU-accelerated nodal discontinuous Galerkin method for the solution of two- and three-dimensional level set (LS) equation on unstructured adaptive meshes. Using adaptive mesh refinement, computations are localised mostly near the interface location to reduce the computational cost. Small global time step size resulting from the local adaptivity is avoided by local time-stepping based on a multi-rate Adams-Bashforth scheme. Platform independence of the solver is achieved with an extensible multi-threading programming API that allows runtime selection of different computing devices (GPU and CPU) and different threading interfaces (CUDA, OpenCL and OpenMP). Overall, a highly scalable, accurate and mass conservative numerical scheme that preserves the simplicity of LS formulation is obtained. Efficiency, performance and local high-order accuracy of the method are demonstrated through distinct numerical test cases.
GPU-Based Parallelized Solver for Large Scale Vascular Blood Flow Modeling and Simulations.
Santhanam, Anand P; Neylon, John; Eldredge, Jeff; Teran, Joseph; Dutson, Erik; Benharash, Peyman
2016-01-01
Cardiovascular blood flow simulations are essential in understanding blood flow behavior during normal and disease conditions. To date, such blood flow simulations have only been done at a macro-scale level due to computational limitations. In this paper, we present a GPU-based large-scale solver that enables modeling the flow even in the smallest arteries. A mechanical equivalent of the circuit-based flow modeling system is first developed to employ the GPU computing framework. Numerical studies were performed using a set of 10 million connected vascular elements. Run-time flow analyses were performed to simulate vascular blockages, as well as arterial cut-off. Our results showed that we can achieve ~100 FPS using a GTX 680m and ~40 FPS using a Tegra K1 computing platform. PMID:27046603
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
NASA Astrophysics Data System (ADS)
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. The Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that allows GPUs to be programmed directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up point of view of parallelism in which a thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has
APES-based procedure for super-resolution SAR imagery with GPU parallel computing
NASA Astrophysics Data System (ADS)
Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao
2015-10-01
The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with the conventional fast Fourier transform (FFT), APES results in lower sidelobes and narrower spectral peaks. However, in synthetic aperture radar (SAR) imaging of large scenes, without parallel computation, it is difficult to apply APES directly to super-resolution radar image processing due to its great amount of calculation. In this paper, a procedure is proposed to achieve target extraction and parallel computing of APES for super-resolution SAR imaging. Numerical experiments are carried out on a Tesla K40C with 745 MHz GPU clock rate and 2880 CUDA cores. Results for SAR images show that the GPU-parallel APES is remarkably more efficient than the CPU-based implementation at the same super-resolution.
GPU-Accelerated Finite Element Method for Modelling Light Transport in Diffuse Optical Tomography
Schweiger, Martin
2011-01-01
We introduce a GPU-accelerated finite element forward solver for the computation of light transport in scattering media. The forward model is the computationally most expensive component of iterative methods for image reconstruction in diffuse optical tomography, and performance optimisation of the forward solver is therefore crucial for improving the efficiency of the solution of the inverse problem. The GPU forward solver uses a CUDA implementation that evaluates on the graphics hardware the sparse linear system arising in the finite element formulation of the diffusion equation. We present solutions for both time-domain and frequency-domain problems. A comparison with a CPU-based implementation shows significant performance gains of the graphics accelerated solution, with improvements of approximately a factor of 10 for double-precision computations, and factors beyond 20 for single-precision computations. The gains are also shown to be dependent on the mesh complexity, where the largest gains are achieved for high mesh resolutions. PMID:22013431
A block-wise approximate parallel implementation for ART algorithm on CUDA-enabled GPU.
Fan, Zhongyin; Xie, Yaoqin
2015-01-01
Computed tomography (CT) has been widely used to acquire volumetric anatomical information in the diagnosis and treatment of illnesses in many clinics. However, the ART algorithm for reconstruction from under-sampled and noisy projections is still time-consuming. It is the goal of our work to improve a block-wise approximate parallel implementation for the ART algorithm on CUDA-enabled GPU to make the ART algorithm applicable to the clinical environment. The resulting method has several compelling features: (1) the rays are allotted into blocks, making the rays in the same block parallel; (2) GPU implementation caters to the actual industrial and medical application demand. We test the algorithm on a digital Shepp-Logan phantom, and the results indicate that our method is more efficient than the existing CPU implementation. The high computation efficiency achieved in our algorithm makes it possible for clinicians to obtain real-time 3D images. PMID:26405857
GPU acceleration of melody accurate matching in query-by-humming.
Xiao, Limin; Zheng, Yao; Tang, Wenqi; Yao, Guangchao; Ruan, Li
2014-01-01
With the increasing scale of the melody database, the query-by-humming system faces a trade-off between response speed and retrieval accuracy. Melody accurate matching is the key factor restricting the response speed. In this paper, we present a GPU acceleration method for melody accurate matching, in order to improve the response speed without reducing retrieval accuracy. The method develops two parallel strategies (intra-task parallelism and inter-task parallelism) to obtain accelerated effects. The efficiency of our method is validated through extensive experiments. Evaluation results show that our single GPU implementation achieves a 20x to 40x speedup ratio when compared to a typical general purpose CPU's execution time. PMID:24693239
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations
NASA Astrophysics Data System (ADS)
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B.; Jia, Xun
2015-10-01
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be preferable to a phase-space file-based model for GPU-based MC dose engines, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and resided in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum
TH-E-BRE-08: GPU-Monte Carlo Based Fast IMRT Plan Optimization
Li, Y; Tian, Z; Shi, F; Jiang, S; Jia, X
2014-06-15
Purpose: Intensity-modulated radiation treatment (IMRT) plan optimization needs pre-calculated beamlet dose distributions. Pencil-beam or superposition/convolution type algorithms are typically used because of their high computation speed. However, inaccurate beamlet dose distributions, particularly in cases with high levels of inhomogeneity, may mislead optimization, hindering the resulting plan quality. It is desirable to use Monte Carlo (MC) methods for beamlet dose calculations. Yet, the long computational time from repeated dose calculations for a number of beamlets prevents this application. It is our objective to integrate a GPU-based MC dose engine in lung IMRT optimization using a novel two-step workflow. Methods: A GPU-based MC code gDPM is used. Each particle is tagged with the index of the beamlet from which the source particle originates. Deposited doses are stored separately for beamlets based on the index. Due to limited GPU memory size, a pyramid space is allocated for each beamlet, and dose outside the space is neglected. A two-step optimization workflow is proposed for fast MC-based optimization. In the first step, rough beamlet dose calculations are conducted with only a small number of particles per beamlet. Plan optimization follows to get an approximate fluence map. In the second step, more accurate beamlet doses are calculated, where the number of particles sampled for a beamlet is proportional to the intensity determined previously. A second-round optimization is conducted, yielding the final result. Results: For a lung case with 5317 beamlets, 10^5 particles per beamlet in the first round, and 10^8 particles per beam in the second round are enough to get a good plan quality. The total simulation time is 96.4 sec. Conclusion: A fast GPU-based MC dose calculation method along with a novel two-step optimization workflow is developed. The high efficiency allows the use of MC for IMRT optimizations.
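The second step of the workflow above samples particles for each beamlet in proportion to the intensity found in the first round. A minimal sketch of that allocation follows, assuming a fixed total particle budget and plain rounding down; the function name `allocate_particles`, the budget, and the example fluence values are hypothetical and are not taken from the abstract or from gDPM.

```python
def allocate_particles(fluence, total_particles):
    """Split a particle budget across beamlets in proportion to fluence intensity.

    `fluence` holds one non-negative intensity per beamlet (hypothetical values
    standing in for the first-round fluence map).
    """
    if any(w < 0 for w in fluence):
        raise ValueError("fluence intensities must be non-negative")
    total = sum(fluence)
    if total == 0:
        # No beamlet carries intensity, so nothing is worth simulating.
        return [0] * len(fluence)
    # Round down so the allocation never exceeds the budget.
    return [int(total_particles * w / total) for w in fluence]


# Illustrative first-round fluence map for four beamlets.
first_round_fluence = [0.0, 1.0, 3.0, 4.0]
second_round_counts = allocate_particles(first_round_fluence, total_particles=8000)
print(second_round_counts)
```

Beamlets that the first optimization switches off receive no particles in this sketch, which is one way the second round can concentrate effort where the plan actually delivers dose.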
APEnet+: a 3D Torus network optimized for GPU-based HPC Systems
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.
2012-12-01
In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative, the QUonG project, whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so-called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/TFLOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.
GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
NASA Astrophysics Data System (ADS)
Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.
2014-07-01
There is a significant reduction of processing time and speedup of performance in computer graphics with the emergence of Graphic Processing Units (GPUs). GPUs have been developed to surpass the Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and research where highly parallel GPUs are used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphic Processing Units (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs), which capture the rate of change between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge-Kutta-Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as the embedded estimate technique. This paper delves into the implementation of the RKF scheme for GPU devices and compares its results with the Dormand-Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method are discussed in the final section of this paper.
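The embedded error estimate mentioned above can be illustrated with a single Runge-Kutta-Fehlberg 4(5) step, which evaluates six stages and forms a fourth-order and a fifth-order solution from them; their difference estimates the local truncation error. The sketch below is a scalar, fixed-step CPU version using the standard Fehlberg coefficients. It is not the paper's pseudo code or GPU kernel, and the helper names `rkf45_step` and `integrate` and the test problem y' = -y are assumptions made for illustration.

```python
import math


def rkf45_step(f, t, y, h):
    """One RKF4(5) step for a scalar ODE y' = f(t, y).

    Returns the fourth-order solution and the embedded error estimate
    |y5 - y4|, where y5 is the fifth-order solution built from the same stages.
    """
    k1 = h * f(t, y)
    k2 = h * f(t + h / 4, y + k1 / 4)
    k3 = h * f(t + 3 * h / 8, y + 3 * k1 / 32 + 9 * k2 / 32)
    k4 = h * f(t + 12 * h / 13, y + (1932 * k1 - 7200 * k2 + 7296 * k3) / 2197)
    k5 = h * f(t + h, y + 439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104)
    k6 = h * f(
        t + h / 2,
        y - 8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565 + 1859 * k4 / 4104 - 11 * k5 / 40,
    )
    y4 = y + 25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5
    y5 = (y + 16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
          - 9 * k5 / 50 + 2 * k6 / 55)
    return y4, abs(y5 - y4)


def integrate(f, t0, y0, t_end, n_steps):
    """Fixed-step integration that also records the largest per-step estimate."""
    h = (t_end - t0) / n_steps
    t, y, worst = t0, y0, 0.0
    for _ in range(n_steps):
        y, err = rkf45_step(f, t, y, h)
        worst = max(worst, err)
        t += h
    return y, worst


# Assumed test problem: y' = -y, y(0) = 1, whose exact solution is exp(-t).
y_end, worst_err = integrate(lambda t, y: -y, 0.0, 1.0, 1.0, 10)
print(y_end, math.exp(-1.0), worst_err)
```

In an adaptive solver the estimate would drive step-size control; the fixed step here only keeps the sketch short.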
Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.
Zheng, Mo; Li, Xiaoxia; Guo, Li
2013-04-01
Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. PMID:23454611
GPU-based parallel implementation of 5-layer thermal diffusion scheme
NASA Astrophysics Data System (ADS)
Huang, Melin; Mielikainen, Jarno; Huang, Bormin; Huang, H.-L. A.; Goldberg, Mitchell D.
2012-10-01
The Weather Research and Forecasting (WRF) model is a system of numerical weather prediction and atmospheric simulation with dual purposes for forecasting and research. The WRF software infrastructure consists of several components such as dynamic solvers and physical simulation modules. WRF includes several Land-Surface Models (LSMs). The LSMs use atmospheric information, the radiative and precipitation forcing from the surface layer scheme, the radiation scheme, and the microphysics/convective scheme all together with the land's state variables and land-surface properties, to provide heat and moisture fluxes over land and sea-ice points. The WRF 5-layer thermal diffusion simulation is an LSM based on the MM5 5-layer soil temperature model with an energy budget that includes radiation, sensible, and latent heat flux. The WRF LSMs are very suitable for massively parallel computation as there are no interactions among horizontal grid points. More and more scientific applications have adopted graphics processing units (GPUs) to accelerate the computing performance. This study demonstrates our GPU massively parallel computation efforts on the WRF 5-layer thermal diffusion scheme. Since this scheme is only an intermediate module of the entire WRF model, I/O transfer is not involved in the intermediate process. Without data transfer, this module can achieve a speedup of 36x with one GPU and 108x with four GPUs as compared to a single threaded CPU processor. With a CPU/GPU hybrid strategy, this module can accomplish an even higher speedup, ~114x with one GPU and ~240x with four GPUs. Meanwhile, we are seeking other approaches to improve the speeds.
A study of potential numerical pitfalls in GPU-based Monte Carlo dose calculation
NASA Astrophysics Data System (ADS)
Magnoux, Vincent; Ozell, Benoît; Bonenfant, Éric; Després, Philippe
2015-07-01
The purpose of this study was to evaluate the impact of numerical errors caused by the floating point representation of real numbers in a GPU-based Monte Carlo code used for dose calculation in radiation oncology, and to identify situations where this type of error arises. The program used as a benchmark was bGPUMCD. Three tests were performed on the code, which was divided into three functional components: energy accumulation, particle tracking and physical interactions. First, the impact of single-precision calculations was assessed for each functional component. Second, a GPU-specific compilation option that reduces execution time as well as precision was examined. Third, a specific function used for tracking and potentially more sensitive to precision errors was tested by comparing it to a very high-precision implementation. Numerical errors were found in two components of the program. Because of the energy accumulation process, a few voxels surrounding a radiation source end up with a lower computed dose than they should. The tracking system contained a series of operations that abnormally amplify rounding errors in some situations. This resulted in some rare instances (less than 0.1%) of computed distances that are exceedingly far from what they should have been. Most errors detected had no significant effects on the result of a simulation due to its random nature, either because they cancel each other out or because they only affect a small fraction of particles. The results of this work can be extended to other types of GPU-based programs and be used as guidelines to avoid numerical errors on the GPU computing platform.
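One mechanism consistent with the energy accumulation finding above is that, in single precision, a small deposit added to an already large voxel total can be rounded away entirely. The sketch below emulates single-precision accumulation with the standard library by rounding to 32 bits after every addition. It is an illustration only and does not reproduce bGPUMCD; the starting total, deposit size, and helper names are assumptions.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_float32(start, deposit, n):
    """Add `deposit` to `start` n times, rounding to single precision each time."""
    total = to_float32(start)
    for _ in range(n):
        total = to_float32(total + deposit)
    return total


def accumulate_float64(start, deposit, n):
    """Same accumulation carried out in double precision."""
    total = start
    for _ in range(n):
        total += deposit
    return total


# Assumed numbers: at 2**25 adjacent single-precision values are 4 apart,
# so a deposit of 1.0 is lost on every addition.
start = float(2 ** 25)
single = accumulate_float32(start, 1.0, 1000)
double = accumulate_float64(start, 1.0, 1000)
print(single, double)
```

The single-precision total never moves, while the double-precision total gains all 1000 deposits, which is the direction of error (a lower computed value) described for voxels near a source.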
Planet-Disk Interaction on the GPU: The FARGO3D code
NASA Astrophysics Data System (ADS)
Masset, F. S.; Benítez-Llambay, P.
2015-10-01
We present the new code FARGO3D. It is a finite difference code that solves the equations of hydrodynamics or magnetohydrodynamics on a Cartesian, cylindrical or spherical mesh. It features orbital advection, conserves mass and (angular) momentum to machine accuracy. Special emphasis is put on the description of planet disk tidal interactions. It is parallelized with MPI, and it can run indistinctly on CPUs or GPUs, without the need to program in a GPU oriented language.
G-NetMon: a GPU-accelerated network performance monitoring system
Wu, Wenji; DeMar, Phil; Holmgren, Don; Singh, Amitoj; /Fermilab
2011-06-01
At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that our G-NetMon can rapidly detect sub-optimal bulk data movements.
NASA Astrophysics Data System (ADS)
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-01
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limits the application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated into two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in GPU’s shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce the frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time was 1.75–2.03 times that of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
A GPU Accelerated Discontinuous Galerkin Conservative Level Set Method for Simulating Atomization
NASA Astrophysics Data System (ADS)
Jibben, Zechariah J.
This dissertation describes a process for interface capturing via an arbitrary-order, nearly quadrature free, discontinuous Galerkin (DG) scheme for the conservative level set method (Olsson et al., 2005, 2008). The DG numerical method is utilized to solve both advection and reinitialization, and executed on a refined level set grid (Herrmann, 2008) for effective use of processing power. Computation is executed in parallel utilizing both CPU and GPU architectures to make the method feasible at high order. Finally, a sparse data structure is implemented to take full advantage of parallelism on the GPU, where performance relies on well-managed memory operations. With solution variables projected into a kth order polynomial basis, a k + 1 order convergence rate is found for both advection and reinitialization tests using the method of manufactured solutions. Other standard test cases, such as Zalesak's disk and deformation of columns and spheres in periodic vortices are also performed, showing several orders of magnitude improvement over traditional WENO level set methods. These tests also show the impact of reinitialization, which often increases shape and volume errors as a result of level set scalar trapping by normal vectors calculated from the local level set field. Accelerating advection via GPU hardware is found to provide a 30x speedup factor comparing a 2.0GHz Intel Xeon E5-2620 CPU in serial vs. a Nvidia Tesla K20 GPU, with speedup factors increasing with polynomial degree until shared memory is filled. A similar algorithm is implemented for reinitialization, which relies on heavier use of shared and global memory and as a result fills them more quickly and produces smaller speedups of 18x.
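The k + 1 convergence rate reported above is the kind of quantity usually read off from errors measured on two grid spacings. A minimal sketch of that calculation follows; the function name `observed_order` is hypothetical, and the error values are synthetic numbers generated from an assumed error law e = C h^(k+1) with k = 2, not results from the dissertation.

```python
import math


def observed_order(h_coarse, err_coarse, h_fine, err_fine):
    """Observed order of convergence from errors on two grid spacings."""
    return math.log(err_coarse / err_fine) / math.log(h_coarse / h_fine)


# Synthetic errors following e = C * h**(k + 1) with k = 2 and C = 0.5
# (illustrative values only).
k = 2
h_coarse, h_fine = 0.1, 0.05
err_coarse = 0.5 * h_coarse ** (k + 1)
err_fine = 0.5 * h_fine ** (k + 1)

order = observed_order(h_coarse, err_coarse, h_fine, err_fine)
print(order)
```

With manufactured solutions the exact answer is known, so the errors fed to such a function can be computed directly rather than estimated.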
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy
NASA Astrophysics Data System (ADS)
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-01
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
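The γ-index used above combines a dose-difference criterion and a distance-to-agreement criterion into one number per point, and the passing rate is the fraction of points with γ ≤ 1. The sketch below is a simplified one-dimensional version that searches only the sampled grid points (no interpolation) and normalizes the dose criterion to the maximum reference dose. Both choices, the function names, and the example profiles are assumptions; the abstract does not describe its evaluation at this level of detail.

```python
import math


def gamma_index_1d(positions, dose_ref, dose_eval, dose_tol, dist_tol):
    """Gamma value at each reference point of a 1D dose profile.

    dose_tol is a fraction of the maximum reference dose (global normalization);
    dist_tol is in the same units as `positions`.
    """
    dose_crit = dose_tol * max(dose_ref)
    gammas = []
    for x_r, d_r in zip(positions, dose_ref):
        # Smallest combined distance/dose mismatch over all evaluated points.
        best = min(
            math.hypot((x_e - x_r) / dist_tol, (d_e - d_r) / dose_crit)
            for x_e, d_e in zip(positions, dose_eval)
        )
        gammas.append(best)
    return gammas


def passing_rate(gammas):
    """Fraction of points with gamma <= 1."""
    return sum(g <= 1.0 for g in gammas) / len(gammas)


# Illustrative profiles on a 1 mm grid (made-up numbers).
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
reference = [0.0, 50.0, 100.0, 50.0, 0.0]
evaluated = [0.0, 50.0, 110.0, 50.0, 0.0]  # 10% too high at the peak

rate_same = passing_rate(gamma_index_1d(positions, reference, reference, 0.03, 3.0))
rate_diff = passing_rate(gamma_index_1d(positions, reference, evaluated, 0.03, 3.0))
print(rate_same, rate_diff)
```

Under the 3%/3 mm criterion in this toy case only the peak fails, because neither a nearby point nor the local dose is close enough to the reference value there.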
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.
2014-06-01
APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
GPU techniques applied to Euler flow simulations and comparison to CPU performance
NASA Astrophysics Data System (ADS)
Koop, Blake
With the decrease in cost of computing, and the increasingly friendly programming environments, the demand for computer generated models of real world problems has surged. Each generation of computer hardware becomes marginally faster than its predecessor, allowing for decreases in required computation time. However, the progression is slowing and will soon reach a barrier as lithography reaches its natural limits. General Purpose Graphics Processing Unit (GPGPU) programming, rather than traditional programming written for Central Processing Unit (CPU) architectures, may be a viable way for computational scientists to continue to realize wall clock time reductions at a Moore's Law pace. If a code can be modified to take advantage of the Single-Instruction-Multiple-Data (SIMD) architecture of Graphics Processing Units (GPUs), it may be possible to gain the functionality of hundreds or thousands of cores available on a GPU card. This paper details the investigation of a specific compressible flow simulation and its functionality in both CPU and GPU programming schemes. The flow is governed by the unsteady Euler flow equations and it is checked for validity against the known solution in all three directions. It is then run over varying grid sizes using both the CPU and GPU programming schemes to evaluate wall clock time reductions.
Acceleration of 3D Finite Difference AWP-ODC for seismic simulation on GPU Fermi Architecture
NASA Astrophysics Data System (ADS)
Zhou, J.; Cui, Y.; Choi, D.
2011-12-01
AWP-ODC, a highly scalable parallel finite-difference application, enables petascale 3D earthquake calculations. This application generates realistic dynamic earthquake source descriptions and detailed physics-based anelastic ground motions at frequencies pertinent to safe building design. In 2010, the code achieved M8, a full dynamical simulation of a magnitude-8 earthquake on the southern San Andreas fault up to 2-Hz, the largest-ever earthquake simulation. Building on the success of the previous work, we have implemented AWP-ODC in CUDA to accelerate wave propagation on the GPU platform. Our CUDA development aims at aggressive parallel efficiency and optimized global and shared memory access to make the best use of the GPU memory hierarchy. The benchmark on NVIDIA Tesla C2050 graphics cards demonstrated a speedup of many tens of times in single precision compared to the serial implementation at a testing problem size, while an MPI-CUDA implementation is in progress to extend our solver to multi-GPU clusters. Our CUDA implementation has been carefully verified for accuracy.
Tempest: GPU-CPU computing for high-throughput database spectral matching
Milloy, Jeffrey A.; Faherty, Brendan K.; Gerber, Scott A.
2012-01-01
Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically-labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new “Accelerated Score” for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer. PMID:22640374
Redundancy computation analysis and implementation of phase diversity based on GPU
NASA Astrophysics Data System (ADS)
Zhang, Quan; Bao, Hua; Rao, Changhui; Peng, Zhenming
2015-10-01
The phase diversity method is used not only as an image restoration technique but also as a wavefront sensor. However, its computations have been perceived as too burdensome for real-time applications on a desktop computer platform. In this paper, an implementation of the phase diversity algorithm based on the graphics processing unit (GPU) is presented. The redundant computations for the pupil function, point spread function, and optical transfer function are analyzed. Two GPU-based implementation methods are compared: one is the general method, accomplished with the GPU library CUFFT without precision loss (method-1), and the other is performed by our own custom FFT, which takes the redundant calculations into account at the cost of a small loss of precision (method-2). The results show that the cost and gradient functions can be sped up by method-2 relative to method-1 and that the overhead of global memory access can be reduced by kernel fusion. For an image of 256 × 256 with a sampling factor of 3, the results reveal that method-2 achieves a speedup of 1.83× compared with method-1 when the central 128 × 128 pixels of the point spread function are used.
GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues.
Ren, Nunu; Liang, Jimin; Qu, Xiaochao; Li, Jianfeng; Lu, Bingjia; Tian, Jie
2010-03-29
As the most accurate model for simulating light propagation in heterogeneous tissues, the Monte Carlo (MC) method has been widely used in the field of optical molecular imaging. However, the MC method is time-consuming because the propagation of a large number of photons through tissue must be calculated. The structural complexity of heterogeneous tissues further increases the computational time. In this paper we present a parallel implementation of MC simulation of light propagation in heterogeneous tissues whose surfaces are constructed from different numbers of triangle meshes. The code runs on graphics processing units (GPUs), is implemented with the compute unified device architecture (CUDA) platform, and is optimized to reduce access latency as much as possible by making full use of the constant memory and texture memory on the GPU. We test the implementation in homogeneous and heterogeneous mouse models with an NVIDIA GTX 260 card and a 2.40 GHz Intel Xeon CPU. The experimental results demonstrate the feasibility and efficiency of the parallel MC simulation on GPU. PMID:20389700
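The core sampling step of photon-transport Monte Carlo can be sketched in a few lines. The example below is a serial, CPU-only illustration under simplifying assumptions (a homogeneous slab, a single total attenuation coefficient `mu_t`, no scattering geometry or triangle meshes); it is not the authors' CUDA code, and the function names are hypothetical. It draws free path lengths with the standard rule s = -ln(ξ)/μ_t and checks the result against the Beer-Lambert law.

```python
import math
import random

def sample_free_path(mu_t, rng):
    """Sample the distance to the next interaction: s = -ln(xi) / mu_t."""
    xi = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return -math.log(xi) / mu_t

def unscattered_fraction(mu_t, thickness, n_photons, seed=0):
    """Fraction of photons that cross a homogeneous slab without interacting."""
    rng = random.Random(seed)
    passed = sum(
        1 for _ in range(n_photons)
        if sample_free_path(mu_t, rng) > thickness
    )
    return passed / n_photons

# The estimate should approach exp(-mu_t * thickness) as n_photons grows.
estimate = unscattered_fraction(mu_t=1.0, thickness=1.0, n_photons=200000, seed=1)
```

Each photon history is independent of the others, which is why this class of simulation maps naturally onto one GPU thread per photon.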
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures. PMID:24723812
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
NASA Astrophysics Data System (ADS)
Lyakh, Dmitry I.
2015-04-01
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). The tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
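For reference, the naïve scattering baseline that the abstract compares against can be sketched as follows. This is a plain serial version with no memory-access optimization; the row-major storage convention, the flat-list representation, and the function name are assumptions made for the example and are not taken from TAL-SH.

```python
def transpose_tensor(data, shape, perm):
    """Permute the dimensions of a dense row-major tensor stored as a flat list.

    Output dimension j corresponds to input dimension perm[j].
    """
    rank = len(shape)
    out_shape = [shape[p] for p in perm]
    # Row-major strides of the output tensor.
    out_strides = [1] * rank
    for j in range(rank - 2, -1, -1):
        out_strides[j] = out_strides[j + 1] * out_shape[j + 1]
    out = [None] * len(data)
    for offset, value in enumerate(data):
        # Decode the input multi-index from the linear offset.
        index = [0] * rank
        rem = offset
        for d in range(rank - 1, -1, -1):
            index[d] = rem % shape[d]
            rem //= shape[d]
        # Scatter the element to its permuted position.
        dest = sum(index[perm[j]] * out_strides[j] for j in range(rank))
        out[dest] = value
    return out, out_shape

# A 2 x 3 matrix transposed to 3 x 2.
matrix_t, matrix_t_shape = transpose_tensor([0, 1, 2, 3, 4, 5], [2, 3], [1, 0])
```

Reads are contiguous here but writes are strided, which is the access pattern that cache blocking on CPUs and shared-memory tiling on GPUs are meant to improve.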
GPU acceleration of prediction-based lower triangular transform for lossless compression
NASA Astrophysics Data System (ADS)
Wei, Shih-Chieh; Huang, Bormin
2012-10-01
The prediction-based lower triangular transform (PLT) features the same de-correlation and coding gain properties as the Karhunen-Loeve transform (KLT), but with a lower design and implementational cost. Unlike KLT, PLT has the perfect reconstruction property which allows its direct use for lossless compression. Our previous work has shown that PLT is good for lossless compression of ultraspectral sounder data with several thousands of channels. As the computation involves many operations on large matrices, this work will exploit the parallel compute power of graphics processing unit (GPU) to speed up the PLT encoding scheme. The CUDA (Compute Unified Device Architecture) platform by NVidia will be used for comparison with a single threaded CPU core. The experimental result reveals that our GPU implementation of the PLT encoding scheme shows a speedup of 95x compared to its original Matlab implementation on CPU. Thus it is promising to apply the GPU-based PLT encoding scheme for ultraspectral sounder data compression.
Accelerating Content-Based Image Retrieval via GPU-Adaptive Index Structure
2014-01-01
A tremendous amount of work has been conducted in content-based image retrieval (CBIR) on designing effective index structure to accelerate the retrieval process. Most of them improve the retrieval efficiency via complex index structures, and few take into account the parallel implementation of them on underlying hardware, making the existing index structures suffer from low-degree of parallelism. In this paper, a novel graphics processing unit (GPU) adaptive index structure, termed as plane semantic ball (PSB), is proposed to simultaneously reduce the work of retrieval process and exploit the parallel acceleration of underlying hardware. In PSB, semantics are embedded into the generation of representative pivots and multiple balls are selected to cover more informative reference features. With PSB, the online retrieval of CBIR is factorized into independent components that are implemented on GPU efficiently. Comparative experiments with GPU-based brute force approach demonstrate that the proposed approach can achieve high speedup with little information loss. Furthermore, PSB is compared with the state-of-the-art approach, random ball cover (RBC), on two standard image datasets, Corel 10 K and GIST 1 M. Experimental results show that our approach achieves higher speedup than RBC on the same accuracy level. PMID:24782668
GPU accelerated flow solver for direct numerical simulation of turbulent flows
NASA Astrophysics Data System (ADS)
Salvadore, Francesco; Bernardini, Matteo; Botti, Michela
2013-02-01
Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier-Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
Development of GPU-Optimized EFIT for DIII-D Equilibrium Reconstructions
NASA Astrophysics Data System (ADS)
Huang, Y.; Lao, L. L.; Xiao, B. J.; Luo, Z. P.; Yue, X. N.
2015-11-01
The development of a parallel, Graphical Processing Unit (GPU)-optimized version of EFIT for DIII-D equilibrium reconstructions is presented. This GPU-optimized version (P-EFIT) is built with the CUDA (Compute Unified Device Architecture) platform to take advantage of massively parallel GPU cores to significantly accelerate the computation under the EFIT framework. The parallel processing is implemented with the Single-Instruction Multiple-Thread (SIMT) architecture. New parallel modules to trace plasma surfaces and compute plasma parameters have been constructed. DIII-D magnetic benchmark tests show that P-EFIT could accurately reproduce the EFIT reconstruction algorithms at a fraction of the computational cost. The acceleration factor continues to increase as the (R, Z) spatial grids are increased from 65 × 65 to 257 × 257, suggesting there may be room for further optimization by further reducing the communication cost. Details of the P-EFIT optimization algorithms will be discussed. Work supported by US DOE DE-FC02-04ER54698, and by China MOST under 2014GB103000, China NNSF 11205191, China CAS GJHZ201303.
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation.
Mei, Gang; Tian, Hong
2016-01-01
This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First we redesign and improve our previous GPU implementation, which exploited CUDA dynamic parallelism (CDP). Then we implement three GPU versions, i.e., the naive version, the tiled version, and the improved CDP version, based upon five data layouts: the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. We also carry out several groups of experimental tests to evaluate the impact. Experimental results show that the layouts AoS and AoaS achieve better performance than the layout SoA for both the naive version and the tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that the two combined data layouts (the SoAoS and the Hybrid) bring no notable performance gains compared to the three basic layouts. We recommend that, in practical applications, the layout AoaS be used, since the tiled version is the fastest of the three versions. The source code of all implementations is publicly available. PMID:26877902
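The difference between the two basic layouts is easiest to see side by side. The sketch below evaluates the standard IDW estimate, a weighted average with weights 1/d^p, once over an Array of Structures and once over a Structure of Arrays. It is serial Python meant only to show the layouts and the formula; the function names and the default power of 2 are assumptions, and nothing here reproduces the authors' CUDA kernels or their timings.

```python
def idw_aos(points, qx, qy, power=2.0):
    """IDW over an Array of Structures: points is a list of (x, y, z) tuples."""
    num = den = 0.0
    for x, y, z in points:
        d2 = (qx - x) ** 2 + (qy - y) ** 2
        if d2 == 0.0:
            return z  # the query coincides with a data point
        w = d2 ** (-power / 2.0)  # equals 1 / d**power
        num += w * z
        den += w
    return num / den

def idw_soa(xs, ys, zs, qx, qy, power=2.0):
    """IDW over a Structure of Arrays: one list per coordinate."""
    num = den = 0.0
    for i in range(len(xs)):
        d2 = (qx - xs[i]) ** 2 + (qy - ys[i]) ** 2
        if d2 == 0.0:
            return zs[i]
        w = d2 ** (-power / 2.0)
        num += w * zs[i]
        den += w
    return num / den

# Two data points; the query lies halfway between them.
aos = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
xs, ys, zs = [0.0, 2.0], [0.0, 0.0], [1.0, 3.0]
value_aos = idw_aos(aos, 1.0, 0.0)
value_soa = idw_soa(xs, ys, zs, 1.0, 0.0)
```

Both functions compute the same value; only the memory layout of the input differs, which on a GPU determines how loads from neighbouring threads are grouped.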
GPU-based ray tracing algorithm for high-speed propagation prediction in typical indoor environments
NASA Astrophysics Data System (ADS)
Guo, Lixin; Guan, Xiaowei; Liu, Zhongyu
2015-10-01
A fast 3-D ray tracing propagation prediction model based on a virtual source tree is presented in this paper, whose theoretical foundations are geometrical optics (GO) and the uniform theory of diffraction (UTD). For a typical single-room indoor scene, taking the geometrical and electromagnetic information into account, some acceleration techniques are adopted to raise the efficiency of the ray tracing algorithm. The simulation results indicate that the runtime of the ray tracing algorithm increases sharply when the number of objects in the room is large. Therefore, GPU acceleration is used to solve that problem. The GPU is better suited to arithmetic than to branching logic, so tens of thousands of threads in CUDA programs can compute at the same time to achieve massively parallel acceleration. Finally, a typical single room with several objects is simulated using the serial ray tracing algorithm and the parallel one, respectively. The results show that the GPU-based algorithm achieves greater efficiency than the serial one.
Luo, Ruibang; Wong, Yiu-Lun; Law, Wai-Chun; Lee, Lap-Kei; Cheung, Jeanno; Liu, Chi-Man; Lam, Tak-Wah
2014-01-01
This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa. PMID:24949238
A mechanism to reduce energy waste in the post-execution of GPU applications
NASA Astrophysics Data System (ADS)
Carreño, Emmanuell D.; Sarates, Adiel S., Jr.; Navaux, Philippe O. A.
2015-10-01
With the increasing demand for general-purpose GPU accelerators in HPC, the impact of the energy consumption of these resources cannot be overlooked. Some strategies have been applied to reduce power consumption, but they have mostly focused on power savings during application execution. This work focuses on post-execution energy savings. When the post-execution behavior is analyzed in newer GPU cards, it is observed that the power draw does not return to the idle state in an efficient way, creating an unexpected power waste. To overcome this inefficient return to idle and the resulting power waste in the post-execution phase, we developed a strategy to reduce the energy consumption with minimal impact on global performance. Using this strategy, we achieved energy savings of up to 73 percent in the post-execution phase of a single run of a GPU application. In the case of sequential runs, the energy saving percentage depends on the waiting time between executions.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-21
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system. PMID:26040956
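The γ-index passing rates quoted above combine a dose-difference criterion and a distance-to-agreement criterion into one number per point. A minimal 1D sketch of that test follows. It assumes global normalization to the maximum reference dose and a brute-force search over all evaluated points; the abstract does not state these details, so they, along with the function names, are illustrative assumptions rather than the authors' procedure.

```python
import math

def gamma_index_1d(positions, reference, evaluated, dose_tol=0.03, dist_tol=3.0):
    """Global 1D gamma index for each reference point.

    dose_tol is a fraction of the maximum reference dose; dist_tol is in the
    same unit as positions (for example mm).
    """
    dose_crit = dose_tol * max(reference)
    gammas = []
    for r_pos, r_dose in zip(positions, reference):
        # Minimum combined distance/dose deviation over all evaluated points.
        best = min(
            math.sqrt(((e_pos - r_pos) / dist_tol) ** 2
                      + ((e_dose - r_dose) / dose_crit) ** 2)
            for e_pos, e_dose in zip(positions, evaluated)
        )
        gammas.append(best)
    return gammas

def passing_rate(gammas):
    """Percentage of points with gamma <= 1."""
    return 100.0 * sum(g <= 1.0 for g in gammas) / len(gammas)

# A flat profile that is 2% too high passes a 3%/3 mm test everywhere.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
gammas = gamma_index_1d(positions, [1.0] * 5, [1.02] * 5)
rate = passing_rate(gammas)
```

A point passes when some nearby evaluated point agrees within the combined tolerance, so steep dose gradients are judged mainly by distance and flat regions mainly by dose difference.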
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Lyakh, Dmitry I.
2015-01-05
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
GPU accelerated flow solver for direct numerical simulation of turbulent flows
Salvadore, Francesco; Botti, Michela
2013-02-15
Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
Yu, Fengchao; Liu, Huafeng; Hu, Zhenghui; Shi, Pengcheng
2012-04-01
As a consequence of the random nature of photon emissions and detections, the data collected by a positron emission tomography (PET) imaging system can be shown to be Poisson distributed. Meanwhile, there have been considerable efforts within the tracer kinetic modeling communities aimed at establishing the relationship between the PET data and physiological parameters that affect the uptake and metabolism of the tracer. Both statistical and physiological models are important to PET reconstruction. The majority of previous efforts are based on simplified, nonphysical mathematical expression, such as Poisson modeling of the measured data, which is, on the whole, completed without consideration of the underlying physiology. In this paper, we proposed a graphics processing unit (GPU)-accelerated reconstruction strategy that can take both statistical model and physiological model into consideration with the aid of state-space evolution equations. The proposed strategy formulates the organ activity distribution through tracer kinetics models and the photon-counting measurements through observation equations, thus making it possible to unify these two constraints into a general framework. In order to accelerate reconstruction, GPU-based parallel computing is introduced. Experiments of Zubal-thorax-phantom data, Monte Carlo simulated phantom data, and real phantom data show the power of the method. Furthermore, thanks to the computing power of the GPU, the reconstruction time is practical for clinical application. PMID:22472843
Tempest: GPU-CPU computing for high-throughput database spectral matching.
Milloy, Jeffrey A; Faherty, Brendan K; Gerber, Scott A
2012-07-01
Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphical processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and its comparison to experimental spectra, is conducted on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer. PMID:22640374
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing
Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.
2014-01-01
Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. As a benchmark we use a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
By reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for the H.264/AVC encoder based on a massively parallel architecture. We implemented the proposed framework with CUDA on NVIDIA's GPU. Not only are the compute intensive components of the H.264 encoder parallelized, but the control intensive components, such as CAVLC and the deblocking filter, are also realized effectively. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, a multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation outperforms the serial program with a speedup ratio of 20 and satisfies the requirement of real-time HD encoding at 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB when keeping the same bitrate. Through analysis of the kernels, we found that the speedup ratios of the compute intensive algorithms are proportional to the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is closely tied to the memory bandwidth, which gives an insight for new architecture design. PMID:24757432
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources
NASA Astrophysics Data System (ADS)
Townson, Reid W.; Jia, Xun; Tian, Zhen; Jiang Graves, Yan; Zavgorodni, Sergei; Jiang, Steve B.
2013-06-01
A novel phase-space source implementation has been designed for graphics processing unit (GPU)-based Monte Carlo dose calculation engines. Short of full simulation of the linac head, using a phase-space source is the most accurate method to model a clinical radiation beam in dose calculations. However, in GPU-based Monte Carlo dose calculations where the computation efficiency is very high, the time required to read and process a large phase-space file becomes comparable to the particle transport time. Moreover, due to the parallelized nature of GPU hardware, it is essential to simultaneously transport particles of the same type and similar energies but separated spatially to yield a high efficiency. We present three methods for phase-space implementation that have been integrated into the most recent version of the GPU-based Monte Carlo radiotherapy dose calculation package gDPM v3.0. The first method is to sequentially read particles from a patient-dependent phase-space and sort them on-the-fly based on particle type and energy. The second method supplements this with a simple secondary collimator model and fluence map implementation so that patient-independent phase-space sources can be used. Finally, as the third method (called the phase-space-let, or PSL, method) we introduce a novel source implementation utilizing pre-processed patient-independent phase-spaces that are sorted by particle type, energy and position. Position bins located outside a rectangular region of interest enclosing the treatment field are ignored, substantially decreasing simulation time with little effect on the final dose distribution. The three methods were validated in absolute dose against BEAMnrc/DOSXYZnrc and compared using gamma-index tests (2%/2 mm above the 10% isodose). It was found that the PSL method has the optimal balance between accuracy and efficiency and thus is used as the default method in gDPM v3.0. Using the PSL method, open fields of 4 × 4, 10 × 10 and 30 × 30 cm
A method of gravity and seismic sequential inversion and its GPU implementation
NASA Astrophysics Data System (ADS)
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical formula called the Gardner equation. For large volumes of data, we use the GPU to accelerate the computation. The gravity inversion is iterative: first we calculate the correlation imaging of the observed gravity anomaly, which gives a value between -1 and +1; we then multiply this value by a small density to obtain the initial density model. We compute a forward result from this initial model, calculate the correlation imaging of the misfit between the observed data and the forward data, multiply that correlation imaging result by a small density, and add it to the initial model. Repeating this procedure finally yields an inverted density model. For the seismic inversion, we use a method based on the linearity of the acoustic wave equation written in the frequency domain; with an initial velocity model, we can get a good velocity result. In the sequential inversion of gravity and seismic data, we need a link formula to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphics Processing Unit (GPU), as a co-processor to the CPU, has been developed for high-performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenge of using a traditional general-purpose GPU while maintaining a low learning curve for programmers familiar with standard programming languages such as C. In our inversion processing
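The Gardner equation named in this abstract is commonly written as rho = a * V^b, with a = 0.31 and b = 0.25 when V is the P-wave velocity in m/s and rho is the bulk density in g/cm^3. The abstract does not say which coefficients the authors used, so the defaults below are the textbook values and the function names are hypothetical. A minimal Python sketch of the density/velocity link used between the two inversion steps:

```python
def gardner_density(velocity_m_s, a=0.31, b=0.25):
    """Density in g/cm^3 from P-wave velocity in m/s (Gardner relation).

    The coefficients a and b are the commonly quoted defaults, assumed here;
    they are not taken from the abstract.
    """
    return a * velocity_m_s ** b


def gardner_velocity(density_g_cm3, a=0.31, b=0.25):
    """Inverse relation: P-wave velocity in m/s from density in g/cm^3."""
    return (density_g_cm3 / a) ** (1.0 / b)


if __name__ == "__main__":
    v = 3000.0                      # illustrative velocity, m/s
    rho = gardner_density(v)        # density passed to the gravity step
    v_back = gardner_velocity(rho)  # velocity recovered for the seismic step
    print(rho, v_back)
```

In a sequential scheme of this kind, the forward function would convert a velocity model into a starting density model (or the inverse function the other way round) each time the two inversions exchange results.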
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources.
Townson, Reid W; Jia, Xun; Tian, Zhen; Graves, Yan Jiang; Zavgorodni, Sergei; Jiang, Steve B
2013-06-21
A novel phase-space source implementation has been designed for graphics processing unit (GPU)-based Monte Carlo dose calculation engines. Short of full simulation of the linac head, using a phase-space source is the most accurate method to model a clinical radiation beam in dose calculations. However, in GPU-based Monte Carlo dose calculations where the computation efficiency is very high, the time required to read and process a large phase-space file becomes comparable to the particle transport time. Moreover, due to the parallelized nature of GPU hardware, it is essential to simultaneously transport particles of the same type and similar energies but separated spatially to yield a high efficiency. We present three methods for phase-space implementation that have been integrated into the most recent version of the GPU-based Monte Carlo radiotherapy dose calculation package gDPM v3.0. The first method is to sequentially read particles from a patient-dependent phase-space and sort them on-the-fly based on particle type and energy. The second method supplements this with a simple secondary collimator model and fluence map implementation so that patient-independent phase-space sources can be used. Finally, as the third method (called the phase-space-let, or PSL, method) we introduce a novel source implementation utilizing pre-processed patient-independent phase-spaces that are sorted by particle type, energy and position. Position bins located outside a rectangular region of interest enclosing the treatment field are ignored, substantially decreasing simulation time with little effect on the final dose distribution. The three methods were validated in absolute dose against BEAMnrc/DOSXYZnrc and compared using gamma-index tests (2%/2 mm above the 10% isodose). It was found that the PSL method has the optimal balance between accuracy and efficiency and thus is used as the default method in gDPM v3.0. Using the PSL method, open fields of 4 × 4, 10 × 10 and 30 × 30 cm
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
NASA Astrophysics Data System (ADS)
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement
Liu, T.; Du, X.; Ji, W.; Xu, X. G.
2013-07-01
This paper describes the development of a Graphics Processing Unit (GPU) accelerated Monte Carlo photon transport code, ARCHER_GPU, to perform CT imaging dose calculations with good accuracy and performance. The code simulates interactions of photons with heterogeneous materials. It contains a detailed CT scanner model and a family of patient phantoms. Several techniques are used to optimize the code for the GPU architecture. In the accuracy and performance test, a 142 kg adult male phantom was selected, and the CT scan protocol involved a whole-body axial scan, 20-mm x-ray beam collimation, 120 kVp and a pitch of 1. A total of 9 × 10^8 photons were simulated and the absorbed doses to 28 radiosensitive organs/tissues were calculated. The average percentage difference of the results obtained by the general-purpose production code MCNPX and ARCHER_GPU was found to be less than 0.38%, indicating an excellent agreement. The total computation time was found to be 8,689, 139 and 56 minutes for MCNPX, ARCHER_CPU (6-core) and ARCHER_GPU, respectively, indicating a decent speedup. Under a recent grant funding from the NIH, the project aims at developing a Monte Carlo code with the capability of sub-minute CT organ dose calculations. (authors)
NASA Astrophysics Data System (ADS)
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove many redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain a speedup of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.
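In its simplest scalar form, each directional sweep of an ADI scheme reduces to many independent tridiagonal systems, one per grid line, which is what makes a "one-thread-one-line" mapping natural. The abstract does not describe the solver's internals, so the following Python sketch only illustrates the work a single thread might do on one line, using the standard Thomas algorithm as an assumed stand-in; the function name and array layout are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    All four sequences have length n. lower[0] and upper[n-1] are ignored.
    Assumes the system is diagonally dominant (no pivoting is performed).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination along the line.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # One grid line with the matrix tridiag(-1, 2, -1).
    n = 4
    print(thomas_solve([-1.0] * n, [2.0] * n, [-1.0] * n, [0.0, 0.0, 0.0, 5.0]))
```

The loop over `i` is inherently sequential, so parallelism in this mapping comes from solving many lines at once. A "one-thread-one-point" scheme would instead need a parallel tridiagonal method, which this sketch does not cover.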
NASA Astrophysics Data System (ADS)
McClure, J. E.; Prins, J. F.; Miller, C. T.
2014-07-01
Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular “color” LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6× as compared to multi-core CPU solution and 1.8× compared to GPU solution due to concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.
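The two lattices named in this abstract differ only in which neighbour directions they include: D3Q7 uses the rest vector plus the six axis neighbours, and D3Q19 adds the twelve edge-diagonal neighbours. The Python sketch below enumerates both velocity sets; the helper name is hypothetical and the ordering of directions is arbitrary, not the ordering used by the authors.

```python
from itertools import product


def lattice_velocities(max_squared_norm):
    """Integer 3D velocity vectors with components in {-1, 0, 1}
    whose squared length does not exceed max_squared_norm."""
    return [
        c
        for c in product((-1, 0, 1), repeat=3)
        if sum(x * x for x in c) <= max_squared_norm
    ]


# Rest vector + 6 axis directions (used for mass transport in the abstract).
D3Q7 = lattice_velocities(1)
# D3Q7 + 12 edge diagonals (used for momentum transport in the abstract).
D3Q19 = lattice_velocities(2)

if __name__ == "__main__":
    print(len(D3Q7), len(D3Q19))
```

Because the D3Q7 set carries fewer distributions per lattice site, the mass transport step moves less data per site than the momentum step, which is consistent with the abstract's choice to assign it to the CPU cores while the GPU handles the D3Q19 part.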
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Jelen, Urszula; Li, Jinsheng; Jia, Xun; Jiang, Steve B.
2011-06-01
Targeting at the development of an accurate and efficient dose calculation engine for online adaptive radiotherapy, we have implemented a finite-size pencil beam (FSPB) algorithm with a 3D-density correction method on graphics processing unit (GPU). This new GPU-based dose engine is built on our previously published ultrafast FSPB computational framework (Gu et al 2009 Phys. Med. Biol. 54 6287-97). Dosimetric evaluations against Monte Carlo dose calculations are conducted on ten IMRT treatment plans (five head-and-neck cases and five lung cases). For all cases, there is improvement with the 3D-density correction over the conventional FSPB algorithm and for most cases the improvement is significant. Regarding the efficiency, because of the appropriate arrangement of memory access and the usage of GPU intrinsic functions, the dose calculation for an IMRT plan can be accomplished well within 1 s (except for one case) with this new GPU-based FSPB algorithm. Compared to the previous GPU-based FSPB algorithm without 3D-density correction, this new algorithm, though slightly sacrificing the computational efficiency (~5-15% lower), has significantly improved the dose calculation accuracy, making it more suitable for online IMRT replanning.
Gu, Xuejun; Jelen, Urszula; Li, Jinsheng; Jia, Xun; Jiang, Steve B
2011-06-01
Targeting at the development of an accurate and efficient dose calculation engine for online adaptive radiotherapy, we have implemented a finite-size pencil beam (FSPB) algorithm with a 3D-density correction method on graphics processing unit (GPU). This new GPU-based dose engine is built on our previously published ultrafast FSPB computational framework (Gu et al 2009 Phys. Med. Biol. 54 6287-97). Dosimetric evaluations against Monte Carlo dose calculations are conducted on ten IMRT treatment plans (five head-and-neck cases and five lung cases). For all cases, there is improvement with the 3D-density correction over the conventional FSPB algorithm and for most cases the improvement is significant. Regarding the efficiency, because of the appropriate arrangement of memory access and the usage of GPU intrinsic functions, the dose calculation for an IMRT plan can be accomplished well within 1 s (except for one case) with this new GPU-based FSPB algorithm. Compared to the previous GPU-based FSPB algorithm without 3D-density correction, this new algorithm, though slightly sacrificing the computational efficiency (∼5-15% lower), has significantly improved the dose calculation accuracy, making it more suitable for online IMRT replanning. PMID:21558589
NASA Astrophysics Data System (ADS)
Archibald, R.; Evans, K. J.; Worley, P.; Norman, M. R.; Lott, A.; Salinger, A.; Woodward, C. S.
2014-12-01
The recent focus on regional refinement in the Community Atmosphere Model (CAM5) has created a strong need to develop time-stepping methods capable of accelerating throughput on high performance computing for climate dynamics across multiple spatial and temporal scales. This research is focused on developing implicit methods that can be executed at scale on GPU-based machines. Efforts to port the scalable spectral element dynamical core to incorporate these developments are presented, including both 2D and 3D benchmark test case results. The current implicit solver and preconditioner implementations utilize a Fortran interface package within the Trilinos project, third-party software that allows fully tested, optimized, and robust code with a suite of parameter options to be included a priori. Merging this coding strategy with GPU libraries will be discussed, along with beneficial optimization gains and the data structure requirements for evaluating Trilinos-bound residual calculations on GPU processors.
Performance analysis of GPU-accelerated filter-based source finding for HI spectral line image data
NASA Astrophysics Data System (ADS)
Westerlund, Stefan; Harris, Christopher
2015-03-01
Searching for sources of electromagnetic emission in spectral-line radio astronomy interferometric data is a computationally intensive process. Parallel programming techniques and High Performance Computing hardware may be used to improve the computational performance of a source finding program. However, it is desirable to further reduce the processing time of source finding in order to decrease the computational resources required for the task. GPU acceleration is a method that may achieve significant increases in performance for some source finding algorithms, particularly for filtering image data. This work considers the application of GPU acceleration to the task of source finding and the techniques used to achieve the best performance, such as memory management. We also examine the changes in performance, where the algorithms that were GPU accelerated achieved a speedup of around 3.2 times the 12 core per node CPU-only performance, while the program as a whole experienced a speedup of 2.0 times.
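The two speedups quoted in this abstract (about 3.2 for the accelerated algorithms and 2.0 for the whole program) can be related through Amdahl's law. This is a back-of-the-envelope reading, not a figure reported by the authors: it assumes the non-accelerated part of the run time is unchanged and ignores transfer overheads. The function names are hypothetical.

```python
def accelerated_fraction(kernel_speedup, overall_speedup):
    """Fraction p of the original run time spent in the accelerated part,
    from Amdahl's law: overall = 1 / ((1 - p) + p / kernel)."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law in the forward direction."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


if __name__ == "__main__":
    p = accelerated_fraction(kernel_speedup=3.2, overall_speedup=2.0)
    print(p)                         # share of run time that was accelerated
    print(overall_speedup(p, 3.2))   # recovers the whole-program speedup
    print(1.0 / (1.0 - p))           # upper bound if the kernels took no time
```

Under these assumptions roughly 73% of the CPU-only run time was in the filtering code that was moved to the GPU, and the remaining serial share caps the whole-program speedup at about 3.7 no matter how fast the kernels become.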
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
NASA Astrophysics Data System (ADS)
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
Neylon, J.; Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.
2014-10-15
Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria
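The gamma comparison with 2% dose difference and 2 mm distance-to-agreement criteria mentioned in this abstract can be illustrated in one dimension. The Python sketch below is a simplified version under stated assumptions: global normalisation to the maximum reference dose, a brute-force search over sample points with no interpolation, and hypothetical function names. It is not the authors' implementation, which operates on 3D dose grids.

```python
import math


def gamma_index_1d(positions_mm, dose_ref, dose_eval,
                   dose_tol=0.02, dist_tol_mm=2.0):
    """Gamma value at every reference point of a 1D dose profile.

    dose_tol is a fraction of the maximum reference dose (global
    normalisation, an assumption); dist_tol_mm is the
    distance-to-agreement criterion.
    """
    dose_norm = dose_tol * max(dose_ref)
    gammas = []
    for x_r, d_r in zip(positions_mm, dose_ref):
        # Smallest combined dose/distance deviation over all evaluated points.
        best = min(
            math.hypot((x_e - x_r) / dist_tol_mm, (d_e - d_r) / dose_norm)
            for x_e, d_e in zip(positions_mm, dose_eval)
        )
        gammas.append(best)
    return gammas


def passing_rate(gammas):
    """Fraction of points with gamma <= 1."""
    return sum(1 for g in gammas if g <= 1.0) / len(gammas)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    reference = [1.0, 1.0, 1.0, 1.0]
    evaluated = [1.01, 1.01, 1.01, 1.01]  # uniform 1% overdose
    g = gamma_index_1d(x, reference, evaluated)
    print(g, passing_rate(g))
```

A point passes when its gamma value is at most 1, so a uniform 1% dose offset under a 2% criterion gives gamma values near 0.5 and a full passing rate.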
Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.
2012-07-01
The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management, were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of ≈2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
TH-A-19A-09: Towards Sub-Second Proton Dose Calculation On GPU
Silva, J da
2014-06-15
Purpose: To achieve sub-second dose calculation for clinically relevant proton therapy treatment plans. Rapid dose calculation is a key component of adaptive radiotherapy, necessary to take advantage of the better dose conformity offered by hadron therapy. Methods: To speed up proton dose calculation, the pencil beam algorithm (PBA; clinical standard) was parallelised and implemented to run on a graphics processing unit (GPU). The implementation constitutes the first PBA to run all steps on GPU, and each part of the algorithm was carefully adapted for efficiency. Monte Carlo (MC) simulations obtained using Fluka of individual beams of energies representative of the clinical range impinging on simple geometries were used to tune the PBA. For benchmarking, a typical skull base case with a spot scanning plan consisting of a total of 8872 spots divided between two beam directions of 49 energy layers each was provided by CNAO (Pavia, Italy). The calculations were carried out on an Nvidia Geforce GTX680 desktop GPU with 1536 cores running at 1006 MHz. Results: The PBA reproduced within ±3% of maximum dose results obtained from MC simulations for a range of pencil beams impinging on a water tank. Additional analysis of more complex slab geometries is currently under way to fine-tune the algorithm. Full calculation of the clinical test case took 0.9 seconds in total, with the majority of the time spent in the kernel superposition step. Conclusion: The PBA lends itself well to implementation on many-core systems such as GPUs. Using the presented implementation and current hardware, sub-second dose calculation for a clinical proton therapy plan was achieved, opening the door for adaptive treatment. The successful parallelisation of all steps of the calculation indicates that further speedups can be expected with new hardware, brightening the prospects for real-time dose calculation. This work was funded by ENTERVISION, European Commission FP7 grant 264552.
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications
NASA Astrophysics Data System (ADS)
Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris
2015-07-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled 125I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms per 10^6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
fMRI analysis on the GPU-possibilities and challenges.
Eklund, Anders; Andersson, Mats; Knutsson, Hans
2012-02-01
Functional magnetic resonance imaging (fMRI) makes it possible to non-invasively measure brain activity with high spatial resolution. There are however a number of issues that have to be addressed. One is the large amount of spatio-temporal data that needs to be processed. In addition to the statistical analysis itself, several preprocessing steps, such as slice timing correction and motion compensation, are normally applied. The high computational power of modern graphic cards has already successfully been used for MRI and fMRI. Going beyond the first published demonstration of GPU-based analysis of fMRI data, all the preprocessing steps and two statistical approaches, the general linear model (GLM) and canonical correlation analysis (CCA), have been implemented on a GPU. For an fMRI dataset of typical size (80 volumes with 64 × 64 × 22 voxels), all the preprocessing takes about 0.5 s on the GPU, compared to 5 s with an optimized CPU implementation and 120 s with the commonly used statistical parametric mapping (SPM) software. A random permutation test with 10,000 permutations, with smoothing in each permutation, takes about 50 s if three GPUs are used, compared to 0.5-2.5 h with an optimized CPU implementation. The presented work will save time for researchers and clinicians in their daily work and enables the use of more advanced analysis, such as non-parametric statistics, both for conventional fMRI and for real-time fMRI. PMID:21862169
GPU-based fast Monte Carlo dose calculation for proton therapy
Jia, Xun; Schümann, Jan; Paganetti, Harald; Jiang, Steve B
2015-01-01
Accurate radiation dose calculation is essential for successful proton radiotherapy. Monte Carlo (MC) simulation is considered to be the most accurate method. However, the long computation time limits it from routine clinical applications. Recently, graphics processing units (GPUs) have been widely used to accelerate computationally intensive tasks in radiotherapy. We have developed a fast MC dose calculation package, gPMC, for proton dose calculation on a GPU. In gPMC, proton transport is modeled by the class II condensed history simulation scheme with a continuous slowing down approximation. Ionization, elastic and inelastic proton nucleus interactions are considered. Energy straggling and multiple scattering are modeled. Secondary electrons are not transported and their energies are locally deposited. After an inelastic nuclear interaction event, a variety of products are generated using an empirical model. Among them, charged nuclear fragments are terminated with energy locally deposited. Secondary protons are stored in a stack and transported after finishing transport of the primary protons, while secondary neutral particles are neglected. gPMC is implemented on the GPU under the CUDA platform. We have validated gPMC using the TOPAS/Geant4 MC code as the gold standard. For various cases including homogeneous and inhomogeneous phantoms as well as a patient case, good agreements between gPMC and TOPAS/Geant4 are observed. The gamma passing rate for the 2%/2 mm criterion is over 98.7% in the region with dose greater than 10% maximum dose in all cases, excluding low-density air regions. With gPMC it takes only 6–22 s to simulate 10 million source protons to achieve ~1% relative statistical uncertainty, depending on the phantoms and energy. This is an extremely high efficiency compared to the computational time of tens of CPU hours for TOPAS/Geant4. Our fast GPU-based code can thus facilitate the routine use of MC dose calculation in proton therapy. PMID:23128424
Real-time optical flow estimation on a GPU for a skid-steered mobile robot
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2016-04-01
Accurate egomotion estimation is required for mobile robot navigation. Often the egomotion is estimated using optical flow algorithms. For an accurate estimation of optical flow, most modern algorithms require high memory resources and processor speed. However, the simple single-board computers that control the motion of the robot usually do not provide such resources. On the other hand, most modern single-board computers are equipped with an embedded GPU that could be used in parallel with a CPU to improve the performance of the optical flow estimation algorithm. This paper presents a new Z-flow algorithm for efficient computation of optical flow using an embedded GPU. The algorithm is based on phase correlation optical flow estimation and provides real-time performance on a low-cost embedded GPU. A layered optical flow model is used. Layer segmentation is performed using a graph-cut algorithm with a time-derivative-based energy function. This approach makes the algorithm both fast and robust in low-light and low-texture conditions. The algorithm implementation for a Raspberry Pi Model B computer is discussed. For evaluation of the algorithm, the computer was mounted on a Hercules skid-steered mobile robot equipped with a monocular camera. The evaluation was performed using a hardware-in-the-loop simulation and experiments with the Hercules mobile robot. The algorithm was also evaluated using the KITTI Optical Flow 2015 dataset. The resulting endpoint error of the optical flow calculated with the developed algorithm was low enough for navigation of the robot along the desired trajectory.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.
Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut
2015-10-01
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. PMID:26238484
GPU-based fast Monte Carlo dose calculation for proton therapy.
Jia, Xun; Schümann, Jan; Paganetti, Harald; Jiang, Steve B
2012-12-01
Accurate radiation dose calculation is essential for successful proton radiotherapy. Monte Carlo (MC) simulation is considered to be the most accurate method. However, the long computation time prevents its routine clinical use. Recently, graphics processing units (GPUs) have been widely used to accelerate computationally intensive tasks in radiotherapy. We have developed a fast MC dose calculation package, gPMC, for proton dose calculation on a GPU. In gPMC, proton transport is modeled by the class II condensed history simulation scheme with a continuous slowing down approximation. Ionization, elastic and inelastic proton-nucleus interactions are considered. Energy straggling and multiple scattering are modeled. Secondary electrons are not transported and their energies are locally deposited. After an inelastic nuclear interaction event, a variety of products are generated using an empirical model. Among them, charged nuclear fragments are terminated with energy locally deposited. Secondary protons are stored in a stack and transported after finishing transport of the primary protons, while secondary neutral particles are neglected. gPMC is implemented on the GPU under the CUDA platform. We have validated gPMC using the TOPAS/Geant4 MC code as the gold standard. For various cases including homogeneous and inhomogeneous phantoms as well as a patient case, good agreement between gPMC and TOPAS/Geant4 is observed. The gamma passing rate for the 2%/2 mm criterion is over 98.7% in the region with dose greater than 10% of the maximum dose in all cases, excluding low-density air regions. With gPMC it takes only 6-22 s to simulate 10 million source protons to achieve ∼1% relative statistical uncertainty, depending on the phantoms and energy. This is an extremely high efficiency compared to the computational time of tens of CPU hours for TOPAS/Geant4. Our fast GPU-based code can thus facilitate the routine use of MC dose calculation in proton therapy. PMID
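The gamma passing rate quoted in the abstract above can be illustrated with a minimal one-dimensional sketch. It assumes a global dose criterion (2% of the maximum reference dose), a 2 mm distance-to-agreement, and a 10% low-dose cut-off; the function name `gamma_pass_rate` and all parameter names are hypothetical and are not taken from gPMC or TOPAS.

```python
import math

def gamma_pass_rate(ref, ev, spacing_mm, dose_tol=0.02, dta_mm=2.0, low_dose_cut=0.10):
    """Fraction of reference points with gamma <= 1 (1D, global normalisation).

    ref, ev    : dose profiles sampled on the same grid (hypothetical inputs)
    spacing_mm : grid spacing in millimetres
    """
    d_max = max(ref)
    dose_crit = dose_tol * d_max          # e.g. 2% of the maximum reference dose
    passed = evaluated = 0
    for i, d_ref in enumerate(ref):
        if d_ref < low_dose_cut * d_max:  # skip the low-dose region
            continue
        best = math.inf
        for j, d_ev in enumerate(ev):     # search every evaluated point
            dist = (j - i) * spacing_mm
            g2 = (dist / dta_mm) ** 2 + ((d_ev - d_ref) / dose_crit) ** 2
            best = min(best, g2)
        evaluated += 1
        if math.sqrt(best) <= 1.0:
            passed += 1
    return passed / evaluated if evaluated else float("nan")
```

The exhaustive search over all evaluated points is quadratic in the profile length; it is kept here only because it makes the definition easy to read.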
A new morphological anomaly detection algorithm for hyperspectral images and its GPU implementation
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2011-10-01
Anomaly detection is considered a very important task for hyperspectral data exploitation. It is now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we develop a new morphological algorithm for anomaly detection in hyperspectral images along with an efficient GPU implementation of the algorithm. The algorithm is implemented on latest-generation GPU architectures, and evaluated with regards to other anomaly detection algorithms using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex. The proposed GPU implementation achieves real-time performance in the considered case study.
A GPU-Based Visualization Method for Computing Dark Matter Annihilation Signal
NASA Astrophysics Data System (ADS)
Yang, L.; Szalay, A.
2013-10-01
We present a novel GPU-based visualization method for computing the dark matter annihilation signal for cosmological dark matter simulations. The technique increased the speed of rendering by more than 1,000 times. In a previous study, using a code running on regular CPUs, each particle's contribution was explicitly calculated pixel by pixel over a HEALPIX map, then remapped onto a Mollweide projection. Using the Via Lactea II simulation (~400M particles), it takes over 7 hours for a single-thread CPU (~3 GHz) to complete an all-sky map with NSIDE=512 resolution. Our novel method is based on a separate stereographic projection for each hemisphere, and a hardware accelerated rendering pipeline on a GPU (OpenGL). We project the particles instead of the celestial sphere to the tangent plane with a skewed flux profile appropriate for the STR projection. OpenGL's Point Sprite feature and shader language allow us to render those eccentric circular flux profiles at the rate of more than 10M particles per second. The new method can process a single snapshot of the Via Lactea II data in less than 1 minute with a single NVIDIA GTX 480 GPU, including I/O, with effective rendering time less than 24 seconds. Using an approximate normalization for the flux, accurate to 2.5% in total flux, the rendering can be done in less than 13 seconds. The stereographic images corresponding to the two hemispheres are then warped to an all-sky image in the Mollweide projection, and are in good agreement with the result from the regular CPU code, at similar resolution.
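As a rough illustration of the per-hemisphere stereographic projection mentioned above, the following sketch maps a particle direction onto the plane tangent at the pole of its hemisphere. It assumes a unit sphere and projection from the opposite pole; the function name `stereographic` is hypothetical, and the skewed flux profile and OpenGL rendering described in the abstract are not modelled.

```python
import math

def stereographic(x, y, z):
    """Project a direction onto the plane tangent at the pole of its hemisphere.

    Returns (pole_sign, X, Y): pole_sign is +1 for the northern hemisphere
    (z >= 0) and -1 for the southern one; (X, Y) are tangent-plane coordinates.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    pole_sign = 1 if z >= 0 else -1
    # Projecting from the opposite pole onto the plane tangent at this pole
    # scales the in-plane coordinates by 2 / (1 + |z|).
    scale = 2.0 / (1.0 + pole_sign * z)
    return pole_sign, scale * x, scale * y
```

With this convention the pole maps to the origin and the equator maps to a circle of radius 2, so each hemisphere fills a finite disc.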
Fully 3D iterative scatter-corrected OSEM for HRRT PET using a GPU.
Kim, Kyung Sang; Ye, Jong Chul
2011-08-01
Accurate scatter correction is especially important for high-resolution 3D positron emission tomographies (PETs) such as high-resolution research tomograph (HRRT) due to large scatter fraction in the data. To address this problem, a fully 3D iterative scatter-corrected ordered subset expectation maximization (OSEM) in which a 3D single scatter simulation (SSS) is alternately performed with a 3D OSEM reconstruction was recently proposed. However, due to the computational complexity of both SSS and OSEM algorithms for a high-resolution 3D PET, it has not been widely used in practice. The main objective of this paper is, therefore, to accelerate the fully 3D iterative scatter-corrected OSEM using a graphics processing unit (GPU) and verify its performance for an HRRT. We show that to exploit the massive thread structures of the GPU, several algorithmic modifications are necessary. For SSS implementation, a sinogram-driven approach is found to be more appropriate compared to a detector-driven approach, as fast linear interpolation can be performed in the sinogram domain through the use of texture memory. Furthermore, a pixel-driven backprojector and a ray-driven projector can be significantly accelerated by assigning threads to voxels and sinograms, respectively. Using Nvidia's GPU and compute unified device architecture (CUDA), the execution time of an SSS is less than 6 s, a single iteration of OSEM with 16 subsets takes 16 s, and a single iteration of the fully 3D scatter-corrected OSEM composed of an SSS and six iterations of OSEM takes under 105 s for the HRRT geometry, which corresponds to acceleration factors of 125× and 141× for OSEM and SSS, respectively. The fully 3D iterative scatter-corrected OSEM algorithm is validated in simulations using Geant4 application for tomographic emission and in actual experiments using an HRRT. PMID:21772080
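For readers unfamiliar with OSEM, the basic update that the abstract above accelerates can be sketched on a tiny dense system. This is a textbook form without scatter correction, not the authors' GPU code; the function name `osem`, the dense `system` matrix, and the interleaved subset choice are assumptions made for illustration.

```python
def osem(system, counts, n_subsets, n_iter, eps=1e-12):
    """Ordered-subset EM for counts ~ system @ image (plain Python lists).

    system : rows are projection bins, columns are voxels
    counts : measured value per projection bin
    """
    n_rows, n_vox = len(system), len(system[0])
    image = [1.0] * n_vox
    # Interleaved subsets: bins s, s + n_subsets, s + 2 * n_subsets, ...
    subsets = [list(range(s, n_rows, n_subsets)) for s in range(n_subsets)]
    for _ in range(n_iter):
        for rows in subsets:
            # Sensitivity of each voxel within this subset.
            sens = [sum(system[i][j] for i in rows) for j in range(n_vox)]
            back = [0.0] * n_vox
            for i in rows:
                # Forward projection of the current estimate for bin i.
                proj = sum(system[i][j] * image[j] for j in range(n_vox))
                ratio = counts[i] / proj if proj > eps else 0.0
                # Backprojection of the measured/estimated ratio.
                for j in range(n_vox):
                    back[j] += system[i][j] * ratio
            image = [
                image[j] * back[j] / sens[j] if sens[j] > eps else image[j]
                for j in range(n_vox)
            ]
    return image
```

The forward projection and backprojection loops are the parts that map naturally onto one thread per sinogram bin and one thread per voxel, respectively.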
Discrete shearlet transform on GPU with applications in anomaly detection and denoising
NASA Astrophysics Data System (ADS)
Gibert, Xavier; Patel, Vishal M.; Labate, Demetrio; Chellappa, Rama
2014-12-01
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this was exploited in a wide range of applications from image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications.
Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris
2015-07-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled I-125 seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms 10^-6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications. PMID:26061230
NASA Astrophysics Data System (ADS)
Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.
2015-09-01
We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ² values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ² calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ² values. Modified model parameters are transferred to the GPU between each iteration. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: On an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components it is 7.7 and 12 times faster on the same K40 for single and double-precision floating point respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that it is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 caches.
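The χ² step described above reduces to a sum of independent per-visibility terms, which is why it parallelises well. The sketch below is a plain-Python illustration assuming independent Gaussian noise with a known standard deviation per visibility; the function names `chi_squared` and `best_model` are hypothetical and do not correspond to Montblanc's API.

```python
def chi_squared(observed, model, sigma):
    """Sum of |V_obs - V_model|^2 / sigma^2 over complex visibilities."""
    return sum(abs(o - m) ** 2 / s ** 2 for o, m, s in zip(observed, model, sigma))

def best_model(observed, candidates, sigma):
    """Index of the candidate model visibility set with the lowest chi-squared.

    candidates : list of model visibility lists, one per trial sky model
    """
    scores = [chi_squared(observed, model, sigma) for model in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

Under the Gaussian-noise assumption the log-likelihood is -χ²/2 up to an additive constant, so ranking by χ² and ranking by likelihood agree.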
MOIL-opt: Energy-Conserving Molecular Dynamics on a GPU/CPU system.
Ruymgaart, A Peter; Cardenas, Alfredo E; Elber, Ron
2011-08-26
We report an optimized version of the molecular dynamics program MOIL that runs on a shared memory system with OpenMP and exploits the power of a Graphics Processing Unit (GPU). The model is that of a heterogeneous computing system on a single node with several cores sharing the same memory and a GPU. This is a typical laboratory tool, which provides excellent performance at minimal cost. Besides performance, emphasis is placed on accuracy and stability of the algorithm, probed by energy conservation for explicit-solvent atomically-detailed models. Especially for long simulations, energy conservation is critical due to the phenomenon known as "energy drift", in which energy errors accumulate linearly as a function of simulation time. To achieve long time dynamics with acceptable accuracy the drift must be particularly small. We identify several means of controlling long-time numerical accuracy while maintaining excellent speedup. To maintain a high level of energy conservation, SHAKE and the Ewald reciprocal summation are run in double precision. Double precision summation of real-space non-bonded interactions improves energy conservation. In our best option, the energy drift using 1 fs for a time step while constraining the distances of all bonds is undetectable in a 10 ns simulation of solvated DHFR (dihydrofolate reductase). Faster options, shaking only bonds with hydrogen atoms, are also very well behaved and have drifts of less than 1 kcal/mol per nanosecond for the same system. CPU/GPU implementations require changes in programming models. We consider the use of a list of neighbors and quadratic versus linear interpolation in lookup tables of different sizes. Quadratic interpolation with a smaller number of grid points is faster than linear lookup tables (with finer representation) without loss of accuracy. Atomic neighbor lists were found most efficient. Typical speedups are about a factor of 10 compared to a single-core single-precision code. PMID:22328867
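Because the abstract above describes energy drift as a linear accumulation of error with simulation time, one simple way to quantify it is the slope of a least-squares line through the total energy. The sketch below assumes that definition; the function name `energy_drift` and the units (ns, kcal/mol) are illustrative choices rather than MOIL conventions.

```python
def energy_drift(times_ns, energies):
    """Least-squares slope of total energy versus time (energy units per ns)."""
    n = len(times_ns)
    if n < 2 or n != len(energies):
        raise ValueError("need at least two matching (time, energy) samples")
    t_mean = sum(times_ns) / n
    e_mean = sum(energies) / n
    sxx = sum((t - t_mean) ** 2 for t in times_ns)
    if sxx == 0.0:
        raise ValueError("time samples must not all be identical")
    sxy = sum((t - t_mean) * (e - e_mean) for t, e in zip(times_ns, energies))
    return sxy / sxx
```

A fitted slope is less sensitive to short-time energy fluctuations than a simple difference between the first and last samples.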
NASA Astrophysics Data System (ADS)
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing the use of concepts and paradigms, introduced by continuously evolving approaches in information and communications technology (ICT), has to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype to open up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU accelerated tsunami simulation computations has been an integral part of this prototype to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the
GPU-accelerated 3D Bayesian image reconstruction from Compton scattered data
NASA Astrophysics Data System (ADS)
Nguyen, Van-Giang; Lee, Soo-Jin; Lee, Mi No
2011-05-01
This paper describes the development of fast Bayesian reconstruction methods for Compton cameras using commodity graphics hardware. For fast iterative reconstruction, not only is it important to increase the convergence rate, but also it is equally important to accelerate the computation of time-consuming and repeated operations, such as projection and backprojection. Since the size of the system matrix for a typical Compton camera is intractably large, it is impractical to use a conventional caching scheme that stores the pre-calculated elements of a system matrix and uses them for the calculation of projection and backprojection. In this paper we propose GPU (graphics processing unit)-accelerated methods that can rapidly perform conical projection and backprojection on the fly. Since the conventional ray-based backprojection method is inefficient for parallel computing on GPUs, we develop voxel-based conical backprojection methods using two different approximation schemes. In the first scheme, we approximate the intersecting chord length of the ray passing through a voxel by the perpendicular distance from the center to the ray. In the second scheme, each voxel is regarded as a dimensionless point rather than a cube so that the backprojection can be performed without the need for calculating intersecting chord lengths or their approximations. Our simulation studies show that the GPU-based method dramatically improves the computational speed with only minor loss of accuracy in reconstruction. With the development of high-resolution detectors, the difference in the reconstruction accuracy between the GPU-based method and the CPU-based method will eventually be negligible.
Visualization of large medical data sets using memory-optimized CPU and GPU algorithms
NASA Astrophysics Data System (ADS)
Kiefer, Gundolf; Lehmann, Helko; Weese, Juergen
2005-04-01
With the evolution of medical scanners towards higher spatial resolutions, the sizes of image data sets are increasing rapidly. To profit from the higher resolution in medical applications such as 3D-angiography for a more efficient and precise diagnosis, high-performance visualization is essential. However, to make sure that the performance of a volume rendering algorithm scales with the performance of future computer architectures, technology trends need to be considered. The design of such scalable volume rendering algorithms remains challenging. One of the major trends in the development of computer architectures is the wider use of cache memory hierarchies to bridge the growing gap between the faster evolving processing power and the slower evolving memory access speed. In this paper we propose ways to exploit the standard PC's cache memories supporting the main processors (CPUs) and the graphics hardware (graphics processing unit, GPU), respectively, for computing Maximum Intensity Projections (MIPs). To this end, we describe a generic and flexible way to improve the cache efficiency of software ray casting algorithms and show, by means of cache simulations, that it enables cache miss rates close to the theoretical optimum. For GPU-based rendering we propose a similar, brick-based technique to optimize the utilization of onboard caches and the transfer of data to the GPU on-board memory. All algorithms produce images of identical quality, which enables us to compare the performance of their implementations in a fair way without eventually trading quality for speed. Our comparison indicates that the proposed methods perform better, in particular for large data sets.
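To make the brick-based idea above concrete, the following sketch computes an axis-aligned Maximum Intensity Projection while visiting the volume brick by brick. It is only an illustration of the traversal order under simplifying assumptions (projection along z, nested Python lists, hypothetical function name `mip_bricked`); it does not reproduce the paper's ray casting, and pure Python will not show the cache benefit.

```python
def mip_bricked(volume, brick=2):
    """Maximum Intensity Projection along z, traversing the volume in bricks.

    volume : nested lists indexed as volume[z][y][x]
    brick  : edge length of the cubic bricks used for traversal
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    image = [[float("-inf")] * width for _ in range(height)]
    for z0 in range(0, depth, brick):
        for y0 in range(0, height, brick):
            for x0 in range(0, width, brick):
                # All voxels of one brick are processed before moving on, so a
                # compiled implementation touches a compact block of memory.
                for z in range(z0, min(z0 + brick, depth)):
                    for y in range(y0, min(y0 + brick, height)):
                        row, out = volume[z][y], image[y]
                        for x in range(x0, min(x0 + brick, width)):
                            if row[x] > out[x]:
                                out[x] = row[x]
    return image
```

Since the maximum is order-independent, any brick size yields the same image, which mirrors the abstract's point that the compared algorithms produce identical output.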
NASA Astrophysics Data System (ADS)
Peng, Z.; Meng, X.; Hong, B.; Yu, X.
2012-12-01
Large shallow earthquakes are generally followed by increased seismic activity around the mainshock rupture zone, known as "aftershocks". Whether static or dynamic triggering is responsible for triggering aftershocks is still under debate. However, aftershocks listed in standard earthquake catalogs are generally incomplete immediately after the mainshock, which may result in inaccurate estimation of seismic rate changes. Recent studies have used waveforms of existing earthquakes as templates to scan through continuous waveforms to detect potential missing aftershocks, which is termed the 'matched filter technique'. However, this kind of data mining is computationally intensive, which raises new challenges when applying it to large data sets with tens of thousands of templates, hundreds of seismic stations and years of continuous waveforms. The waveform matched filter technique exhibits parallelism at multiple levels, which allows us to use GPU-based computation to achieve significant acceleration. By dividing the procedure into several routines and processing them in parallel, we have achieved ~40 times speedup for one Nvidia GPU card compared to sequential CPU code, and further scaled the code to multiple GPUs. We apply this parallelized code to detect potential missing aftershocks around the 2003 Mw 6.5 San Simeon and 2004 Mw 6.0 Parkfield earthquakes in Central California, and around the 2010 Mw 7.2 El Mayor-Cucapah earthquake in southern California. In all these cases, we can detect several tens of times more earthquakes immediately after the mainshocks as compared with those listed in the catalogs. These newly identified earthquakes are revealing new information about the physical mechanisms responsible for triggering aftershocks in the near field. We plan to improve our code so that it can be executed in large-scale GPU clusters. Our work has the long-term goal of developing scalable methods for seismic data analysis in the context of "Big Data" challenges.
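The matched filter technique described above can be sketched as a sliding normalized cross-correlation between one template and one continuous trace. This is a single-channel illustration with a fixed detection threshold; the function names and the threshold value are assumptions, and real workflows typically stack correlations over many stations and components.

```python
import math

def normalized_cross_correlation(template, trace):
    """Correlation coefficient of the template with each window of the trace."""
    m = len(template)
    t_mean = sum(template) / m
    t_dev = [v - t_mean for v in template]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    cc = []
    for start in range(len(trace) - m + 1):
        window = trace[start:start + m]
        w_mean = sum(window) / m
        w_dev = [v - w_mean for v in window]
        w_norm = math.sqrt(sum(d * d for d in w_dev))
        denom = t_norm * w_norm
        # A flat window has no defined correlation; report 0 for it.
        cc.append(sum(a * b for a, b in zip(t_dev, w_dev)) / denom if denom > 0 else 0.0)
    return cc

def detect(template, trace, threshold=0.9):
    """Sample offsets where the correlation reaches the (assumed) threshold."""
    cc = normalized_cross_correlation(template, trace)
    return [i for i, c in enumerate(cc) if c >= threshold]
```

Each window offset is independent of the others, as is each template and each station, which is the multi-level parallelism the abstract refers to.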
Quantum.Ligand.Dock: protein-ligand docking with quantum entanglement refinement on a GPU system.
Kantardjiev, Alexander A
2012-07-01
Quantum.Ligand.Dock (protein-ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein-ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and proton equilibria mutual effects of docking partners. On the other hand, Quantum.Ligand.Dock is the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for development and proposition of the method to the community hinges upon two arguments: the fundamental importance of quantum entanglement contribution in molecular interaction and the realistic possibility to implement it by the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein-ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand molecule structures are supposed to be submitted as atomic coordinate files in PDB format. A customization section is offered for addition of user-specified charges, extra ionogenic groups with intrinsic pK(a) values or fixed ions. Final predicted complexes are ranked according to obtained scores and provided in PDB format as well as interactive visualization in a molecular viewer. Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html. PMID:22669908
Using the GPU based Model Tsunami-HySEA for the Italian CTSP
NASA Astrophysics Data System (ADS)
Gonzalez Vida, J. M., Sr.; Castro, M. J.; Macias, J.; de la Asuncion, M.; Molinari, I.; Melini, D.; Romano, F.; Tonini, R.; Lorito, S.; Piatanesi, A.
2015-12-01
The Istituto Nazionale di Geofisica e Vulcanologia of Italy (INGV) in collaboration with the EDANYA Group (University of Málaga) are proposing a FTRT (Faster Than Real Time) tsunami simulation approach that is being implemented in the NEAMTWS Italian CTSP, namely the Centro Allerta Tsunami (CAT), which has been in a pre-operational stage since 1 October 2014, in the 24/7 seismic monitoring room at INGV. We here present the different versions and capabilities of the NTHMP benchmarked HySEA model, developed by EDANYA Group. HySEA implements a multi-GPU version that can compute in several minutes the maximum amplitude and arrival time of the main tsunami wave in about 17,000 selected locations along the Mediterranean. At the same time, HySEA is implemented in nested meshes with different resolution and multi-GPU environment, which allows much faster than real time (under a few minutes) inundation simulations. The performance of the code also allows the preparation of a huge number of different pre-calculated scenarios that are being used for PTHA and for warning applications. Acknowledgements. This research has been partially supported by the Junta de Andalucía research project TESELA (P11-RNM7069) and the Spanish Government Research project DAIFLUID (MTM2012-38383-C02-01). These results have also received funding from the Italian Flagship Project RITMARE, from the INGV-DPC agreement, All.B2, and from EU FP7 project ASTARTE, Assessment, Strategy and Risk Reduction for Tsunamis in Europe grant n° 603839 (Project ASTARTE). The multi-GPU computations were performed at the Laboratory of Numerical Methods (University of Malaga).
Huang Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung
2011-03-20
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is very suitable for GPU implementation to take advantage of the hardware's efficiency and parallelism, where radiances of many channels can be calculated in parallel on GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellite, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with a total of 960 compute cores, delivering nearly 4 TFlops theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x
NASA Astrophysics Data System (ADS)
Huang, Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung
2011-03-01
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is well suited to a GPU implementation that takes advantage of the hardware's efficiency and parallelism, since radiances of many channels can be calculated in parallel on GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI), launched in 2006 onboard METOP-A, the first of the European meteorological polar-orbiting satellites. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module, which computes the regression predictors, takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs totaling 960 compute cores, delivering nearly 4 TFlops of theoretical peak performance. By massively parallelizing the second and third modules, we reached 364
NASA Technical Reports Server (NTRS)
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breathtaking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading-edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system-resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions
Multi-GPU three dimensional Stokes solver for simulating glacier flow
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2016-04-01
Here we present a three-dimensional Stokes solver that we recently developed for GPUs and apply it to glacier flow. We numerically solve the Stokes momentum balance equations together with the incompressibility equation, while also taking into account the strong nonlinearities of ice rheology. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme with preconditioning of residuals. Differential equations are discretized on a regular staggered grid. We have ported it to C-CUDA to run on GPUs in parallel, using MPI. We demonstrate the accuracy and efficiency of our model with a manufactured analytical solution test for three-dimensional Stokes ice sheet models (Leng et al., 2013) and by comparison with other well-established ice sheet models on the diagnostic ISMIP-HOM benchmark experiments (Pattyn et al., 2008). The results show that our model is capable of accurately and efficiently solving the Stokes system of equations in a variety of test scenarios, while preserving good parallel efficiency on up to 80 GPUs. For example, in 3D test scenarios with 250000 grid points our solver converges in around 3 minutes for single precision computations and around 10 minutes for double precision computations. We have also optimized the code to run efficiently on our newly acquired state-of-the-art GPU cluster octopus. This allows us to solve our problem on more than 20 million grid points, just by increasing the number of GPUs used, while keeping the computation time the same. In future work we will apply our solver to real-world applications and implement free surface evolution capabilities. REFERENCES Leng, W., Ju, L., Gunzburger, M. & Price, S., 2013. Manufactured solutions and the verification of three-dimensional Stokes ice-sheet models. Cryosphere 7, 19-29. Pattyn, F., Perichon, L., Aschwanden, A., Breuer, B., de Smedt, B., Gagliardini, O., Gudmundsson, G.H., Hindmarsh, R
A fast and memory-sparing probabilistic selection algorithm for the GPU
Monroe, Laura M; Wendelberger, Joanne; Michalak, Sarah
2010-09-29
A fast and memory-sparing probabilistic top-N selection algorithm is implemented on the GPU. This probabilistic algorithm gives a deterministic result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces both the memory requirements and the average time required for the algorithm. This algorithm is well-suited to more general parallel processors with multiple layers of memory hierarchy. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be especially useful for processors having a limited amount of fast memory available.
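The selection idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' GPU implementation, whose details the abstract does not give; the function name `top_n`, the sample size, the retry limit, and the threshold rule are all assumptions chosen for illustration. A random sample proposes a threshold, only the values at or above it are sorted, and an unlucky threshold triggers a retry, so the returned result is always exact.

```python
import random


def top_n(values, n, sample_size=64, max_retries=32, rng=None):
    """Return the n largest values in descending order (Las Vegas style sketch)."""
    rng = rng or random.Random()
    if n <= 0:
        return []
    if n >= len(values):
        return sorted(values, reverse=True)
    for _ in range(max_retries):
        # Cheap step: draw a small random sample (with replacement).
        sample = sorted((rng.choice(values) for _ in range(sample_size)), reverse=True)
        # Pick a sample rank slightly deeper than the proportional one so that
        # the threshold is likely to admit at least n values.
        rank = min(sample_size - 1, sample_size * n // len(values) + 2)
        threshold = sample[rank]
        # Heavy processing (the sort) is restricted to the surviving candidates.
        candidates = [v for v in values if v >= threshold]
        if len(candidates) >= n:
            # Every discarded value is below every candidate, so the answer is exact.
            return sorted(candidates, reverse=True)[:n]
        # Otherwise the threshold was too high: retry with a fresh sample.
    # Hypothetical fallback: a full sort guarantees termination after unlucky draws.
    return sorted(values, reverse=True)[:n]
```

Randomness here only affects how much data survives the filter, not the answer, which is the sense in which a probabilistic algorithm can give a deterministic result.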
Software Graphics Processing Unit (sGPU) for Deep Space Applications
NASA Technical Reports Server (NTRS)
McCabe, Mary; Salazar, George; Steele, Glen
2015-01-01
A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggests they cannot operate in the deep space radiation environment. Investigation into a Software Graphics Processing Unit (sGPU) comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
GPU-optimized Code for Long-term Simulations of Beam-beam Effects in Colliders
Roblin, Yves; Morozov, Vasiliy; Terzic, Balsa; Aturban, Mohamed A.; Ranjan, D.; Zubair, Mohammed
2013-06-01
We report on the development of a new code for long-term simulation of beam-beam effects in particle colliders. The underlying physical model relies on matrix-based arbitrary-order symplectic particle tracking for beam transport and the Bassetti-Erskine approximation for beam-beam interaction. The computations are accelerated through a parallel implementation on a hybrid GPU/CPU platform. With the new code, previously computationally prohibitive long-term simulations become tractable. We use the new code to model the proposed medium-energy electron-ion collider (MEIC) at Jefferson Lab.
GPU-based four-dimensional general-relativistic ray tracing
NASA Astrophysics Data System (ADS)
Kuchelmeister, Daniel; Müller, Thomas; Ament, Marco; Wunner, Günter; Weiskopf, Daniel
2012-10-01
This paper presents a new general-relativistic ray tracer that enables image synthesis on an interactive basis by exploiting the performance of graphics processing units (GPUs). The application is capable of visualizing the distortion of the stellar background as well as trajectories of moving astronomical objects orbiting a compact mass. Its source code includes metric definitions for the Schwarzschild and Kerr spacetimes that can be easily extended to other metric definitions, relying on its object-oriented design. The basic functionality features a scene description interface based on the scripting language Lua, real-time image output, and the ability to edit almost every parameter at runtime. The ray tracing code itself is implemented for parallel execution on the GPU using NVidia's Compute Unified Device Architecture (CUDA), which leads to performance improvement of an order of magnitude compared to a single CPU and makes the application competitive with small CPU cluster architectures.
Program summary
Program title: GpuRay4D
Catalog identifier: AEMV_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMV_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 73649
No. of bytes in distributed program, including test data, etc.: 1334251
Distribution format: tar.gz
Programming language: C++, CUDA.
Computer: Linux platforms with a NVidia CUDA enabled GPU (Compute Capability 1.3 or higher), C++ compiler, NVCC (The CUDA Compiler Driver).
Operating system: Linux.
RAM: 2 GB
Classification: 1.5.
External routines: OpenGL Utility Toolkit development files, NVidia CUDA Toolkit 3.2, Lua5.2
Nature of problem: Ray tracing in four-dimensional Lorentzian spacetimes.
Solution method: Numerical integration of light rays, GPU-based parallel programming using CUDA, 3D
GPU Accelerated Implementation of Density Functional Theory for Hybrid QM/MM Simulations.
Nitsche, Matías A; Ferreria, Manuel; Mocskos, Esteban E; González Lebrero, Mariano C
2014-03-11
Hybrid simulation tools (QM/MM) have evolved into a fundamental methodology for studying chemical reactivity in complex environments. This paper presents an implementation of electronic structure calculations based on density functional theory. This development is optimized for performing hybrid molecular dynamics simulations by making use of graphics processors (GPUs) for the most computationally demanding parts (exchange-correlation terms). The proposed implementation is able to take advantage of modern GPUs, running the relevant portions 20 to 30 times faster than the CPU version. The presented code was extensively tested, both in terms of numerical quality and performance, over systems of different size and composition. PMID:26580175
NASA Astrophysics Data System (ADS)
Perkins, Simon; Marais, Patrick; Zwart, Jonathan; Natarajan, Iniyan; Tasse, Cyril; Smirnov, Oleg
2015-02-01
Montblanc, written in Python, is a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. The parameter space that BIRO explores results in tens of thousands of computationally expensive RIME evaluations before reduction to a single χ² value. The RIME is calculated over four dimensions (time, baseline, channel and source), and the values in this 4D space can be calculated independently; therefore, the RIME is particularly amenable to a parallel implementation accelerated by Graphics Processing Units (GPUs). Montblanc is implemented for NVIDIA's CUDA architecture and outperforms MeqTrees (ascl:1209.010) and OSKAR.
A Survey of Methods for Analyzing and Improving GPU Energy Efficiency
Mittal, Sparsh; Vetter, Jeffrey S
2014-01-01
Recent years have witnessed phenomenal growth in the computational capabilities and applications of GPUs. However, this trend has also led to a dramatic increase in their power consumption. This paper surveys research on analyzing and improving the energy efficiency of GPUs. It also provides a classification of these techniques on the basis of their main research idea. Further, it attempts to synthesize research works that compare the energy efficiency of GPUs with that of other computing systems, e.g. FPGAs and CPUs. The aim of this survey is to provide researchers with knowledge of the state of the art in GPU power management and motivate them to architect the highly energy-efficient GPUs of tomorrow.
A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction.
Zheng, Shawn Q; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B; Cheng, Yifan; Sedat, John W; Agard, David A
2011-07-01
Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096² × 512 voxels from an input tilt series containing 122 projection images of 4096² pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024² × 256 voxels from 122 1024² pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. PMID:21741915
Performance analysis of parallel gravitational N-body codes on large GPU clusters
NASA Astrophysics Data System (ADS)
Huang, Si-Yi; Spurzem, Rainer; Berczik, Peter
2016-01-01
We compare the performance of two very different parallel gravitational N-body codes for astrophysical simulations on large Graphics Processing Unit (GPU) clusters, both of which are pioneers in their own fields as well as on certain mutual scales - NBODY6++ and Bonsai. We carry out benchmarks of the two codes by analyzing their performance, accuracy and efficiency through the modeling of structure decomposition and timing measurements. We find that both codes are heavily optimized to leverage the computational potential of GPUs as their performance has approached half of the maximum single precision performance of the underlying GPU cards. With such performance we predict that a speed-up of 200 - 300 can be achieved when up to 1k processors and GPUs are employed simultaneously. We discuss the quantitative information about comparisons of the two codes, finding that in the same cases Bonsai adopts larger time steps as well as larger relative energy errors than NBODY6++, typically ranging from 10 - 50 times larger, depending on the chosen parameters of the codes. Although the two codes are built for different astrophysical applications, in specified conditions they may overlap in performance at certain physical scales, thus allowing the user to choose either one by fine-tuning parameters accordingly.
Development of a GPU-Accelerated 3-D Full-Wave Code for Reflectometry Simulations
NASA Astrophysics Data System (ADS)
Reuther, K. S.; Kubota, S.; Feibush, E.; Johnson, I.
2013-10-01
1-D and 2-D full-wave codes used as synthetic diagnostics in microwave reflectometry are standard tools for understanding electron density fluctuations in fusion plasmas. The accuracy of the code depends on how well the wave properties along the ignored dimensions can be pre-specified or neglected. In a toroidal magnetic geometry, such assumptions are never strictly correct and ray tracing has shown that beam propagation is inherently a 3-D problem. Previously, we reported on the application of GPGPU (General-Purpose computing on Graphics Processing Units) to a 2-D FDTD (Finite-Difference Time-Domain) code ported to utilize the parallel processing capabilities of the NVIDIA C870 and C1060. Here, we report on the development of an FDTD code for 3-D problems. Initial tests will use NVIDIA's M2070 GPU and concentrate on the launching and propagation of Gaussian beams in free space. If available, results using a plasma target will also be presented. Performance will be compared with previous generations of GPGPU cards as well as with NVIDIA's newest K20C GPU. Finally, the possibility of utilizing multiple GPGPU cards in a cluster environment or in a single node will also be discussed. Supported by U.S. DoE Grants DE-FG02-99-ER54527 and DE-AC02-09CH11466 and the DoE National Undergraduate Fusion Fellowship.
GRay: A Massively Parallel GPU-based Code for Ray Tracing in Relativistic Spacetimes
NASA Astrophysics Data System (ADS)
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
GPU-based ultra-fast dose calculation using a finite size pencil beam model
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B.
2009-10-01
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposition coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedups ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy.
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines.
Teodoro, George; Pan, Tony; Kurc, Tahsin; Kong, Jun; Cooper, Lee; Saltz, Joel
2013-04-01
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The cooperative use of multiple CPUs and GPUs attains speedups of 50× and 85× with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively. PMID:23908562
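The wavefront pattern described in this abstract can be illustrated with a minimal sequential Python sketch of grayscale morphological reconstruction. It uses one plain FIFO queue rather than the multi-level GPU queue structure of the paper, and the function name, the 4-connected neighborhood, and the choice to seed the queue with every pixel are assumptions made for brevity.

```python
from collections import deque


def morphological_reconstruction(marker, mask):
    """Reconstruct `marker` under `mask` (2D lists) by queue-driven propagation."""
    rows, cols = len(marker), len(marker[0])
    # Clamp the marker so that it never exceeds the mask.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    # Assumption: start with every pixel in the wavefront (simple but not minimal).
    queue = deque((r, c) for r in range(rows) for c in range(cols))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Propagation condition: the neighbor can be raised, limited by the mask.
                new_val = min(out[r][c], mask[nr][nc])
                if new_val > out[nr][nc]:
                    out[nr][nc] = new_val
                    queue.append((nr, nc))  # the neighbor joins the wavefront
    return out
```

Which pixels enter the queue depends on the data, which is the source of the irregular accesses the abstract mentions.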
Alternating dual updates algorithm for X-ray CT reconstruction on the GPU
McGaffin, Madison G.; Fessler, Jeffrey A.
2015-01-01
Model-based image reconstruction (MBIR) for X-ray computed tomography (CT) offers improved image quality and potential low-dose operation, but has yet to reach ubiquity in the clinic. MBIR methods form an image by solving a large statistically motivated optimization problem, and the long time it takes to numerically solve this problem has hampered MBIR’s adoption. We present a new optimization algorithm for X-ray CT MBIR based on duality and group coordinate ascent that may converge even with approximate updates and can handle a wide range of regularizers, including total variation (TV). The algorithm iteratively updates groups of dual variables corresponding to terms in the cost function; these updates are highly parallel and map well onto the GPU. Although the algorithm stores a large number of variables, the “working size” for each of the algorithm’s steps is small and can be efficiently streamed to the GPU while other calculations are being performed. The proposed algorithm converges rapidly on both real and simulated data and shows promising parallelization over multiple devices. PMID:26878031
McGreevy, Ryan; Isralewitz, Barry
2014-01-01
Hybrid structure fitting methods combine data from cryo-electron microscopy and X-ray crystallography with molecular dynamics simulations for the determination of all-atom structures of large biomolecular complexes. Evaluating the quality-of-fit obtained from hybrid fitting is computationally demanding, particularly in the context of a multiplicity of structural conformations that must be evaluated. Existing tools for quality-of-fit analysis and visualization have previously targeted small structures and are too slow to be used interactively for large biomolecular complexes of particular interest today such as viruses or for long molecular dynamics trajectories as they arise in protein folding. We present new data-parallel and GPU-accelerated algorithms for rapid interactive computation of quality-of-fit metrics linking all-atom structures and molecular dynamics trajectories to experimentally determined density maps obtained from cryo-electron microscopy or X-ray crystallography. We evaluate the performance and accuracy of the new quality-of-fit analysis algorithms vis-a-vis existing tools, examine algorithm performance on GPU-accelerated desktop workstations and supercomputers, and describe new visualization techniques for results of hybrid structure fitting methods. PMID:25340325
Fireflies: New Software for Interactively Exploring Dynamical Systems Using GPU Computing
NASA Astrophysics Data System (ADS)
Merrison-Hort, Robert
2015-12-01
In nonlinear systems, where explicit analytic solutions usually cannot be found, visualization is a powerful approach which can give insights into the dynamical behavior of models; it is also crucial for teaching this area of mathematics. In this paper, we present new software, Fireflies, which exploits the power of graphical processing unit (GPU) computing to produce spectacular interactive visualizations of arbitrary systems of ordinary differential equations. In contrast to typical phase portraits, Fireflies draws the current position of trajectories (projected onto 2D or 3D space) as single points of light, which move as the system is simulated. Due to the massively parallel nature of GPU hardware, Fireflies is able to simulate millions of trajectories in parallel (even on standard desktop computer hardware), producing “swarms” of particles that move around the screen in real-time according to the equations of the system. Particles that move forwards in time reveal stable attractors (e.g. fixed points and limit cycles), while the option of integrating another group of trajectories backwards in time can reveal unstable objects (repellers). Fireflies allows the user to change the parameters of the system as it is running, in order to see the effect that they have on the dynamics and to observe bifurcations. We demonstrate the capabilities of the software with three examples: a 2D “mean field” model of neuronal activity, the classical Lorenz system, and a 15D model of three interacting biologically realistic neurons.
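A minimal Python sketch of the swarm idea, using the classical Lorenz system mentioned above. This is not the Fireflies code: the function names, the fourth-order Runge-Kutta integrator, the step size, and the initial cube of side 20 are assumptions for illustration. Each particle is advanced independently, and that inner loop is the part a GPU would execute in parallel.

```python
import random


def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations with the classical parameters."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """Advance one state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_swarm(n_particles=100, n_steps=500, dt=0.01, seed=0):
    """Return the final positions of a swarm of independent trajectories."""
    rng = random.Random(seed)
    swarm = [tuple(rng.uniform(-10.0, 10.0) for _ in range(3)) for _ in range(n_particles)]
    for _ in range(n_steps):
        # Particles do not interact, so this update is trivially parallel.
        swarm = [rk4_step(lorenz, p, dt) for p in swarm]
    return swarm
```

Calling `rk4_step` with a negative `dt` integrates backwards in time, which is how a second group of trajectories could be used to look for repellers.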
A Survey on GPU-Based Implementation of Swarm Intelligence Algorithms.
Tan, Ying; Ding, Ke
2016-09-01
Inspired by the collective behavior of natural swarms, swarm intelligence algorithms (SIAs) have been developed and widely used for solving optimization problems. When applied to complex problems, a large number of fitness function evaluations are needed to obtain an acceptable solution. To tackle this vital issue, graphical processing units (GPUs) have been used to accelerate the optimization procedure of SIAs. Thanks to their inherent parallelism, SIAs are very suitable for parallel implementation on the GPU platform, which has achieved great success in recent years. This paper presents a comprehensive review of GPU-based parallel SIAs in accordance with a newly proposed taxonomy. Critical concerns for the efficient parallel implementation of SIAs are also described in detail. Moreover, novel criteria are proposed to evaluate and compare parallel implementations and algorithm performance universally. The rationality and practicability of the proposed optimization methodology and criteria are verified by a careful case study. Finally, our opinions and perspectives on the trends and prospects of this relatively new research domain are presented for future development. PMID:26571543
Interesting Spatio-Temporal Region Discovery Computations Over GPU and MapReduce Platforms
NASA Astrophysics Data System (ADS)
McDermott, M.; Prasad, S. K.; Shekhar, S.; Zhou, X.
2015-07-01
Discovery of interesting paths and regions in spatio-temporal data sets is important to many fields, such as the earth and atmospheric sciences, GIS, public safety and public health, both as a goal in itself and as a preliminary step in a larger series of computations. This discovery is usually an exhaustive procedure that quickly becomes extremely time consuming with traditional paradigms and hardware and, given the rapidly growing sizes of today's data sets, is quickly outpacing the speed at which computational capacity is growing. In our previous work (Prasad et al., 2013a) we achieved a 50-fold speedup over sequential execution using a single GPU. We were able to achieve near-linear speedup over this result on interesting path discovery by using Apache Hadoop to distribute the workload across multiple GPU nodes. Leveraging the parallel architecture of GPUs, we were able to drastically reduce the computation time of a 3-dimensional spatio-temporal interest region search on a single tile of normalized difference vegetation index for Saudi Arabia. We were further able to see an almost linear speedup in compute performance by distributing this workload across several GPUs with a simple MapReduce model. This increases the processing speed 10-fold over the comparable sequential implementation while simultaneously increasing the amount of data being processed 384-fold. This allowed us to process the entirety of the selected data set instead of a constrained window.
Examination of nanoparticle dispersion using a novel GPU based radial distribution function code
NASA Astrophysics Data System (ADS)
Rosch, Thomas; Wade, Matthew; Phelan, Frederick
We have developed a novel GPU-based code that rapidly calculates the radial distribution function (RDF) for an entire system, with no cutoff, ensuring accuracy. Built on top of this code, we have developed tools to calculate the second virial coefficient (B2) and the structure factor from the RDF, two properties that are directly related to the dispersion of nanoparticles in nanocomposite systems. We validate the RDF calculations by comparison with previously published results, and also show how our code, which takes into account bonding in polymeric systems, enables more accurate predictions of g(r) than the state-of-the-art GPU-based RDF codes currently available for these systems. In addition, our code reduces the computational time by approximately an order of magnitude compared to CPU-based calculations. We demonstrate the application of our toolset by examining a coarse-grained nanocomposite system and show how different surface energies between particle and polymer lead to different dispersion states and affect properties such as viscosity, yield strength, elasticity, and thermal conductivity.
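For readers unfamiliar with the quantity, a minimal CPU-side Python sketch of an all-pairs g(r) calculation follows. It is not the code described in the abstract and ignores bonding; the function name, the periodic cubic box, and the minimum-image convention are assumptions, and under that convention the histogram range `dr * n_bins` should not exceed half the box length.

```python
import math


def radial_distribution(points, box, dr, n_bins):
    """g(r) for points in a periodic cubic box of side `box`, visiting every pair."""
    n = len(points)
    counts = [0] * n_bins
    for i in range(n - 1):
        for j in range(i + 1, n):
            d2 = 0.0
            for a, b in zip(points[i], points[j]):
                d = a - b
                d -= box * round(d / box)  # minimum-image convention
                d2 += d * d
            k = int(math.sqrt(d2) / dr)
            if k < n_bins:
                counts[k] += 1
    density = n / box ** 3
    g = []
    for k, c in enumerate(counts):
        # Volume of the spherical shell covered by bin k.
        shell = 4.0 / 3.0 * math.pi * (((k + 1) * dr) ** 3 - (k * dr) ** 3)
        # Number of pairs an ideal gas of the same density would place in this shell.
        ideal_pairs = 0.5 * n * density * shell
        g.append(c / ideal_pairs)
    return g
```

The double loop over pairs is the O(N²) part that makes a GPU attractive: every pair distance can be computed independently and accumulated into the histogram.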
Synthetic transmit aperture technique in medical ultrasound imaging implemented on a GPU
NASA Astrophysics Data System (ADS)
Li, Ying; Chen, Xiaodong; Zhang, Chuang; Wang, Yi; Jiao, Zhihai; Yu, Daoyin
2014-11-01
In medical ultrasound imaging, the synthetic transmit aperture (STA) technique is very promising and has been a hot research topic. It is dynamically focused in both transmit and receive, yielding an improvement in resolution. But this imaging technique places high demands on processing capabilities, which makes implementation of a full STA system very challenging and costly. Many attempts have been made to reduce the demands on the system, making it a more realistic task to implement. In this paper we do not consider how to reduce the demands, but how to accelerate the processing speed of the system. The recent introduction of general-purpose graphics processing units (GPUs) seems quite promising in this respect, especially given the affordable programming complexity. In this paper we explain the main computational features of the STA processing unit, trying to disclose the degree of parallelism in the operations. On the basis of the compute unified device architecture (CUDA) programming model and the extremely flexible structure of the Single Instruction Multiple Threads (SIMT) model, we show that the optimization of the STA processing unit can be performed more efficiently. The input data are read from Matlab, and the post-processing and display also use Matlab. Performance results show that a single NVIDIA GTX-650 GPU board achieves a speedup of more than a factor of 30 compared to a highly optimized beamformer running on our test workstation with a 3.20-GHz Intel Core-i5 processor.
A GPU accelerated Barnes-Hut tree code for FLASH4
NASA Astrophysics Data System (ADS)
Lukat, Gunther; Banerjee, Robi
2016-05-01
We present a GPU accelerated CUDA-C implementation of the Barnes-Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and is therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. We use a MacLaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores with effective AMR to carry out performance tests, also in comparison with previous gravity solvers. Depending on the setup and the GPU/CPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison to the gravity solvers implemented in the FLASH4 code. We find an overall speedup factor for full simulations of at least a factor of 1.6 and up to a factor of 10.
GPU simulation of nonlinear propagation of dual band ultrasound pulse complexes
NASA Astrophysics Data System (ADS)
Kvam, Johannes; Angelsen, Bjørn A. J.; Elster, Anne C.
2015-10-01
In a new method of ultrasound imaging, called SURF imaging, dual band pulse complexes composed of overlapping low frequency (LF) and high frequency (HF) pulses are transmitted, where the frequency ratio LF:HF ~ 1 : 20, and the relative bandwidth of both pulses is ~ 50 - 70%. The LF pulse length is hence ~ 20 times the HF pulse length. The LF pulse is used to nonlinearly manipulate the material elasticity observed by the co-propagating HF pulse. This produces nonlinear interaction effects that give more information on the propagation of the pulse complex. Due to the large difference in frequency and pulse length between the LF and the HF pulses, we have developed a dual level simulation where the LF pulse propagation is first simulated independently of the HF pulse, using a temporal sampling frequency matched to the LF pulse. A separate equation for the HF pulse is developed, where the presimulated LF pulse modifies the propagation velocity. The equations are adapted to parallel processing on a GPU, where nonlinear simulations of a typical HF beam of 10 MHz down to 40 mm are done in ~ 2 s on a standard GPU. This simulation is hence very useful for studying the manipulation effect of the LF pulse on the HF pulse.
Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems
Dongarra, Jack J.; Tomov, Stanimire
2014-03-24
The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energy efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.
A GPU-Based Gibbs Sampler for a Unidimensional IRT Model
Welling, William S.; Zhu, Michelle M.
2014-01-01
Item response theory (IRT) is a popular approach used for addressing large-scale statistical problems in psychometrics as well as in other fields. The fully Bayesian approach for estimating IRT models is usually memory and computationally expensive due to the large number of iterations. This limits the use of the procedure in many applications. In an effort to overcome such restraint, previous studies focused on utilizing the message passing interface (MPI) in a distributed memory-based Linux cluster to achieve certain speedups. However, given the high data dependencies in a single Markov chain for IRT models, the communication overhead rapidly grows as the number of cluster nodes increases. This makes it difficult to further improve the performance under such a parallel framework. This study aims to tackle the problem using massive core-based graphic processing units (GPU), which is practical, cost-effective, and convenient in actual applications. The performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and therefore is preferred. PMID:27355058
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
GPU simulation of nonlinear propagation of dual band ultrasound pulse complexes
Kvam, Johannes; Angelsen, Bjørn A. J.; Elster, Anne C.
2015-10-28
In a new method of ultrasound imaging, called SURF imaging, dual band pulse complexes composed of overlapping low frequency (LF) and high frequency (HF) pulses are transmitted, where the frequency ratio LF:HF ∼ 1 : 20, and the relative bandwidth of both pulses is ∼ 50 − 70%. The LF pulse length is hence ∼ 20 times the HF pulse length. The LF pulse is used to nonlinearly manipulate the material elasticity observed by the co-propagating HF pulse. This produces nonlinear interaction effects that give more information on the propagation of the pulse complex. Due to the large difference in frequency and pulse length between the LF and the HF pulses, we have developed a dual level simulation where the LF pulse propagation is first simulated independently of the HF pulse, using a temporal sampling frequency matched to the LF pulse. A separate equation for the HF pulse is developed, where the presimulated LF pulse modifies the propagation velocity. The equations are adapted to parallel processing on a GPU, where nonlinear simulations of a typical HF beam of 10 MHz down to 40 mm are done in ∼ 2 s on a standard GPU. This simulation is hence very useful for studying the manipulation effect of the LF pulse on the HF pulse.
NASA Astrophysics Data System (ADS)
Mena, Andres; Ferrero, Jose M.; Rodriguez Matas, Jose F.
2015-11-01
Solving the electric activity of the heart poses a big challenge, not only because of the structural complexities inherent to the heart tissue, but also because of the complex electric behaviour of the cardiac cells. The multi-scale nature of the electrophysiology problem makes its numerical solution difficult, requiring temporal and spatial resolutions of 0.1 ms and 0.2 mm respectively for accurate simulations, leading to models with millions of degrees of freedom that need to be solved for thousands of time steps. Solution of this problem requires the use of algorithms with a higher level of parallelism on multi-core platforms. In this regard, the newer programmable graphics processing units (GPUs) have become a valid alternative due to their tremendous computational horsepower. This paper presents results obtained with a novel electrophysiology simulation software entirely developed in Compute Unified Device Architecture (CUDA). The software implements fully explicit and semi-implicit solvers for the monodomain model, using operator splitting. Performance is compared with classical multi-core MPI based solvers operating on dedicated high-performance computer clusters. Results obtained with the GPU based solver show enormous potential for this technology, with accelerations over 50× for three-dimensional problems.
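To make the idea of a fully explicit, operator-split monodomain solver concrete, here is a small NumPy sketch on a 1D cable. It is only illustrative and rests on several assumptions not taken from the abstract: the cell model is FitzHugh-Nagumo (the paper's ionic models are not specified here), the splitting is first-order (reaction step, then diffusion step), and the function name `monodomain_step` and all parameter values are hypothetical.

```python
import numpy as np

def monodomain_step(v, w, dt, dx, diffusion=1.0, a=0.1, eps=0.01, gamma=0.5):
    """One first-order operator-splitting step on a 1D cable (illustrative only)."""
    # Reaction step: pointwise cell-model update (assumed FitzHugh-Nagumo,
    # forward Euler). Each node is independent, which maps well to GPU threads.
    dv = v * (1.0 - v) * (v - a) - w
    dw = eps * (v - gamma * w)
    v = v + dt * dv
    w = w + dt * dw
    # Diffusion step: explicit central differences with no-flux boundaries,
    # implemented by replicating the edge values into ghost cells.
    padded = np.pad(v, 1, mode="edge")
    laplacian = (padded[2:] - 2.0 * v + padded[:-2]) / dx ** 2
    return v + dt * diffusion * laplacian, w
```

The explicit diffusion step is stable only when `diffusion * dt / dx**2 <= 0.5` in 1D, which is why fine spatial resolution forces small time steps and why a semi-implicit variant is attractive.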
Taylor, Z A; Comas, O; Cheng, M; Passenger, J; Hawkes, D J; Atkinson, D; Ourselin, S
2009-04-01
Efficient and accurate techniques for simulation of soft tissue deformation are an increasingly valuable tool in many areas of medical image computing, such as biomechanically-driven image registration and interactive surgical simulation. For reasons of efficiency most analyses are based on simplified linear formulations, and previously almost all have ignored well established features of tissue mechanical response such as anisotropy and time-dependence. We address these latter issues by firstly presenting a generalised anisotropic viscoelastic constitutive framework for soft tissues, particular cases of which have previously been used to model a wide range of tissues. We then develop an efficient solution procedure for the accompanying viscoelastic hereditary integrals which allows use of such models in explicit dynamic finite element algorithms. We show that the procedure allows incorporation of both anisotropy and viscoelasticity for as little as 5.1% additional cost compared with the usual isotropic elastic models. Finally we describe the implementation of a new GPU-based finite element scheme for soft tissue simulation using the CUDA API. Even with the inclusion of more elaborate constitutive models as described the new implementation affords speed improvements compared with our recent graphics API-based implementation, and compared with CPU execution a speed up of 56.3 x is achieved. The validity of the viscoelastic solution procedure and performance of the GPU implementation are demonstrated with a series of numerical examples. PMID:19019721
Sound speed estimation using wave-based ultrasound tomography: theory and GPU implementation
NASA Astrophysics Data System (ADS)
Roy, O.; Jovanović, I.; Hormati, A.; Parhizkar, R.; Vetterli, M.
2010-03-01
We present preliminary results obtained using a time domain wave-based reconstruction algorithm for an ultrasound transmission tomography scanner with a circular geometry. While a comprehensive description of this type of algorithm has already been given elsewhere, the focus of this work is on some practical issues arising with this approach. In fact, wave-based reconstruction methods suffer from two major drawbacks which limit their application in a practical setting: convergence is difficult to obtain and the computational cost is prohibitive. We address the first problem by appropriate initialization using a ray-based reconstruction. Then, the complexity of the method is reduced by means of an efficient parallel implementation on graphical processing units (GPU). We provide a mathematical derivation of the wave-based method under consideration, describe some details of our implementation and present simulation results obtained with a numerical phantom designed for a breast cancer detection application. The source code of our GPU implementation is freely available on the web at www.usense.org.
A modular cross-platform GPU-based approach for flexible 3D video playback
NASA Astrophysics Data System (ADS)
Olsson, Roger; Andersson, Håkan; Sjöström, Mårten
2011-03-01
Different compression formats for stereo and multiview based 3D video are being standardized, and software players capable of decoding and presenting these formats on different display types are a vital part of the commercialization and evolution of 3D video. However, the number of publicly available software video players capable of decoding and playing multiview 3D video is still quite limited. This paper describes the design and implementation of a GPU-based real-time 3D video playback solution, built on top of cross-platform, open source libraries for video decoding and hardware accelerated graphics. A software architecture is presented that efficiently processes and presents high definition 3D video in real time and flexibly supports both current 3D video formats and emerging standards. Moreover, a set of bottlenecks in the processing of 3D video content in a GPU-based real-time 3D video playback solution is identified and discussed.
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2010-01-01
Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Edge-preserving image denoising via group coordinate descent on the GPU
McGaffin, Madison G.; Fessler, Jeffrey A.
2015-01-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel one-dimensional pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in place and only store the noisy data, denoised image and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation (TV). Both algorithms use the majorize-minimize (MM) framework to solve the one-dimensional pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time. PMID:25675454
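The group structure described above can be sketched on a 1D signal. In this simplified example, a quadratic roughness penalty is swapped in for the edge-preserving penalties of the paper, so each pixel update has a closed form and no majorize-minimize step is needed. The function name `gcd_denoise`, the parameter `beta`, and the even/odd grouping are assumptions chosen for illustration. Pixels in the same group are never neighbors, so all updates within a group are independent and are applied in place.

```python
import numpy as np

def gcd_denoise(y, beta=2.0, n_iters=50):
    """Group coordinate descent for 0.5*||x - y||^2 + 0.5*beta*sum_i (x[i] - x[i+1])^2."""
    x = np.asarray(y, dtype=float).copy()
    n = len(x)
    idx = np.arange(n)
    for _ in range(n_iters):
        for parity in (0, 1):            # group 0: even pixels, group 1: odd pixels
            group = idx[idx % 2 == parity]
            left = group[group > 0]      # group members that have a left neighbor
            right = group[group < n - 1] # group members that have a right neighbor
            nb_sum = np.zeros(n)
            nb_cnt = np.zeros(n)
            nb_sum[left] += x[left - 1]
            nb_cnt[left] += 1
            nb_sum[right] += x[right + 1]
            nb_cnt[right] += 1
            # Exact minimizer of the cost in each group pixel, neighbors held fixed.
            x[group] = (y[group] + beta * nb_sum[group]) / (1.0 + beta * nb_cnt[group])
    return x
```

Only the noisy data `y`, the current estimate `x`, and the parameters are kept between sweeps, mirroring the low memory footprint the abstract emphasizes; the temporary neighbor arrays here are a convenience of the vectorized NumPy form.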
3D Alternating Direction TV-Based Cone-Beam CT Reconstruction with Efficient GPU Implementation
Cai, Ailong; Zhang, Hanming; Li, Lei; Xi, Xiaoqi; Guan, Min; Li, Jianxin
2014-01-01
Iterative image reconstruction (IIR) with sparsity-exploiting methods, such as total variation (TV) minimization, claims potentially large reductions in sampling requirements. However, the computational complexity becomes a heavy burden, especially in 3D reconstruction. In order to improve the performance of iterative reconstruction, an efficient IIR algorithm for cone-beam computed tomography (CBCT) with GPU implementation is proposed in this paper. First, an algorithm based on alternating direction total variation using local linearization and a proximity technique is proposed for CBCT reconstruction. The applied proximal technique avoids the expensive pseudoinverse computation of a large matrix, which makes the proposed algorithm applicable and efficient for CBCT imaging. The iteration for this algorithm is simple but convergent. The simulation and real CT data reconstruction results indicate that the proposed algorithm is both fast and accurate. The GPU implementation shows an excellent acceleration ratio of more than 100 compared with CPU computation without losing numerical accuracy. The runtime for the new 3D algorithm is about 6.8 seconds per loop with an image size of 256 × 256 × 256 and 36 projections of size 512 × 512. PMID:25045400
Combined algorithmic and GPU acceleration for ultra-fast circular conebeam backprojection
NASA Astrophysics Data System (ADS)
Brokish, Jeffrey; Sack, Paul; Bresler, Yoram
2010-04-01
In this paper, we describe the first implementation and performance of a fast O(N^3 log N) hierarchical backprojection algorithm for cone beam CT with a circular trajectory, developed on a modern Graphics Processing Unit (GPU). The resulting tomographic backprojection system for 3D cone beam geometry combines speedup through algorithmic improvements provided by the hierarchical backprojection algorithm with speedup from a massively parallel hardware accelerator. For data parameters typical in diagnostic CT and using a mid-range GPU card, we report reconstruction speeds of up to 360 frames per second, and relative speedup of almost 6x compared to conventional backprojection on the same hardware. The significance of these results is twofold. First, they demonstrate that the reduction in operation counts demonstrated previously for the FHBP algorithm can be translated to a comparable run-time improvement in a massively parallel hardware implementation, while preserving stringent diagnostic image quality. Second, the dramatic speedup and throughput numbers achieved indicate the feasibility of systems based on this technology, which achieve real-time 3D reconstruction for state-of-the-art diagnostic CT scanners with small footprint, high reliability, and affordable cost.
Odyssey: A Public GPU-based Code for General Relativistic Radiative Transfer in Kerr Spacetime
NASA Astrophysics Data System (ADS)
Pu, Hung-Yi; Yun, Kiyun; Younsi, Ziri; Yoon, Suk-Jin
2016-04-01
General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge-Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey_Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.
GraviDy: a modular, GPU-based, direct-summation N-body code
NASA Astrophysics Data System (ADS)
Maureira-Fredes, Cristián; Amaro-Seoane, Pau
2016-02-01
The direct summation of N gravitational forces is a complex problem for which there is no analytical solution. Dense stellar systems such as galactic nuclei and stellar clusters are the loci of different interesting problems. In this work we present a new GPU-based, direct-summation N-body integrator written from scratch and based on the Hermite scheme. The first release of the code consists of the Hermite integrator for a system of N bodies with softening. We find an acceleration factor of about 90 for the GPU version on a single node as compared to the serial single-CPU one. We additionally investigate the impact of using softening in the dynamics of a dense cluster. We study how it affects the two-body relaxation, as compared with another code, NBODY6, which uses KS regularization, so as to understand the role of softening in the evolution of the system. This initial release is the first step towards more and more realistic scenarios, starting with a proper treatment of binary evolution, close encounters and the role of a massive black hole.
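The core of any direct-summation code is the O(N^2) force evaluation with softening. The sketch below shows that kernel in NumPy; it is not taken from GraviDy, it computes accelerations only (a Hermite integrator would also need the time derivative of the acceleration), and the function name `accelerations` and the default softening value are assumptions.

```python
import numpy as np

def accelerations(pos, mass, softening=1e-2, G=1.0):
    """Softened direct-summation gravitational accelerations, O(N^2) in the body count."""
    delta = pos[None, :, :] - pos[:, None, :]          # delta[i, j] = r_j - r_i
    dist2 = (delta ** 2).sum(axis=-1) + softening ** 2 # softened squared distances
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)                      # exclude self-interaction
    # a_i = G * sum_j m_j (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    return G * (delta * (mass[None, :, None] * inv_d3[:, :, None])).sum(axis=1)
```

The softening length keeps the force finite during close encounters, at the price of altering short-range dynamics, which is the trade-off the abstract compares against KS regularization in NBODY6. Each row of the pairwise sum is independent, so one GPU thread (or thread block) per body is a natural mapping.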
A novel CPU/GPU simulation environment for large-scale biologically realistic neural modeling
Hoang, Roger V.; Tanna, Devyani; Jayet Bray, Laurence C.; Dascalu, Sergiu M.; Harris, Frederick C.
2013-01-01
Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they have shown significant improvement in execution time compared to Central Processing Units (CPUs). Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks, the NeoCortical Simulator version 6 (NCS6). NCS6 is a free, open-source, parallelizable, and scalable simulator, designed to run on clusters of multiple machines, potentially with high performance computing devices in each of them. It has built-in leaky-integrate-and-fire (LIF) and Izhikevich (IZH) neuron models, but users also have the capability to design their own plug-in interface for different neuron types as desired. NCS6 is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing data across eight machines with each having two video cards. PMID:24106475
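For readers unfamiliar with the Izhikevich (IZH) neuron model named above, a single-neuron sketch follows. It uses the standard published form of the model with regular-spiking parameters and forward Euler integration; it is not NCS6 code, and the function name `simulate_izhikevich`, the step size, and the constant input current are assumptions for illustration.

```python
def simulate_izhikevich(current, steps, dt=0.5, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Integrate one Izhikevich neuron with forward Euler; return spike times in ms."""
    v, u = c, b * c          # membrane potential (mV) and recovery variable
    spikes = []
    for step in range(steps):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
        u += dt * a * (b * v - u)
        if v >= 30.0:        # spike threshold reached: record time and reset
            spikes.append(step * dt)
            v = c
            u += d
    return spikes
```

With zero input the neuron settles to rest and never fires, while a sustained input of 10 (in the model's units) produces repeated spiking. In a large network, every neuron runs this same update each step, which is what makes the workload a good fit for GPUs.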
Feasibility of GPU-assisted iterative image reconstruction for mobile C-arm CT
NASA Astrophysics Data System (ADS)
Pan, Yongsheng; Whitaker, Ross; Cheryauka, Arvi; Ferguson, Dave
2009-02-01
Computed tomography (CT) has been extensively studied and widely used for a variety of medical applications. The reconstruction of 3D images from a projection series is an important aspect of the modality. Reconstruction by filtered backprojection (FBP) is used by most manufacturers because of speed, ease of implementation, and relatively few parameters. Iterative reconstruction methods have a significant potential to provide superior performance with incomplete or noisy data, or with less than ideal geometries, such as cone-beam systems. However, iterative methods have a high computational cost, and regularization is usually required to reduce the effects of noise. The simultaneous algebraic reconstruction technique (SART) is studied in this paper, where the Feldkamp method (FDK) for filtered back projection is used as an initialization for iterative SART. Additionally, graphics hardware is utilized to increase the speed of SART implementation. Nvidia processors and compute unified device architecture (CUDA) form the platform for GPU computation. Total variation (TV) minimization is applied for the regularization of SART results. Preliminary results of SART on a 3-D Shepp-Logan phantom using TV regularization and GPU computation are presented in this paper. Potential improvements of the proposed framework are also discussed.
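The SART update itself is compact. The sketch below applies the standard SART iteration to a small dense system matrix; it is a toy version under stated assumptions, not the paper's implementation: the function name `sart` is hypothetical, `A` must be non-negative with non-zero row and column sums, the TV regularization step is omitted, and the optional `x0` argument stands in for an FDK initial estimate.

```python
import numpy as np

def sart(A, b, n_iters=200, relax=1.0, x0=None):
    """SART iterations for A x = b with a non-negative system matrix A."""
    row_sums = A.sum(axis=1)   # total weight of each ray
    col_sums = A.sum(axis=0)   # total weight touching each voxel
    x = np.zeros(A.shape[1]) if x0 is None else np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        # Forward-project, normalize the residual per ray, then back-project
        # and normalize per voxel.
        residual = (b - A @ x) / row_sums
        x += relax * (A.T @ residual) / col_sums
    return x
```

The forward projection `A @ x` and the back projection `A.T @ residual` dominate the cost; in a real scanner geometry `A` is never formed explicitly, and those two operators are what a CUDA implementation accelerates.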
Multi-GPU based on multicriteria optimization for motion estimation system
NASA Astrophysics Data System (ADS)
Garcia, Carlos; Botella, Guillermo; Ayuso, Fermin; Prieto, Manuel; Tirado, Francisco
2013-12-01
Graphics processor units (GPUs) offer high performance and power efficiency for a large number of data-parallel applications. Previous research has shown that a GPU-based version of a neuromorphic motion estimation algorithm can achieve a ×32 speedup using these devices. However, the memory consumption creates a bottleneck due to the expansive tree of signal processing operations performed. In the present contribution, an improvement in memory reduction was carried out, which limited accelerator viability usage. An evolutionary algorithm was used to find the best configuration. It supposes a trade-off solution between consumption resources, parallel efficiency, and accuracy. A multilevel parallel scheme was exploited: grain level by means of multi-GPU systems, and a finer level by data parallelism. In order to achieve a more relevant analysis, some optical flow benchmarks were used to validate this study. Satisfactory results opened the chance of building an intelligent motion estimation system that auto-adapted according to real-time, resource consumption, and accuracy requirements.
Extinction-based shading and illumination in GPU volume ray-casting.
Schlegel, Philipp; Makhinya, Maxim; Pajarola, Renato
2011-12-01
Direct volume rendering has become a popular method for visualizing volumetric datasets. Even though computers are continually getting faster, it remains a challenge to incorporate sophisticated illumination models into direct volume rendering while maintaining interactive frame rates. In this paper, we present a novel approach for advanced illumination in direct volume rendering based on GPU ray-casting. Our approach features directional soft shadows taking scattering into account, ambient occlusion and color bleeding effects while achieving very competitive frame rates. In particular, multiple dynamic lights and interactive transfer function changes are fully supported. Commonly, direct volume rendering is based on a very simplified discrete version of the original volume rendering integral, including the development of the original exponential extinction into α-blending. In contrast to α-blending forming a product when sampling along a ray, the original exponential extinction coefficient is an integral and its discretization a Riemann sum. The fact that it is a sum can cleverly be exploited to implement volume lighting effects, i.e. soft directional shadows, ambient occlusion and color bleeding. We will show how this can be achieved and how it can be implemented on the GPU. PMID:22034296
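The relationship between the two discretizations mentioned above can be checked numerically. In this short sketch, with hypothetical function names and a uniform sample spacing `dx` assumed, the transmittance along a ray is computed once as the exponential of a running Riemann sum of extinction coefficients and once as the running product of (1 - alpha) terms used in alpha blending.

```python
import numpy as np

def transmittance_sum(sigma, dx):
    """Transmittance as exp(-Riemann sum) of extinction coefficients along a ray."""
    return np.exp(-np.cumsum(sigma) * dx)

def transmittance_product(sigma, dx):
    """Same quantity in alpha-blending form: running product of (1 - alpha_i)."""
    alpha = 1.0 - np.exp(-sigma * dx)   # per-sample opacity
    return np.cumprod(1.0 - alpha)
```

Both functions return the same values up to rounding. The sum form is the one that lends itself to the tricks the abstract alludes to, because sums over regions can be accumulated, filtered, or looked up additively, whereas products cannot.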
Real-time GPU implementation of transverse oscillation vector velocity flow imaging
NASA Astrophysics Data System (ADS)
Bradway, David Pierson; Pihl, Michael Johannes; Krebs, Andreas; Tomov, Borislav Gueorguiev; Kjær, Carsten Straso; Nikolov, Svetoslav Ivanov; Jensen, Jørgen Arendt
2014-03-01
Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work, Open Computing Language (OpenCL) is used to estimate 2-D vector velocity flow in vivo in the carotid artery. Data are streamed live from a BK Medical 2202 Pro Focus UltraView Scanner to a workstation running a research interface software platform. Processing data from a 50 millisecond frame of a duplex vector flow acquisition takes 2.3 milliseconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner's built-in processing, it enables new opportunities for prototyping novel algorithms, optimizing processing parameters, and accelerating the path from development lab to clinic.
GPU-Based Block-Wise Nonlocal Means Denoising for 3D Ultrasound Images
Hou, Wenguang; Zhang, Xuming; Ding, Mingyue
2013-01-01
Speckle suppression plays an important role in improving ultrasound (US) image quality. While many algorithms have been proposed for 2D US image denoising with remarkable filtering quality, relatively little work has been done on 3D ultrasound speckle suppression, where the whole volume rather than just one frame needs to be considered. The most crucial problem with 3D US denoising is that the computational complexity increases tremendously. The nonlocal means (NLM) filter provides an effective method for speckle suppression in US images. In this paper, a programmable graphics-processing-unit (GPU)-based fast NLM filter is proposed for 3D ultrasound speckle reduction. A Gamma distribution noise model, which is able to reliably capture image statistics for log-compressed ultrasound images, was used for the 3D block-wise NLM filter on the basis of a Bayesian framework. The most significant aspect of our method was the adoption of the powerful data-parallel computing capability of the GPU to improve the overall efficiency. Experimental results demonstrate that the proposed method can enormously accelerate the algorithm. PMID:24348747
Application of a GPU-Assisted Maxwell Code to Electromagnetic Wave Propagation in ITER
NASA Astrophysics Data System (ADS)
Kubota, S.; Peebles, W. A.; Woodbury, D.; Johnson, I.; Zolfaghari, A.
2014-10-01
The Low Field Side Reflectometer (LFSR) on ITER is envisioned to provide capabilities for electron density profile and fluctuations measurements in both the plasma core and edge. The current design for the Equatorial Port Plug 11 (EPP11) employs seven monostatic antennas for use with both fixed-frequency and swept-frequency systems. The present work examines the characteristics of this layout using the 3-D version of the GPU-Assisted Maxwell Code (GAMC-3D). Previous studies in this area were performed with either 2-D full wave codes or 3-D ray- and beam-tracing. GAMC-3D is based on the FDTD method and can be run with either a fixed-frequency or modulated (e.g. FMCW) source, and with either a stationary or moving target (e.g. Doppler backscattering). The code is designed to run on a single NVIDIA Tesla GPU accelerator, and utilizes a technique based on the moving window method to overcome the size limitation of the onboard memory. Effects such as beam drift, linear mode conversion, and diffraction/scattering will be examined. Comparisons will be made with beam-tracing calculations using the complex eikonal method. Supported by U.S. DoE Grants DE-FG02-99ER54527 and DE-AC02-09CH11466, and the DoE SULI Program at PPPL.
Chen, Wenan; Ward, Kevin; Li, Qi; Kecman, Vojislav; Najarian, Kayvan; Menke, Nathan
2011-01-01
The coagulation and fibrinolytic systems are complex, inter-connected biological systems with major physiological roles. The complex, nonlinear multi-point relationships between the molecular and cellular constituents of the two systems render a comprehensive and simultaneous study of the system at the microscopic and macroscopic level a significant challenge. We have created an Agent Based Modeling and Simulation (ABMS) approach for simulating these complex interactions. As the number of agents increases, the time complexity and cost of the resulting simulations present a significant challenge. As such, in this paper, we also present a high-speed framework for the coagulation simulation utilizing the computing power of graphics processing units (GPU). For comparison, we also implemented the simulations in NetLogo, Repast, and a direct C version. As our experiments demonstrate, the GPU implementation at the million-agent scale is over 10 times faster than the C version, over 100 times faster than the Repast version and over 300 times faster than the NetLogo simulation. PMID:22254271
The GENGA code: gravitational encounters in N-body simulations with GPU acceleration
Grimm, Simon L.; Stadel, Joachim G.
2014-11-20
We describe an open source GPU implementation of a hybrid symplectic N-body integrator, GENGA (Gravitational ENcounters with Gpu Acceleration), designed to integrate planet and planetesimal dynamics in the late stage of planet formation and stability analyses of planetary systems. GENGA uses a hybrid symplectic integrator to handle close encounters with very good energy conservation, which is essential in long-term planetary system integration. We extended the second-order hybrid integration scheme to higher orders. The GENGA code supports three simulation modes: integration of up to 2048 massive bodies, integration with up to a million test particles, or parallel integration of a large number of individual planetary systems. We compare the results of GENGA to Mercury and pkdgrav2 in terms of energy conservation and performance and find that the energy conservation of GENGA is comparable to Mercury and around two orders of magnitude better than pkdgrav2. GENGA runs up to 30 times faster than Mercury and up to 8 times faster than pkdgrav2. GENGA is written in CUDA C and runs on all NVIDIA GPUs with a computing capability of at least 2.0.
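As a rough illustration of why symplectic schemes are favoured for long-term orbital integration, the sketch below implements a plain second-order kick-drift-kick leapfrog for a test body around a central mass. It is not GENGA's hybrid integrator and has no close-encounter handling; the function names and the units (G*M = 1) are assumptions chosen for the example.

```python
import math

def acceleration(pos, gm=1.0):
    """Acceleration of a test body at pos due to a central mass (G*M = gm)."""
    x, y = pos
    r3 = (x * x + y * y) ** 1.5
    return (-gm * x / r3, -gm * y / r3)

def energy(pos, vel, gm=1.0):
    """Specific orbital energy: kinetic plus potential."""
    return 0.5 * (vel[0] ** 2 + vel[1] ** 2) - gm / math.hypot(pos[0], pos[1])

def leapfrog(pos, vel, dt, steps, gm=1.0):
    """Kick-drift-kick leapfrog, a second-order symplectic scheme."""
    ax, ay = acceleration(pos, gm)
    for _ in range(steps):
        # Half kick with the current acceleration.
        vel = (vel[0] + 0.5 * dt * ax, vel[1] + 0.5 * dt * ay)
        # Full drift with the half-updated velocity.
        pos = (pos[0] + dt * vel[0], pos[1] + dt * vel[1])
        # Second half kick with the acceleration at the new position.
        ax, ay = acceleration(pos, gm)
        vel = (vel[0] + 0.5 * dt * ax, vel[1] + 0.5 * dt * ay)
    return pos, vel
```

Starting from a circular orbit, for example `leapfrog((1.0, 0.0), (0.0, 1.0), 0.01, 10000)`, the value returned by `energy` stays close to its initial value instead of drifting steadily, which is the behaviour the abstract refers to as good energy conservation.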
GPU-based ultra-fast dose calculation using a finite size pencil beam model.
Gu, Xuejun; Choi, Dongju; Men, Chunhua; Pan, Hubert; Majumdar, Amitava; Jiang, Steve B
2009-10-21
Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast dose deposit coefficient calculation is a critical component of the online planning process that is required for plan optimization of intensity-modulated radiation therapy (IMRT). Computer graphics processing units (GPUs) are well suited to provide the requisite fast performance for the data-parallel nature of dose calculation. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation in the case of a water phantom and the case of a prostate cancer patient with varying beamlet and voxel sizes. All testing scenarios achieved speedup ranging from 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a nine-field prostate IMRT plan with this new framework is less than 1 s. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy. PMID:19794244
GPU-based beamformer: fast realization of plane wave compounding and synthetic aperture imaging.
Yiu, Billy Y S; Tsang, Ivan K H; Yu, Alfred C H
2011-08-01
Although they show potential to improve ultrasound image quality, plane wave (PW) compounding and synthetic aperture (SA) imaging are computationally demanding and are known to be challenging to implement in real-time. In this work, we have developed a novel beamformer architecture with the real-time parallel processing capacity needed to enable fast realization of PW compounding and SA imaging. The beamformer hardware comprises an array of graphics processing units (GPUs) that are hosted within the same computer workstation. Their parallel computational resources are controlled by a pixel-based software processor that includes the operations of analytic signal conversion, delay-and-sum beamforming, and recursive compounding as required to generate images from the channel-domain data samples acquired using PW compounding and SA imaging principles. When using two GTX-480 GPUs for beamforming and one GTX-470 GPU for recursive compounding, the beamformer can compute compounded 512 x 255 pixel PW and SA images at throughputs of over 4700 fps and 3000 fps, respectively, for imaging depths of 5 cm and 15 cm (32 receive channels, 40 MHz sampling rate). Its processing capacity can be further increased if additional GPUs or more advanced models of GPU are used. PMID:21859591
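The delay-and-sum step mentioned above can be sketched per pixel as follows. This is a simplified illustration, not the authors' beamformer: it assumes a single 0-degree plane-wave transmission, a linear array at depth zero, nearest-sample lookup with no interpolation or apodization, and it omits the analytic signal conversion and recursive compounding stages. The function name and argument layout are hypothetical.

```python
import math

def das_plane_wave(channel_data, element_x, pixels, c, fs):
    """Delay-and-sum for one 0-degree plane-wave transmission.

    channel_data[e][n]: sample n received on element e (t = 0 at transmit).
    element_x[e]: lateral position of element e in metres (array at depth 0).
    pixels: list of (x, z) image points in metres.
    c: speed of sound (m/s); fs: sampling rate (Hz).
    """
    image = []
    for x, z in pixels:
        t_tx = z / c  # time for the plane wave to reach depth z
        total = 0.0
        for e, xe in enumerate(element_x):
            t_rx = math.hypot(x - xe, z) / c  # echo travel time back to element e
            n = int(round((t_tx + t_rx) * fs))  # nearest sample index
            if 0 <= n < len(channel_data[e]):
                total += channel_data[e][n]
        image.append(total)
    return image
```

Each pixel is computed independently from the channel data, so the outer loop maps naturally onto one GPU thread per pixel, which is presumably what makes a pixel-based formulation attractive for this hardware.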
Multi-core CPU or GPU-accelerated Multiscale Modeling for Biomolecular Complexes
Liao, Tao; Zhang, Yongjie; Kekenes-Huskey, Peter M.; Cheng, Yuhui; Michailova, Anushka; McCulloch, Andrew D.; Holst, Michael; McCammon, J. Andrew
2013-01-01
Multi-scale modeling plays an important role in understanding the structure and biological functionalities of large biomolecular complexes. In this paper, we present an efficient computational framework to construct multi-scale models from atomic resolution data in the Protein Data Bank (PDB), which is accelerated by multi-core CPU and programmable Graphics Processing Units (GPU). A multi-level summation of Gaussian kernel functions is employed to generate implicit models for biomolecules. The coefficients in the summation are designed as functions of the structure indices, which specify the structures at a certain level and enable a local resolution control on the biomolecular surface. A method called neighboring search is adopted to locate the grid points close to the expected biomolecular surface, and reduce the number of grids to be analyzed. For a specific grid point, a KD-tree or bounding volume hierarchy is applied to search for the atoms contributing to its density computation, and faraway atoms are ignored due to the decay of Gaussian kernel functions. In addition to density map construction, three modes are also employed and compared during mesh generation and quality improvement to generate high quality tetrahedral meshes: CPU sequential, multi-core CPU parallel and GPU parallel. We have applied our algorithm to several large proteins and obtained good results. PMID:24352481
NASA Astrophysics Data System (ADS)
Ma, C. Y.; Zhao, J. M.; Liu, L. H.; Zhang, L.; Li, X. C.; Jiang, B. C.
2016-03-01
Inverse identification of radiative properties of participating media is usually time consuming. In this paper, a GPU accelerated inverse identification model is presented to obtain the radiative properties of particle suspensions. The sample medium is placed in a cuvette and a narrow light beam is irradiated normally from the side. The forward three-dimensional radiative transfer problem is solved using a massively parallel Monte Carlo method implemented on a graphics processing unit (GPU), and a particle swarm optimization algorithm is applied to inversely identify the radiative properties of particle suspensions based on the measured bidirectional scattering distribution function (BSDF). The GPU-accelerated Monte Carlo simulation significantly reduces the solution time of the radiative transfer simulation and hence greatly accelerates the inverse identification process. A speedup of hundreds of times is achieved compared to the CPU implementation. It is demonstrated using both simulated BSDF and experimentally measured BSDF of microalgae suspensions that the radiative properties of particle suspensions can be effectively identified based on the GPU-accelerated algorithm with three-dimensional radiative transfer modelling.
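The inverse step can be pictured as a particle swarm search over candidate property values, where each evaluation of the objective would call the forward Monte Carlo solver and compare against the measured BSDF. The sketch below is a generic global-best particle swarm optimizer, not the authors' code; the coefficient values are commonly used defaults, and `objective` stands in for the (much more expensive) forward-model misfit.

```python
import random

def pso(objective, bounds, n_particles=30, n_iters=200, seed=0):
    """Minimal global-best particle swarm optimizer (illustrative only).

    objective: maps a list of parameters to a scalar misfit.
    bounds: list of (low, high) pairs, one per parameter.
    """
    rng = random.Random(seed)
    w, c1, c2 = 0.7298, 1.4962, 1.4962  # inertia and acceleration coefficients (assumed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Because every iteration evaluates the objective once per particle, the cost of the forward simulation dominates the run time, which is consistent with the abstract's point that accelerating the Monte Carlo solver accelerates the whole identification.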
NASA Astrophysics Data System (ADS)
Gutzwiller, David; Gontier, Mathieu; Demeulenaere, Alain
2014-11-01
Multi-Block structured solvers hold many advantages over their unstructured counterparts, such as a smaller memory footprint and efficient serial performance. Historically, multi-block structured solvers have not been easily adapted for use in a High Performance Computing (HPC) environment, and the recent trend towards hybrid GPU/CPU architectures has further complicated the situation. This paper will elaborate on developments and innovations applied to the NUMECA FINE/Turbo solver that have allowed near-linear scalability with real-world problems on over 250 hybrid GPU/CPU cluster nodes. Discussion will focus on the implementation of virtual partitioning and load balancing algorithms using a novel meta-block concept. This implementation is transparent to the user, allowing all pre- and post-processing steps to be performed using a simple, unpartitioned grid topology. Additional discussion will elaborate on developments that have improved parallel performance, including fully parallel I/O with the ADIOS API and the GPU porting of the computationally heavy CPUBooster convergence acceleration module. Head of HPC and Release Management, Numeca International.
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications. PMID:26580169
Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport.
Jia, Xun; Gu, Xuejun; Sempau, Josep; Choi, Dongju; Majumdar, Amitava; Jiang, Steve B
2010-06-01
Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy. Its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development on a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. We have implemented the dose planning method (DPM) Monte Carlo dose calculation package (Sempau et al 2000 Phys. Med. Biol. 45 2263-91) on the GPU architecture under the CUDA platform. The implementation has been tested with respect to the original sequential DPM code on the CPU in phantoms with water-lung-water or water-bone-water slab geometry. A 20 MeV mono-energetic electron point source or a 6 MV photon point source is used in our validation. The results demonstrate adequate accuracy of our GPU implementation for both electron and photon beams in the radiotherapy energy range. Speed-up factors of about 5.0-6.6 times have been observed, using an NVIDIA Tesla C1060 GPU card against a 2.27 GHz Intel Xeon CPU processor. PMID:20463376
Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2015-12-01
We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs. PMID:26836656
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun
2015-10-01
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
GPU-FS-kNN: A Software Tool for Fast and Scalable kNN Computation Using GPUs
Arefin, Ahmed Shamsul; Riveros, Carlos; Berretta, Regina; Moscato, Pablo
2012-01-01
Background: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limited internal memory of the device can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. Results: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is very simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50–60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. Conclusion: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/. PMID:22937144
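The partitioning idea can be illustrated with a brute-force kNN search that only ever compares against one chunk of candidate points at a time and keeps a running list of the k best neighbours. This is a serial CPU sketch under assumed names (`knn_chunked`, `chunk_size`), not the GPU-FS-kNN implementation; on a GPU, each chunk would correspond to a block of the distance matrix small enough to fit in device memory.

```python
import heapq

def knn_chunked(points, k, chunk_size):
    """Brute-force kNN for every point, scanning candidates chunk by chunk.

    Returns, for each point, a sorted list of (squared_distance, index)
    pairs for its k nearest neighbours (the point itself is excluded).
    """
    n = len(points)
    best = [[] for _ in range(n)]
    for start in range(0, n, chunk_size):
        chunk = range(start, min(start + chunk_size, n))
        for i, p in enumerate(points):
            cand = [
                (sum((a - b) ** 2 for a, b in zip(p, points[j])), j)
                for j in chunk if j != i
            ]
            # Merge this chunk's candidates with the running k best.
            best[i] = heapq.nsmallest(k, best[i] + cand)
    return best
```

The result does not depend on `chunk_size`, so the chunk size can be chosen purely from the available memory; only k entries per query point need to persist between chunks.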
NASA Astrophysics Data System (ADS)
Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
2016-02-01
In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases porting a traditional CPU implementation to a GPU is a non-trivial task that requires thorough refactoring of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. With this technology, the CPU code essentially only has to be augmented with a few compiler directives that identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as a case study. We implemented the flow accumulation step of this algorithm on the CPU, with OpenACC, and in two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We report that although OpenACC cannot match the performance of an optimized CUDA implementation (3.5× slower on average), it provides a significant performance improvement over a CPU implementation (2-6×) with far simpler code and less implementation effort.
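For readers unfamiliar with D8, the following serial sketch shows the flow accumulation step on a small elevation grid: each cell drains to whichever of its eight neighbours gives the steepest downward slope, and a cell's accumulation counts itself plus everything upstream. Ordering the cells by elevation is a simplification chosen for this example, and the function name is hypothetical; it is not one of the CPU, OpenACC or CUDA implementations compared in the paper.

```python
def d8_flow_accumulation(dem):
    """Flow accumulation with D8 directions on a 2D list of elevations."""
    rows, cols = len(dem), len(dem[0])
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]

    def downstream(r, c):
        best, best_slope = None, 0.0
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                # Diagonal neighbours are sqrt(2) cell widths away.
                dist = 2 ** 0.5 if dr and dc else 1.0
                slope = (dem[r][c] - dem[rr][cc]) / dist
                if slope > best_slope:
                    best, best_slope = (rr, cc), slope
        return best  # None for pits and flat cells

    acc = [[1] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so that a cell's total is final
    # before it is passed on to its downstream neighbour.
    cells = sorted(((dem[r][c], r, c)
                    for r in range(rows) for c in range(cols)), reverse=True)
    for _, r, c in cells:
        target = downstream(r, c)
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc
```

The direction computation is independent per cell and parallelizes trivially, whereas the accumulation pass carries dependencies along flow paths; that dependency is the part a directive-based or CUDA port has to restructure.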
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system
Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G.
2014-12-15
Purpose: Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. Methods: An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. Results: For relatively large and complex three-field head and neck cases, i.e., >100 000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. Conclusions: A MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45
GPU implementation for three-dimensional mantle convection at high Rayleigh number
NASA Astrophysics Data System (ADS)
Barnett, G. A.; Wright, G. B.; Yuen, D. A.
2009-12-01
The last decade has seen the strong influence exerted by the gaming industry on high-performance scientific computing since the landmark year of 2003, when the speed of floating point operations for GPUs surpassed that of CPUs. Since then, the progress of GPUs has been astounding because of the development of faster components with many flow processing units (cores) and larger memories (240 cores and 4 Gbytes with the NVIDIA Tesla 1060). With GPUs it is feasible to have potential computing power of around one Teraflop in an office environment for around $5000. In this study we demonstrate the enormous capabilities GPUs offer for certain geophysical fluid dynamics applications. We focus our attention on high Rayleigh number three-dimensional mantle convection (i.e. Rayleigh-Bénard convection in the infinite Prandtl number limit) in a rectangular box. The model equations are taken from Larsen et al (1997), where a constant viscosity has been assumed, which allows the momentum equations to be decomposed into two coupled Poisson equations involving a scalar potential for computing the velocity. These equations together with the spatial derivatives in the energy equation are approximated with second order finite differences. The whole system is advanced forward in time with an explicit, third order accurate Runge-Kutta scheme, which allows for variable time-stepping. We discuss the method with specific attention given to the solution of the two Poisson equations, which are solved directly with a Fourier transform-based algorithm, e.g. Hockney (1965). We compare the implementations of this algorithm and its variations on the GPU vs the CPU and demonstrate how the choice depends on the architecture. We report the results from 3D mantle convection simulations run on a single GPU with Rayleigh number up to 10^7 and grids of size up to 512 × 512 × 256 (see figure below). A comparison of the computational time for the GPU code shows a speed up of more than 10 times over the
Source parameter inversion of compound earthquakes on GPU/CPU hybrid platform
NASA Astrophysics Data System (ADS)
Wang, Y.; Ni, S.; Chen, W.
2012-12-01
Determining the source parameters of earthquakes is an essential problem in seismology. Accurate and timely determination of earthquake parameters (such as moment, depth, and the strike, dip and rake of fault planes) is significant for both rupture dynamics and ground motion prediction or simulation. The study of the rupture process, especially for moderate and large earthquakes, is essential, as more detailed kinematic studies have become routine work for seismologists. However, some of these events behave very unusually and intrigue seismologists. These earthquakes usually consist of two sub-events of similar size that occur within a very short time interval, such as the mb 4.5 event of Dec. 9, 2003 in Virginia. Studying these special events, including determining the source parameters of each sub-event, will help in understanding earthquake dynamics. However, the seismic signals of the two distinct sources are mixed, which makes inversion difficult. For common events, the Cut and Paste (CAP) method, which jointly uses body waves and surface waves with independent time shifts and weights, has been proven effective for resolving source parameters. CAP can resolve fault orientation and focal depth using a grid search algorithm. Based on this method, we developed an algorithm (MUL_CAP) to simultaneously acquire the parameters of two distinct events. However, the simultaneous inversion of both sub-events makes the computation very time consuming, so we developed a hybrid GPU and CPU version of CAP (HYBRID_CAP) to improve the computational efficiency. Thanks to the advantages of multi-dimensional storage and processing on the GPU, we obtain excellent performance of the revised code on the combined GPU-CPU architecture, and the speedup factors can be as high as 40x-90x compared to classical CAP on a traditional CPU architecture. As a benchmark, we take synthetics as observations and invert the source parameters of two given sub-events, and the inversion results are very consistent with the
GPU accelerated Monte-Carlo simulation of SEM images for metrology
NASA Astrophysics Data System (ADS)
Verduin, T.; Lokhorst, S. R.; Hagen, C. W.
2016-03-01
In this work we address the computation times of numerical studies in dimensional metrology. In particular, full Monte-Carlo simulation programs for scanning electron microscopy (SEM) image acquisition are known to be notoriously slow. Our quest in reducing the computation time of SEM image simulation has led us to investigate the use of graphics processing units (GPUs) for metrology. We have succeeded in creating a full Monte-Carlo simulation program for SEM images, which runs entirely on a GPU. The physical scattering models of this GPU simulator are identical to a previous CPU-based simulator, which includes the dielectric function model for inelastic scattering and also refinements for low-voltage SEM applications. As a case study for the performance, we considered the simulated exposure of a complex feature: an isolated silicon line with rough sidewalls located on a flat silicon substrate. The surface of the rough feature is decomposed into 408 012 triangles. We have used an exposure dose of 6 mC/cm², which corresponds to 6 553 600 primary electrons on average (Poisson distributed). We repeat the simulation for various primary electron energies, 300 eV, 500 eV, 800 eV, 1 keV, 3 keV and 5 keV. At first we run the simulation on a GeForce GTX480 from NVIDIA. The very same simulation is duplicated on our CPU-based program, for which we have used an Intel Xeon X5650. Apart from statistics in the simulation, no difference is found between the CPU and GPU simulated results. The GTX480 generates the images (depending on the primary electron energy) 350 to 425 times faster than a single threaded Intel X5650 CPU. Although this is a tremendous speedup, we actually have not reached the maximum throughput because of the limited amount of available memory on the GTX480. Nevertheless, the speedup enables the fast acquisition of simulated SEM images for metrology. We now have the potential to investigate case studies in CD-SEM metrology, which otherwise would take unreasonable
Fast GPU-based absolute intensity determination for energy-dispersive X-ray Laue diffraction
NASA Astrophysics Data System (ADS)
Alghabi, F.; Send, S.; Schipper, U.; Abboud, A.; Pietsch, U.; Kolb, A.
2016-01-01
This paper presents a novel method for fast determination of absolute intensities in the sites of Laue spots generated by a tetragonal hen egg-white lysozyme crystal after exposure to white synchrotron radiation during an energy-dispersive X-ray Laue diffraction experiment. The Laue spots are taken by means of an energy-dispersive X-ray 2D pnCCD detector. Current pnCCD detectors have a spatial resolution of 384 × 384 pixels of size 75 × 75 μm2 each and operate at a maximum of 400 Hz. Future devices are going to have higher spatial resolution and frame rates. The proposed method runs on a computer equipped with multiple Graphics Processing Units (GPUs) which provide fast and parallel processing capabilities. Accordingly, our GPU-based algorithm exploits these capabilities to further analyse the Laue spots of the sample. The main contribution of the paper is therefore an alternative algorithm for determining absolute intensities of Laue spots which are themselves computed from a sequence of pnCCD frames. Moreover, a new method for integrating spectral peak intensities and improved background correction, a different way of calculating mean count rate of the background signal and also a new method for n-dimensional Poisson fitting are presented. We present a comparison of the quality of results from the GPU-based algorithm with the quality of results from a prior (base) algorithm running on CPU. This comparison shows that our algorithm is able to produce results with at least the same quality as the base algorithm. Furthermore, the GPU-based algorithm is able to speed up one of the most time-consuming parts of the base algorithm, which is n-dimensional Poisson fitting, by a factor of more than 3. Also, the entire procedure of extracting Laue spots' positions, energies and absolute intensities from a raw dataset of pnCCD frames is accelerated by a factor of more than 3.
GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda
2014-01-01
Background Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results Speeds over 5.36 Giga Cell Updates Per Second (GCUPs) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which is evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original mi
NASA Astrophysics Data System (ADS)
Song, Changhe; Li, Yunsong; Huang, Bormin
2011-10-01
The discrete wavelet transform (DWT)-based Set Partitioning in Hierarchical Trees (SPIHT) algorithm is widely used in many image compression systems. In order to perform real-time Reed-Solomon channel decoding and SPIHT+DWT source decoding on a massive bit stream of compressed images continuously down-linked from the satellite, we propose a novel graphic processing unit (GPU)-accelerated decoding system. In this system the GPU is used to compute the time-consuming inverse DWT, while multiple CPU threads are run in parallel for the remaining part of the system. Both CPU and GPU parts were carefully designed to have approximately the same processing speed to obtain the maximum throughput via a novel pipeline structure for processing continuous satellite images. Through the pipelined CPU and GPU heterogeneous computing, the entire decoding system approaches a speedup of 84x as compared to its single-threaded CPU counterpart.
GPU-accelerated real-time IR smoke screen simulation and assessment of its obscuration
NASA Astrophysics Data System (ADS)
Wu, Xin; Zhang, Jian-qi; Huang, Xi; Liu, De-lian
2012-01-01
With the growing demand for Battlefield Environment Simulation (BES), the IR smoke screen, which is computationally expensive and absolutely indispensable, should be modeled true to life and correct in its thermal radiation characteristics. This paper analyzes the features of an IR smoke screen and presents an IR smoke screen model based on light extinction, particle dispersion and temperature attenuation, which is calculated on the GPU and rendered to screen in real time. Thus a method that is both true to life in appearance and real-time in performance is presented. Additionally, the simulated results are compared with measured data to verify the correctness of the smoke screen's obscuration, which illustrates the effect of its interference feature in an infrared scene.
Free energy simulations with the AMOEBA polarizable force field and metadynamics on GPU platform.
Peng, Xiangda; Zhang, Yuebin; Chu, Huiying; Li, Guohui
2016-03-01
The free energy calculation library PLUMED has been incorporated into the OpenMM simulation toolkit in order to perform enhanced sampling MD simulations using the AMOEBA polarizable force field on the GPU platform. Two examples, (I) the free energy profile of water pair separation and (II) the alanine dipeptide dihedral angle free energy surface in explicit solvent, are provided here to demonstrate the accuracy and efficiency of our implementation. The converged free energy profiles could be obtained within an affordable MD simulation time when the AMOEBA polarizable force field is employed. Moreover, the free energy surfaces estimated using the AMOEBA polarizable force field are in agreement with those calculated from experimental data and ab initio methods. Hence, the implementation in this work is reliable and could be used to study more complicated biological phenomena in both an accurate and efficient way. © 2015 Wiley Periodicals, Inc. PMID:26493154
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-01-01
Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686
Innovative approaches to training and qualifying plant personnel at GPU Nuclear
Coe, R.P.
1994-12-31
For the past 10 yr, technical training programs at GPU Nuclear (GPUN) have been highly successful in the training and qualifying of nuclear station personnel at its Oyster Creek and Three Mile Island sites. The programs have received accreditation by the National Academy for Nuclear Training and have been successfully reviewed and approved by the US Nuclear Regulatory Commission, American Nuclear Insurers, and various internal oversight groups. Over the past several years, the training and education department at GPUN has attempted to make learning more interesting and meaningful through a series of innovative approaches. Student feedback has been very positive. Plant management has seen an increase in productivity as key learning tasks are reinforced. Failure rates are dwindling, and retention rates appear to be improving. Equally important, instructors are becoming more and more comfortable with these approaches and the increased quality in classroom and laboratory training.
Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters
NASA Astrophysics Data System (ADS)
Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.
2014-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.
Atomic Detail Visualization of Photosynthetic Membranes with GPU-Accelerated Ray Tracing
Vandivort, Kirby L.; Barragan, Angela; Singharoy, Abhishek; Teo, Ivan; Ribeiro, João V.; Isralewitz, Barry; Liu, Bo; Goh, Boon Chong; Phillips, James C.; MacGregor-Chatwin, Craig; Johnson, Matthew P.; Kourkoutis, Lena F.; Hunter, C. Neil
2016-01-01
The cellular process responsible for providing energy for most life on Earth, namely photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers. PMID:27274603
NASA Astrophysics Data System (ADS)
Triana-Martinez, J.; Orjuela-Vargas, S. A.; Philips, W.
2013-03-01
This paper compares the speed performance of a set of classic image algorithms for evaluating texture in images by using CUDA programming. We include a summary of the general programming model of CUDA. We select a set of texture algorithms, based on statistical analysis, that allow the use of repetitive functions, such as the Co-occurrence Matrix, Haralick features and local binary patterns techniques. The memory allocation time between the host and device memory is not taken into account. The results of this approach show a comparison of the texture algorithms in terms of speed when executed on CPU and GPU processors. The comparison shows that the algorithms can be accelerated more than 40 times when implemented using the CUDA environment.
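To make the texture statistics named in this abstract concrete, the following is a minimal CPU-side sketch in Python of a gray-level co-occurrence matrix and one Haralick feature (contrast). It is not the authors' CUDA code; the function names, the pixel offset, and the tiny test image are assumptions chosen for illustration only.

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy).

    `image` is a list of rows of integer gray levels in [0, levels).
    The offset (dx, dy) is an illustrative choice, not taken from the paper.
    """
    glcm = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            # Skip pixel pairs whose neighbour falls outside the image.
            if 0 <= ny < rows and 0 <= nx < cols:
                glcm[image[y][x]][image[ny][nx]] += 1
    return glcm


def contrast(glcm):
    """Haralick contrast: sum of p(i, j) * (i - j)^2 over the normalized matrix."""
    total = sum(sum(row) for row in glcm)
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(glcm)
        for j, count in enumerate(row)
    )


# Hypothetical 4x4 image with four gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
glcm = cooccurrence_matrix(image, levels=4)
print(glcm)
print(contrast(glcm))
```

Each pixel pair contributes independently to the counts, which is the kind of repetitive per-pixel work that maps naturally onto GPU threads; a parallel version would additionally need atomic increments or per-block partial histograms.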
Aerodynamic optimization of supersonic compressor cascade using differential evolution on GPU
NASA Astrophysics Data System (ADS)
Aissa, Mohamed Hasanine; Verstraete, Tom; Vuik, Cornelis
2016-06-01
Differential Evolution (DE) is a powerful stochastic optimization method. Compared to gradient-based algorithms, DE is able to avoid local minima but at the same time requires more function evaluations. In turbomachinery applications, function evaluations are performed with time-consuming CFD simulations, which results in a long, unaffordable design cycle. Modern High Performance Computing systems, especially Graphics Processing Units (GPUs), are able to alleviate this inconvenience by accelerating the design evaluation itself. In this work we present a validated CFD solver running on GPUs, able to accelerate the design evaluation and thus the entire design process. An achieved speedup of 20x to 30x enabled the DE algorithm to run on a high-end computer instead of a costly large cluster. The GPU-enhanced DE was used to optimize the aerodynamics of a supersonic compressor cascade, achieving an aerodynamic loss minimization of 20%.
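For readers unfamiliar with Differential Evolution, the sketch below shows the common DE/rand/1/bin variant in plain Python. It is a generic illustration, not the authors' implementation: the control parameters (`f`, `cr`, `pop_size`, `generations`) and the `sphere` test function, which stands in for an expensive CFD evaluation, are assumptions.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` over box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # at least one mutated component
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    # Mutation: base vector plus scaled difference vector.
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clamp to the search box
                else:
                    v = pop[i][d]  # binomial crossover keeps the old value
                trial.append(v)
            trial_cost = objective(trial)  # one function evaluation per trial
            if trial_cost <= cost[i]:      # greedy selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


def sphere(x):
    """Cheap stand-in for an expensive design evaluation."""
    return sum(v * v for v in x)


best_x, best_cost = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
print(best_x, best_cost)
```

Every trial vector costs one call to `objective`, so with these settings the run performs `pop_size * (generations + 1)` evaluations. When each evaluation is a CFD simulation, that count dominates the runtime, which is why accelerating the evaluation itself shortens the whole design cycle.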
GPU-Accelerated PIC/MCC Simulation of Laser-Plasma Interaction Using BUMBLEBEE
NASA Astrophysics Data System (ADS)
Jin, Xiaolin; Huang, Tao; Chen, Wenlong; Wu, Huidong; Tang, Maowen; Li, Bin
2015-11-01
The research of laser-plasma interaction in its wide applications relies on the use of advanced numerical simulation tools to achieve high performance operation while reducing computational time and cost. BUMBLEBEE has been developed to be a fast simulation tool used in the research of laser-plasma interactions. BUMBLEBEE uses a 1D3V electromagnetic PIC/MCC algorithm that is accelerated by using high performance Graphics Processing Unit (GPU) hardware. BUMBLEBEE includes a friendly user-interface module and four physics simulators. The user-interface provides a powerful solid-modeling front end and graphical and computational post processing functionality. The solver of BUMBLEBEE has four modules for now, which are used to simulate the field ionization, electron collisional ionization, binary coulomb collision and laser-plasma interaction processes. The ionization characteristics of laser-neutral interaction and the generation of high-energy electrons have been analyzed by using BUMBLEBEE for validation.
Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I
2015-08-01
This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of the current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048² and 4096² at interactive and semi-interactive frame rates using an NVIDIA GeForce 970 GTX device. PMID:26737231
Performance improvements of differential operators code for MPS method on GPU
NASA Astrophysics Data System (ADS)
Murotani, Kohei; Masaie, Issei; Matsunaga, Takuya; Koshizuka, Seiichi; Shioya, Ryuji; Ogino, Masao; Fujisawa, Toshimitsu
2015-09-01
In the present study, performance improvements of the particle search and particle interaction calculation steps constituting the performance bottleneck in the moving particle simulation method are achieved by developing GPU-compatible algorithms for many core processor architectures. In the improvements of particle search, bucket loops of the cell-linked list are changed to a loop structure having fewer local variables and the linked list and the forward star of particle search algorithms within a bucket are compared. In the particle interaction calculation, the problem of the ratio of particles within the interaction domain to the neighboring particle candidates being quite low is improved. By these improvements, a performance efficiency of 24.7 % can be achieved for the first-order polynomial approximation scheme using NVIDIA Tesla K20, CUDA-6.5, and double-precision floating-point operations.
Peker, Musa; Şen, Baha; Gürüler, Hüseyin
2015-02-01
The effect of anesthesia on the patient is referred to as depth of anesthesia. Rapid classification of the appropriate depth level of anesthesia is a matter of great importance in surgical operations. Similarly, accelerating classification algorithms is important for the rapid solution of problems in the field of biomedical signal processing. However, numerous time-consuming mathematical operations are required in the training and testing stages of classification algorithms, especially in neural networks. In this study, to accelerate the process, a parallel programming and computing platform (Nvidia CUDA), which facilitates dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU), was utilized. The system was employed to detect the anesthetic depth level on a related electroencephalogram (EEG) data set. This data set is rather complex and large. Moreover, achieving more anesthetic levels with rapid response is critical in anesthesia. The proposed parallelization method yielded highly accurate classification results in a shorter time. PMID:25650073
A GPU accelerated moving mesh correspondence algorithm with applications to RV segmentation.
Punithakumar, Kumaradevan; Noga, Michelle; Boulanger, Pierre
2015-08-01
This study proposes a parallel nonrigid registration algorithm to obtain point correspondence between a sequence of images. Several recent studies have shown that computation of point correspondence is an excellent way to delineate organs from a sequence of images, for example, delineation of the cardiac right ventricle (RV) from a series of magnetic resonance (MR) images. However, nonrigid registration algorithms involve optimization of similarity functions and are therefore computationally expensive. We propose Graphics Processing Unit (GPU) computing to accelerate the algorithm. The proposed approach consists of two parallelization components: 1) a parallel Compute Unified Device Architecture (CUDA) version of the non-rigid registration algorithm; and 2) application of an image concatenation approach to further parallelize the algorithm. The proposed approach was evaluated over a data set of 16 subjects and took an average of 4.36 seconds to segment a sequence of 19 MR images, a significant performance improvement over the serial image registration approach. PMID:26737222
Modelling Nonlinear Dynamic Textures using Hybrid DWT-DCT and Kernel PCA with GPU
NASA Astrophysics Data System (ADS)
Ghadekar, Premanand Pralhad; Chopade, Nilkanth Bhikaji
2016-06-01
Most real-world dynamic textures are nonlinear, non-stationary, and irregular. Nonlinear motion also has some repetition of motion, but it exhibits high variation, stochasticity, and randomness. Hybrid DWT-DCT and Kernel Principal Component Analysis (KPCA) with YCbCr/YIQ colour coding using the Dynamic Texture Unit (DTU) approach is proposed to model a nonlinear dynamic texture, which provides better results than state-of-the-art methods in terms of PSNR, compression ratio, model coefficients, and model size. The dynamic texture is decomposed into DTUs as they help to extract temporal self-similarity. Hybrid DWT-DCT is used to extract spatial redundancy. YCbCr/YIQ colour encoding is performed to capture chromatic correlation. KPCA is applied to capture nonlinear motion. Further, the proposed algorithm is implemented on a Graphics Processing Unit (GPU), which comprises hundreds of small processors, to decrease time complexity and to achieve parallelism.
Accelerating image registration of MRI by GPU-based parallel computation.
Huang, Teng-Yi; Tang, Yu-Wei; Ju, Shiun-Ying
2011-06-01
Automatic image registration for MRI applications generally requires many iteration loops and is, therefore, a time-consuming task. This drawback prolongs data analysis and delays the workflow of clinical routines. Recent advances in the massively parallel computation of graphic processing units (GPUs) may be a solution to this problem. This study proposes a method to accelerate registration calculations, especially for the popular statistical parametric mapping (SPM) system. This study reimplemented the image registration of SPM system to achieve an approximately 14-fold increase in speed in registering single-modality intrasubject data sets. The proposed program is fully compatible with SPM, allowing the user to simply replace the original image registration library of SPM to gain the benefit of the computation power provided by commodity graphic processors. In conclusion, the GPU computation method is a practical way to accelerate automatic image registration. This technology promises a broader scope of application in the field of image registration. PMID:21531103
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Ronald Babich, Michael Clark, Balint Joo
2010-11-01
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
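The abstract above distinguishes weak from strong scaling. As a reminder of the usual definitions, the short Python sketch below computes both parallel efficiencies. The timings are hypothetical values chosen for illustration and are not results from the paper.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: the ideal time on n devices is t1 / n."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Fixed problem size per device: the ideal time on n devices stays t1."""
    return t1 / tn


# Hypothetical wall-clock times in seconds, NOT taken from the paper.
strong = strong_scaling_efficiency(t1=64.0, tn=2.5, n=32)
weak = weak_scaling_efficiency(t1=10.0, tn=12.5)
print(strong, weak)
```

Strong scaling is the harder test for a communication-bound solver, because the work per GPU shrinks as devices are added while the boundary data exchanged between neighbours does not shrink as fast. That is the situation in which overlapping communication with computation pays off.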
Ultra-fast hybrid CPU-GPU multiple scatter simulation for 3-D PET.
Kim, Kyung Sang; Son, Young Don; Cho, Zang Hee; Ra, Jong Beom; Ye, Jong Chul
2014-01-01
Scatter correction is very important in 3-D PET reconstruction due to a large scatter contribution in measurements. Currently, one of the most popular methods is the so-called single scatter simulation (SSS), which considers single Compton scattering contributions from many randomly distributed scatter points. The SSS enables a fast calculation of scattering with a relatively high accuracy; however, the accuracy of SSS is dependent on the accuracy of tail fitting to find a correct scaling factor, which is often difficult in low photon count measurements. To overcome this drawback as well as to improve accuracy of scatter estimation by incorporating multiple scattering contribution, we propose a multiple scatter simulation (MSS) based on a simplified Monte Carlo (MC) simulation that considers photon migration and interactions due to photoelectric absorption and Compton scattering. Unlike the SSS, the MSS calculates a scaling factor by comparing simulated prompt data with the measured data in the whole volume, which enables a more robust estimation of a scaling factor. Even though the proposed MSS is based on MC, a significant acceleration of the computational time is possible by using a virtual detector array with a larger pitch, exploiting the fact that the scatter distribution varies slowly in the spatial domain. Furthermore, our MSS implementation is well suited to a parallel implementation on a graphics processing unit (GPU). In particular, we exploit a hybrid CPU-GPU technique using the open multiprocessing and the compute unified device architecture, which is 128.3 times faster than using a single CPU. Overall, the computational time of MSS is 9.4 s for a high-resolution research tomograph (HRRT) system. The performance of the proposed MSS is validated through actual experiments using an HRRT. PMID:24403412
High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms
Teodoro, George; Pan, Tony; Kurc, Tahsin M.; Kong, Jun; Cooper, Lee A. D.; Podhorszki, Norbert; Klasky, Scott; Saltz, Joel H.
2014-01-01
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system. PMID:25419546
GPU-based iterative cone-beam CT reconstruction using tight frame regularization
NASA Astrophysics Data System (ADS)
Jia, Xun; Dong, Bin; Lou, Yifei; Jiang, Steve B.
2011-07-01
The x-ray imaging dose from serial cone-beam computed tomography (CBCT) scans raises a clinical concern in most image-guided radiation therapy procedures. It is the goal of this paper to develop a fast graphic processing unit (GPU)-based algorithm to reconstruct high-quality CBCT images from undersampled and noisy projection data so as to lower the imaging dose. For this purpose, we have developed an iterative tight-frame (TF)-based CBCT reconstruction algorithm. A condition that a real CBCT image has a sparse representation under a TF basis is imposed in the iteration process as regularization to the solution. To speed up the computation, a multi-grid method is employed. Our GPU implementation has achieved high computational efficiency and a CBCT image of resolution 512 × 512 × 70 can be reconstructed in ~5 min. We have tested our algorithm on a digital NCAT phantom and a physical Catphan phantom. It is found that our TF-based algorithm is able to reconstruct CBCT in the context of undersampling and low mAs levels. We have also quantitatively analyzed the reconstructed CBCT image quality in terms of the modulation-transfer function and contrast-to-noise ratio under various scanning conditions. The results confirm the high CBCT image quality obtained from our TF algorithm. Moreover, our algorithm has also been validated in a real clinical context using a head-and-neck patient case. Comparisons of the developed TF algorithm and the current state-of-the-art TV algorithm have also been made in various cases studied in terms of reconstructed image quality and computation efficiency.
High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms.
Teodoro, George; Pan, Tony; Kurc, Tahsin M; Kong, Jun; Cooper, Lee A D; Podhorszki, Norbert; Klasky, Scott; Saltz, Joel H
2013-05-01
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system. PMID:25419546
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2010-08-01
Automatic target and anomaly detection are considered very important tasks for hyperspectral data exploitation. These techniques are now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly and target detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we describe several new GPU-based implementations of target and anomaly detection algorithms for hyperspectral data exploitation. The parallel algorithms are implemented on latest-generation Tesla C1060 GPU architectures, and quantitatively evaluated using hyperspectral data collected by NASA's AVIRIS system over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex.
Advanced noise reduction in placental ultrasound imaging using CPU and GPU: a comparative study
NASA Astrophysics Data System (ADS)
Zombori, G.; Ryan, J.; McAuliffe, F.; Rainford, L.; Moran, M.; Brennan, P.
2010-03-01
This paper presents a comparison of different implementations of the 3D anisotropic diffusion speckle noise reduction technique on ultrasound images. In this project we are developing a novel volumetric calcification assessment metric for the placenta, and providing a software tool for this purpose. The tool can also automatically segment and visualize (in 3D) ultrasound data. One of the first steps when developing such a tool is to find a fast and efficient way to eliminate speckle noise. Previous works on this topic by Duan, Q. [1] and Sun, Q. [2] have proven that the 3D noise reducing anisotropic diffusion (3D SRAD) method shows exceptional performance in enhancing ultrasound images for object segmentation. Therefore we have implemented this method in our software application and performed a comparative study on the different variants in terms of performance and computation time. To increase processing speed it was necessary to utilize the full potential of current state-of-the-art Graphics Processing Units (GPUs). Our 3D datasets are represented in a spherical volume format. With the aim of 2D slice visualization and segmentation, a "scan conversion" or "slice-reconstruction" step is needed, which includes coordinate transformation from spherical to Cartesian, re-sampling of the volume and interpolation. By combining the noise filtering and slice reconstruction in one process on the GPU, we can achieve close to real-time operation on high quality data sets without the need for down-sampling or reducing image quality. The GPU code was written in OpenCL, so the presented solution is fully portable.
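The "scan conversion" step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' OpenCL code: the angle convention (polar angle measured from +z), the `volume[ir][itheta][iphi]` layout, the function names and the nearest-neighbour lookup (in place of the interpolation the abstract mentions) are all assumptions made for illustration.

```python
import math

def cartesian_to_spherical(x, y, z):
    """Return (r, theta, phi): radius, polar angle from +z, azimuth in the x-y plane."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0.0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi

def scan_convert_slice(volume, dr, dtheta, dphi, xs, ys, z, fill=0.0):
    """Resample a spherical volume onto the Cartesian slice at height z.

    volume[ir][itheta][iphi] is a hypothetical layout; each output pixel takes
    the nearest spherical sample, or `fill` if it lies outside the volume.
    """
    n_r, n_theta, n_phi = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for y in ys:
        row = []
        for x in xs:
            r, theta, phi = cartesian_to_spherical(x, y, z)
            ir = round(r / dr)
            it = round(theta / dtheta)
            ip = round((phi % (2.0 * math.pi)) / dphi) % n_phi  # wrap azimuth
            inside = ir < n_r and it < n_theta
            row.append(volume[ir][it][ip] if inside else fill)
        out.append(row)
    return out
```

Every output pixel is computed independently of the others, which is why this kind of resampling maps naturally onto one GPU work-item per pixel.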
Use of GPU Computing to Study Coupled Deformation and Fluid Flow in Porous Rocks
NASA Astrophysics Data System (ADS)
Räss, Ludovic; Omlin, Samuel; Simon, Nina; Podladchikov, Yuri
2015-04-01
Current challenges in computational geodynamics place high demands on the development of new coupled models. These need to solve accurate physics, at high resolution and in reasonable computation time. Multi-scale problems such as deformation of porous rocks triggered by fluid flow require both high temporal and spatial resolution. The resulting preferential flow paths involve complex physics and a strong coupling between deformation and fluid flow processes. Shortcuts such as sequential or iterative coupling of two existing solvers will not be sufficient in these difficult cases to localize the deformation and flow. We base our numerical implementation on the physically and thermodynamically consistent mathematical model for fluid flow in porous rocks, taking nonlinear stress dependent visco-elasto-plastic rheology into account. The effective permeability used for the Darcy flow is obtained through the nonlinear Carman-Kozeny relation. The model is not restricted by the lithostatic stress assumption, allowing for a background stress regime as it occurs in natural conditions. We have developed a fully three-dimensional numerical application based on an iterative finite difference scheme. The application is written in C-CUDA, is enabled for GPU accelerators and is parallelized with MPI to run on multi-GPU clusters. The parallelization on a rectangular grid is straightforward (at each iteration, the boundaries of the local problem are updated by the neighboring processes) and requires no MPI global operations, only MPI point-to-point communication between neighboring processes. This parallelization method should allow by construction for linear weak scaling on any number of processors. Our linearly scaling numerical application predicts the formation of dynamically evolving fluid pathways. These supercomputing applications are vital for resolving current, challenging high-resolution three-dimensional models.
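The parallelization pattern described above (each subdomain refreshes its boundary values from its neighbours at every iteration, with no global operations) can be sketched in plain Python. This is a serial stand-in, not the authors' C-CUDA/MPI code: the 1D Jacobi update, the list-based "processes" and all function names are assumptions chosen to keep the example self-contained.

```python
def global_step(u):
    """One Jacobi sweep on the whole 1D grid; the two end values stay fixed."""
    return [u[0]] + [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)] + [u[-1]]

def exchange_halos(chunks):
    """Point-to-point exchange: each chunk copies its neighbours' edge cells into its ghost cells."""
    for p in range(len(chunks)):
        if p > 0:
            chunks[p][0] = chunks[p - 1][-2]   # left ghost <- left neighbour's last owned cell
        if p < len(chunks) - 1:
            chunks[p][-1] = chunks[p + 1][1]   # right ghost <- right neighbour's first owned cell

def local_step(chunk, is_first, is_last):
    """Jacobi sweep on one chunk laid out as [ghost, owned..., ghost]."""
    new = chunk[:]
    for i in range(1, len(chunk) - 1):
        if (is_first and i == 1) or (is_last and i == len(chunk) - 2):
            continue  # cells on the global boundary keep their value
        new[i] = 0.5 * (chunk[i - 1] + chunk[i + 1])
    return new

def decomposed_solve(u, n_parts, n_steps):
    """Run n_steps sweeps on n_parts subdomains (len(u) assumed divisible by n_parts)."""
    n = len(u) // n_parts
    chunks = [[0.0] + u[p * n:(p + 1) * n] + [0.0] for p in range(n_parts)]
    for _ in range(n_steps):
        exchange_halos(chunks)
        chunks = [local_step(c, p == 0, p == n_parts - 1) for p, c in enumerate(chunks)]
    return [x for c in chunks for x in c[1:-1]]
```

Because each subdomain only ever reads one cell from each direct neighbour, the amount of communication per process does not grow with the number of processes, which is the basis of the weak-scaling argument in the abstract.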
Nguyen, Trung D; Carrillo, Jan-Michael; Dobrynin, Andrey; Brown, W Michael
2013-01-01
Numerous issues have disrupted the trend for increasing computational performance with faster CPU clock frequencies. In order to exploit the potential performance of new computers, it is becoming increasingly desirable to re-evaluate computational physics methods and models with an eye towards approaches that allow for increased concurrency and data locality. The evaluation of long-range Coulombic interactions is a common bottleneck for molecular dynamics simulations. Enhanced truncation approaches have been proposed as an alternative method and are particularly well suited for many-core architectures and GPUs due to the inherent fine-grain parallelism that can be exploited. In this paper, we compare efficient truncation-based approximations to evaluation of electrostatic forces with the more traditional particle-particle particle-mesh (P3M) method for molecular dynamics simulation of polyelectrolyte brush layers. We show that with the use of GPU accelerators, large parallel simulations using P3M can be greater than 3 times faster due to a reduction in the mesh-size required. Alternatively, using a truncation-based scheme can improve performance even further. This approach can be up to 3.9 times faster than GPU-accelerated P3M for many polymer systems and results in accurate calculation of shear velocities and disjoining pressures for brush layers. For configurations with highly non-uniform charge distributions, however, we find that it is more efficient to use P3M; for these systems, computationally efficient parameterizations of the truncation-based approach do not produce accurate counterion density profiles or brush morphologies.
GPU-accelerated indirect boundary element method for voxel model analyses with fast multipole method
NASA Astrophysics Data System (ADS)
Hamada, Shoji
2011-05-01
An indirect boundary element method (BEM) that uses the fast multipole method (FMM) was accelerated using graphics processing units (GPUs) to reduce the time required to calculate a three-dimensional electrostatic field. The BEM is designed to handle cubic voxel models and is specialized to consider square voxel walls as boundary surface elements. The FMM handles the interactions among the surface charge elements and directly outputs surface integrals of the fields over each individual element. The CPU code was originally developed for field analysis in human voxel models derived from anatomical images. FMM processes are programmed using the NVIDIA Compute Unified Device Architecture (CUDA) with double-precision floating-point arithmetic on the basis of a shared pseudocode template. The electric field induced by DC-current application between two electrodes is calculated for two models with 499,629 (model 1) and 1,458,813 (model 2) surface elements. The calculation times were measured with a four-GPU configuration (two NVIDIA GTX295 cards) with four CPU cores (an Intel Core i7-975 processor). The times required by a linear system solver are 31 s and 186 s for models 1 and 2, respectively. The speed-up ratios of the FMM range from 5.9 to 8.2 for model 1 and from 5.0 to 5.6 for model 2. The calculation speed for element-interaction in this BEM analysis was comparable to that of particle-interaction using FMM on a GPU.
GPU-based simulation of optical propagation through turbulence for active and passive imaging
NASA Astrophysics Data System (ADS)
Monnier, Goulven; Duval, François-Régis; Amram, Solène
2012-10-01
The usual numerical approach for accurate, spatially resolved simulation of optical propagation through atmospheric turbulence involves Fresnel diffraction through a series of phase screens. When used to reproduce instantaneous laser beam intensity distribution on a target, this numerical scheme may get quite expensive in terms of CPU and memory resources, due to the many constraints to be fulfilled to ensure the validity of the resulting quantities. In particular, computational requirements grow rapidly with higher-divergence beam, longer propagation distance, stronger turbulence and larger turbulence outer scale. Our team recently developed IMOTEP, a software which demonstrates the benefits of using the computational power of the Graphics Processing Units (GPU) for both accelerating such simulations and increasing the range of accessible simulated conditions. Simulating explicitly the instantaneous effects of turbulence on the backscattered optical wave is even more challenging when the isoplanatic or totally anisoplanatic approximations are not applicable. Two methods accounting for anisoplanatic effects have been implemented in IMOTEP. The first one, dedicated to narrow beams and non-imaging applications, involves exact propagation of spherical waves for an array of isoplanatic sources in the laser spot. The second one, designed for active or passive imaging applications, involves precomputation of the DSP of parameters describing the instantaneous PSF. PSF anisoplanatic statistics are "numerically measured" from numerous simulated realizations. Once the DSP are computed and stored for given conditions (with no intrinsic limitation on turbulence strength), which typically takes 5 to 30 minutes on a recent GPU, output blurred and distorted images are easily and quickly generated. The paper gives an overview of the software with its physical and numerical backgrounds. The approach developed for generating anisoplanatic instantaneous images is emphasized.
Accelerating COBAYA3 on multi-core CPU and GPU systems using PARALUTION
NASA Astrophysics Data System (ADS)
Trost, Nico; Jiménez, Javier; Lukarski, Dimitar; Sanchez, Victor
2014-06-01
COBAYA3 is a multi-physics system of codes which includes two 3D multi-group neutron diffusion codes, ANDES and COBAYA3-PBP, coupled with COBRA-TF, COBRA-IIIc and SUBCHANFLOW sub-channel thermal-hydraulic codes, for the simulation of LWR core transients. The 3D multi-group neutron diffusion equations are expressed in terms of a sparse linear system which can be solved using different iterative Krylov subspace solvers. The mathematical SPARSKIT library has been used for these purposes as it implements, among others, external GMRES, PGMRES and BiCGStab solvers. Multi-core CPUs and graphical processing units (GPUs) provide high performance capabilities which are able to accelerate many scientific computations. To take advantage of these new hardware features in daily use computer codes, the integration of the PARALUTION library to solve sparse systems of linear equations is a good choice. It features several types of iterative solvers and preconditioners which can run on both multi-core CPUs and GPU devices without any modification from the interface point of view. This feature is due to the great portability obtained by the modular and flexible design of the library. By exploring this technology, namely the implementation of the PARALUTION library in COBAYA3, we are able to decrease the solution time of the sparse linear systems by a factor of 5.15 on GPU and 2.56 on multi-core CPU using standard hardware. These speedup factors, together with the implementation details, are discussed in this paper.
NASA Astrophysics Data System (ADS)
Topping, T. Russell; French, James; Hancock, Monte F., Jr.
2010-04-01
Working with the Naval Research Laboratory, Celestech has implemented advanced non-linear hyperspectral image (HSI) processing algorithms optimized for Graphics Processing Units (GPU). These algorithms have demonstrated performance improvements of nearly 2 orders of magnitude over optimal CPU-based implementations. The paper briefly covers the architecture of the NVIDIA GPU to provide a basis for discussing GPU optimization challenges and strategies. The paper then covers optimization approaches employed to extract performance from the GPU implementation of Dr. Bachmann's algorithms including memory utilization and process thread optimization considerations. The paper goes on to discuss strategies for deploying GPU-enabled servers into enterprise service oriented architectures. Also discussed is Celestech's ongoing work in the area of middleware frameworks to provide an optimized multi-GPU utilization and scheduling approach that supports multiple GPUs both in a single computer and across multiple computers. This paper is a complementary work to the paper submitted by Dr. Charles Bachmann entitled "A Scalable Approach to Modeling Nonlinear Structure in Hyperspectral Imagery and Other High-Dimensional Data Using Manifold Coordinate Representations". Dr. Bachmann's paper covers the algorithmic and theoretical basis for the HSI processing approach.
NASA Astrophysics Data System (ADS)
Hou, Chaofeng; Ge, Wei
The graphics processing unit (GPU) is becoming a powerful computational tool in scientific and engineering fields. In this paper, in order to fully employ the available computing capability, a novel mode for parallel molecular dynamics (MD) simulation is presented and implemented on the basis of multiple GPUs and hybrids with central processing units (CPUs). Taking into account the interactions between CPUs, GPUs, and the threads on the GPU in a multi-scale and multilevel computational architecture, several cases, such as polycrystalline silicon and heat transfer on the surface of silicon crystals, are provided and taken as model systems to verify the feasibility and validity of the mode. Furthermore, the mode can be extended to MD simulation in other areas such as biology, chemistry and so forth.
Jia Xun; Lou Yifei; Li Ruijiang; Song, William Y.; Jiang, Steve B.
2010-04-15
Purpose: Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. Methods: The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. Results: It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated to be about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mAs/projection. Compared with the currently widely used full-fan head and neck scanning protocol of approximately 360 projections with 0.4 mAs/projection, it is estimated that an overall 36-72 times dose reduction has been achieved in our fast CBCT reconstruction algorithm. Conclusions: This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency in this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.
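The 36-72 times dose reduction quoted in this abstract follows from the protocol numbers it gives. The short Python check below reproduces the arithmetic, under the assumption (made here, not stated in the abstract) that imaging dose scales with the total exposure, i.e. the number of projections times the mAs per projection. The function and constant names are illustrative.

```python
def total_exposure_mas(n_projections, mas_per_projection):
    """Total tube exposure of a scan protocol in mAs."""
    return n_projections * mas_per_projection

def dose_reduction_factor(reference, low_dose):
    """Ratio of total exposures; each protocol is (n_projections, mAs per projection)."""
    return total_exposure_mas(*reference) / total_exposure_mas(*low_dose)

# Reference protocol from the abstract: about 360 projections at 0.4 mAs each.
REFERENCE = (360, 0.4)

# Low-dose protocols from the abstract: 20 to 40 projections at 0.1 mAs each.
print(dose_reduction_factor(REFERENCE, (40, 0.1)))  # lower end of the quoted range
print(dose_reduction_factor(REFERENCE, (20, 0.1)))  # upper end of the quoted range
```

The reference scan amounts to 144 mAs, against 2 to 4 mAs for the undersampled protocols, which gives the quoted range of roughly 36 to 72.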
Neylon, J; Qi, S; Sheng, K; Kupelian, P; Santhanam, A
2014-06-15
Purpose: To develop a GPU-based framework that can generate high-resolution and patient-specific biomechanical models from a given simulation CT and contoured structures, optimized to run at interactive speeds, for addressing adaptive radiotherapy objectives. Method: A mass-spring-damping (MSD) model was generated from a given simulation CT. The model's mass elements were generated for every voxel of anatomy, and positioned in a deformation space in the GPU memory. MSD connections were established between neighboring mass elements in a dense distribution. Contoured internal structures allowed control over elastic material properties of different tissues. Once the model was initialized in GPU memory, skeletal anatomy was actuated using rigid-body transformations, while soft tissues were governed by elastic corrective forces and constraints, which included tensile forces, shear forces, and spring damping forces. The model was validated by applying a known load to a soft tissue block and comparing the observed deformation to ground truth calculations from established elastic mechanics. Results: Our analyses showed that both local and global load experiments yielded results with a correlation coefficient R² > 0.98 compared to ground truth. Models were generated for several anatomical regions. Head and neck models accurately simulated posture changes by rotating the skeletal anatomy in three dimensions. Pelvic models were developed for realistic deformations for changes in bladder volume. Thoracic models demonstrated breast deformation due to gravity when changing treatment position from supine to prone. The GPU framework performed at greater than 30 iterations per second for over 1 million mass elements with up to 26 MSD connections each. Conclusions: Realistic simulations of site-specific, complex posture and physiological changes were simulated at interactive speeds using patient data. Incorporating such a model with live patient tracking would facilitate real
Wu, Wenji; DeMar, Phil; Holmgren, Don; Singh, Amitoj; Pordes, Ruth; /Fermilab
2011-08-01
At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that our G-NetMon can rapidly detect sub-optimal bulk data movements.
D'Angola, A.; Tuttafesta, M.; Guadagno, M.; Santangelo, P.; Laricchiuta, A.; Colonna, G.; Capitelli, M.
2012-11-27
Calculations of thermodynamic properties of helium plasma using the Reaction Ensemble Monte Carlo (REMC) method are presented. Non-ideal effects at high pressure are observed. Calculations performed using Exp-6 or multi-potential curves in the case of neutral-charge interactions show no significant differences under the thermodynamic conditions considered. Results have been obtained using a Graphics Processing Unit (GPU) CUDA C version of REMC.
NASA Astrophysics Data System (ADS)
Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.
2013-03-01
The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and processing time. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus highly tuned multi-core CPU as a function of the simulation size. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. Good agreement is found between the results of the FDTD and MM methods, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
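The FDTD update at the heart of this abstract can be illustrated with a minimal one-dimensional Python sketch. It is far simpler than the authors' implementation: it uses normalized fields, fixed (reflecting) end values instead of absorbing boundaries, an initial Gaussian pulse instead of plane-wave illumination, and hypothetical parameter names.

```python
import math

def fdtd_1d(n_cells, n_steps, courant=0.5, center=None, sigma=5.0):
    """Leapfrog update of normalized Ez and Hy on a staggered 1D grid.

    Ez starts as a Gaussian pulse and Hy as zero, so the pulse splits into
    two halves travelling in opposite directions at `courant` cells per step.
    The scheme is stable for courant <= 1.
    """
    center = n_cells // 2 if center is None else center
    ez = [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(n_cells)]
    hy = [0.0] * (n_cells - 1)  # hy[k] sits between ez[k] and ez[k + 1]
    for _ in range(n_steps):
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        for k in range(1, n_cells - 1):  # end values of ez are never updated
            ez[k] += courant * (hy[k] - hy[k - 1])
    return ez
```

Within each half-step every cell reads only its immediate neighbours from the other field, so all cells can be updated concurrently. That independence is what both the SSE+OpenMP and the GPU versions described in the abstract exploit.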
Comparison of a 3-D GPU-Assisted Maxwell Code and Ray Tracing for Reflectometry on ITER
NASA Astrophysics Data System (ADS)
Gady, Sarah; Kubota, Shigeyuki; Johnson, Irena
2015-11-01
Electromagnetic wave propagation and scattering in magnetized plasmas are important diagnostics for high temperature plasmas. 1-D and 2-D full-wave codes are standard tools for measurements of the electron density profile and fluctuations; however, ray tracing results have shown that beam propagation in tokamak plasmas is inherently a 3-D problem. The GPU-Assisted Maxwell Code utilizes the FDTD (Finite-Difference Time-Domain) method for solving the Maxwell equations with the cold plasma approximation in a 3-D geometry. Parallel processing with GPGPU (General-Purpose computing on Graphics Processing Units) is used to accelerate the computation. Previously, we reported on initial comparisons of the code results to 1-D numerical and analytical solutions, where the size of the computational grid was limited by the on-board memory of the GPU. In the current study, this limitation is overcome by using domain decomposition and an additional GPU. As a practical application, this code is used to study the current design of the ITER Low Field Side Reflectometer (LFSR) for the Equatorial Port Plug 11 (EPP11). A detailed examination of Gaussian beam propagation in the ITER edge plasma will be presented, as well as comparisons with ray tracing. This work was made possible by funding from the Department of Energy for the Summer Undergraduate Laboratory Internship (SULI) program. This work is supported by US DOE Contract Nos. DE-AC02-09CH11466 and DE-FG02-99-ER54527.
NASA Astrophysics Data System (ADS)
Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.
2010-03-01
Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized algorithm of imaging processing that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to rapidly integrate in an electronic patient record or any disease-centric health care system.
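The per-voxel KNN probability mentioned in this abstract can be sketched in a few lines of Python. The feature vectors, the 0/1 label encoding and the function name are assumptions for illustration; the authors' MATLAB/CUDA implementation is not reproduced here.

```python
import math

def knn_lesion_probability(train_features, train_labels, voxel, k):
    """Fraction of the k nearest training samples that are labelled lesion (1)."""
    ranked = sorted(
        (math.dist(features, voxel), label)
        for features, label in zip(train_features, train_labels)
    )
    return sum(label for _, label in ranked[:k]) / k

# Hypothetical two-feature training set: three normal samples, two lesion samples.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 0, 1, 1]
print(knn_lesion_probability(train_x, train_y, (5.0, 5.4), k=3))
```

Each voxel's query is independent of every other voxel's, so the distance computations can be distributed over GPU threads, which is the kind of data parallelism the abstract aims to exploit.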
NASA Astrophysics Data System (ADS)
Rossi, Francesco; Londrillo, Pasquale; Sgattoni, Andrea; Sinigardi, Stefano; Turchetti, Giorgio
2012-12-01
We present 'jasmine', an implementation of a fully relativistic, 3D, electromagnetic Particle-In-Cell (PIC) code, capable of running simulations in various laser plasma acceleration regimes on Graphics-Processing-Units (GPUs) HPC clusters. Standard energy/charge preserving FDTD-based algorithms have been implemented using double precision and quadratic (or arbitrary sized) shape functions for the particle weighting. When porting a PIC scheme to the GPU architecture (or, in general, a shared memory environment), the particle-to-grid operations (e.g. the evaluation of the current density) require special care to avoid memory inconsistencies and conflicts. Here we present a robust implementation of this operation that is efficient for any number of particles per cell and particle shape function order. Our algorithm exploits the exposed GPU memory hierarchy and avoids the use of atomic operations, which can hurt performance especially when many particles lie in the same cell. We show the code's multi-GPU scalability results and present a dynamic load-balancing algorithm. The code is written using a Python-based C++ meta-programming technique which translates into a high level of modularity and allows for easy performance tuning and simple extension of the core algorithms to various simulation schemes.
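The particle-to-grid conflict described above can be made concrete with a small Python sketch. The abstract does not spell out the algorithm used in the code, so the version below is not it: it shows one generic way to avoid write conflicts, namely binning particles by cell and then letting each grid node gather its own contributions. Linear (first-order) weighting on a 1D grid and all names are assumptions.

```python
def deposit_scatter(positions, charges, n_cells):
    """Particle-centric deposition: each particle writes to two nodes.

    Run in parallel over particles, the two += lines would race whenever
    several particles share a cell, hence the need for atomics.
    """
    rho = [0.0] * (n_cells + 1)
    for x, q in zip(positions, charges):
        cell = int(x)
        w = x - cell
        rho[cell] += q * (1.0 - w)
        rho[cell + 1] += q * w
    return rho

def deposit_gather(positions, charges, n_cells):
    """Node-centric deposition: bin particles by cell, then each node sums
    over the particles of its two adjacent cells. Each node is written by
    exactly one worker, so no atomic updates are required."""
    bins = [[] for _ in range(n_cells)]
    for x, q in zip(positions, charges):
        bins[int(x)].append((x, q))
    rho = [0.0] * (n_cells + 1)
    for node in range(n_cells + 1):
        total = 0.0
        if node < n_cells:                 # particles in the cell to the right
            for x, q in bins[node]:
                total += q * (1.0 - (x - node))
        if node > 0:                       # particles in the cell to the left
            for x, q in bins[node - 1]:
                total += q * (x - (node - 1))
        rho[node] = total
    return rho
```

Positions are expected in grid units within `[0, n_cells)`. Both functions produce the same grid values up to rounding, and the deposited total equals the sum of the particle charges.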
NASA Astrophysics Data System (ADS)
Dardikman, Gili; Shaked, Natan T.
2016-03-01
We present highly parallel and efficient algorithms for real-time reconstruction of the quantitative three-dimensional (3-D) refractive-index maps of biological cells without labeling, as obtained from the interferometric projections acquired by tomographic phase microscopy (TPM). The new algorithms are implemented on the graphic processing unit (GPU) of the computer using CUDA programming environment. The reconstruction process includes two main parts. First, we used parallel complex wave-front reconstruction of the TPM-based interferometric projections acquired at various angles. The complex wave front reconstructions are done on the GPU in parallel, while minimizing the calculation time of the Fourier transforms and phase unwrapping needed. Next, we implemented on the GPU in parallel the 3-D refractive index map retrieval using the TPM filtered-back projection algorithm. The incorporation of algorithms that are inherently parallel with a programming environment such as Nvidia's CUDA makes it possible to obtain real-time processing rate, and enables high-throughput platform for label-free, 3-D cell visualization and diagnosis.
A real-time autostereoscopic display method based on partial sub-pixel by general GPU processing
NASA Astrophysics Data System (ADS)
Chen, Duo; Sang, Xinzhu; Cai, Yuanfa
2013-08-01
As 3D technology advances, real-time autostereoscopic display demands a large computing capacity. Because sub-pixel allocation is complicated, masks providing pre-arranged sub-pixels are fabricated to reduce real-time computation. However, the binary mask has inherent drawbacks. To overcome them, weighted masks are used in displays based on partial sub-pixels. The corresponding computations, however, grow dramatically and become too heavy for the CPU. To improve calculation speed, Graphics Processing Unit (GPU) processing with parallel computing ability is adopted. Here the principle of partial sub-pixel is presented, and the texture array of Direct3D 10 is used to increase the number of computable textures. When dealing with an HD display and multiple viewpoints, a low-end GPU is still able to deliver smooth real-time display, while the performance of a high-end CPU is not acceptable. Moreover, with the texture array, D3D10 can be two and sometimes three times as fast as D3D9. The proposed method has several distinguishing features, such as good portability, low overhead and good stability. The GPU display system could also be used for future Ultra HD autostereoscopic displays.
A Multi-CPU/GPU implementation of RBF-generated finite differences for PDEs on a Sphere
NASA Astrophysics Data System (ADS)
Bollig, E. F.; Flyer, N.; Erlebacher, G.
2011-12-01
Numerical methods leveraging Radial Basis Functions (RBFs) are on the rise in computational science. With natural extensions into higher dimensions, functionality in the face of unstructured grids, stability for large time-steps, competitive accuracy and convergence when compared to other state-of-the-art methods, it is hard to ignore these simple-to-code alternatives. RBF-generated finite differences (RBF-FD) hold a promising future in that they have the advantages of global RBFs but have the ability to be highly parallelizable on multi-core machines. They differ from classical finite differences in that the test functions used to calculate the differentiation weights are n-dimensional RBFs rather than one-dimensional polynomials. This allows for generalization to n-dimensional space on completely scattered node layouts. We present an ongoing effort to develop fast and efficient implementations of RBF-FD for the geosciences. Specifically, we introduce a multi-CPU/GPU implementation for the solution of parabolic and hyperbolic PDEs. This work targets the NSF funded Keeneland GPU cluster, which, like many of the latest HPC systems around the world, offers significantly more GPU accelerators than CPU counterparts. We will discuss parallelization strategies, algorithms and data-structures used to span computation across the heterogeneous architecture.
NASA Astrophysics Data System (ADS)
Ha, Sanghyun; You, Donghyun
2015-11-01
Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of both incompressible and compressible Navier-Stokes equations. A semi-implicit ADI finite-volume method for integration of the incompressible and compressible Navier-Stokes equations, which are discretized on a structured arbitrary grid, is parallelized for GPU computations using CUDA (Compute Unified Device Architecture). In the semi-implicit ADI finite-volume method, the nonlinear convection terms and the linear diffusion terms are integrated in time using a combination of an explicit scheme and an ADI scheme. Inversion of multiple tri-diagonal matrices is found to be the major challenge in GPU computations of the present method. Some of the algorithms for solving tri-diagonal matrices on GPUs are evaluated and optimized for GPU-acceleration of the present semi-implicit ADI computations of incompressible and compressible Navier-Stokes equations. Supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning Grant NRF-2014R1A2A1A11049599.
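The tri-diagonal inversions that this abstract identifies as the main challenge are classically solved with the Thomas algorithm. The Python sketch below is a serial reference version, not one of the GPU algorithms evaluated by the authors. The array convention (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, with `a[0]` and `c[-1]` unused) and the function names are assumptions.

```python
def thomas(a, b, c, d):
    """Solve a tri-diagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal (c[-1] unused),
    d: right-hand side. No pivoting, so the matrix should be diagonally
    dominant or otherwise well behaved.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def solve_lines(a, b, c, rhs_lines):
    """ADI-style batch: one independent tri-diagonal solve per grid line."""
    return [thomas(a, b, c, d) for d in rhs_lines]
```

Each solve is inherently sequential along its line, but an ADI sweep produces one such system per grid line. A straightforward GPU mapping therefore assigns one line to each thread, while other GPU approaches parallelize within a single system.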
Hsu, Yu-Han H; Ferl, Gregory Z; Ng, Chee M
2013-05-01
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is often used to examine vascular function in malignant tumors and noninvasively monitor drug efficacy of antivascular therapies in clinical studies. However, complex numerical methods used to derive tumor physiological properties from DCE-MRI images can be time-consuming and computationally challenging. Recent advancement of computing technology in graphics processing unit (GPU) makes it possible to build an energy-efficient and high-power parallel computing platform for solving complex numerical problems. This study develops the first reported fast GPU-based method for nonparametric kinetic analysis of DCE-MRI data using clinical scans of glioblastoma patients treated with bevacizumab (Avastin®). In the method, contrast agent concentration-time profiles in arterial blood and tumor tissue are smoothed using a robust kernel-based regression algorithm in order to remove artifacts due to patient motion and then deconvolved to produce the impulse response function (IRF). The area under the curve (AUC) and mean residence time (MRT) of the IRF are calculated using statistical moment analysis, and two tumor physiological properties that relate to vascular permeability, volume transfer constant between blood plasma and extravascular extracellular space (K(trans)) and fractional interstitial volume (ve) are estimated using the approximations AUC/MRT and AUC. The most significant feature in this method is the use of GPU-computing to analyze data from more than 60,000 voxels in each DCE-MRI image in parallel fashion. All analysis steps have been automated in a single program script that requires only blood and tumor data as the sole input. The GPU-accelerated method produces K(trans) and ve estimates that are comparable to results from previous studies but reduces computational time by more than 80-fold compared to a previously reported central processing unit-based nonparametric method. Furthermore, it is at
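The statistical moment analysis described in this abstract (AUC and MRT of the impulse response function, then K(trans) approximated by AUC/MRT and ve by AUC) can be sketched for a single voxel in Python. The trapezoidal integration, the function names and the mono-exponential test curve are assumptions; the smoothing and deconvolution steps of the actual method are not shown.

```python
import math

def trapezoid(t, y):
    """Trapezoidal-rule integral of samples y over time points t."""
    return sum(0.5 * (y[i] + y[i + 1]) * (t[i + 1] - t[i]) for i in range(len(t) - 1))

def moment_analysis(t, irf):
    """Zeroth and first moments of an impulse response function (IRF)."""
    auc = trapezoid(t, irf)                                      # area under the curve
    aumc = trapezoid(t, [ti * yi for ti, yi in zip(t, irf)])     # area under t * IRF
    mrt = aumc / auc                                             # mean residence time
    return {"AUC": auc, "MRT": mrt, "Ktrans": auc / mrt, "ve": auc}

# Hypothetical check: for IRF(t) = K * exp(-k * t), AUC = K / k and MRT = 1 / k,
# so AUC / MRT recovers K.
times = [0.01 * i for i in range(6001)]
curve = [0.2 * math.exp(-0.5 * ti) for ti in times]
print(moment_analysis(times, curve))
```

Each voxel is processed independently with the same arithmetic, which is what allows the more than 60,000 voxels per image mentioned in the abstract to be analyzed in parallel.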
Real-time dose computation: GPU-accelerated source modeling and superposition/convolution
Jacques, Robert; Wong, John; Taylor, Russell; McNutt, Todd
2011-01-15
Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high resolution (400²) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by approximately 24% to 0.058 and 0.94 s for 64³ and 128³ water phantoms; a speed-up of 101-144× over the highly optimized Pinnacle³ (Philips, Madison, WI) implementation. Pinnacle³
NASA Astrophysics Data System (ADS)
Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua
2014-12-01
Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To our best knowledge, those are the largest-scale CPU-GPU collaborative simulations that
Cazzaniga, Paolo; Nobile, Marco S.; Besozzi, Daniela; Bellini, Matteo; Mauri, Giancarlo
2014-01-01
The introduction of general-purpose Graphics Processing Units (GPUs) is boosting scientific applications in Bioinformatics, Systems Biology, and Computational Biology. In these fields, the use of high-performance computing solutions is motivated by the need to perform large numbers of in silico analyses to study the behavior of biological systems in different conditions, which demands computing power that usually exceeds the capability of standard desktop computers. In this work we present coagSODA, a CUDA-powered computational tool that was purposely developed for the analysis of a large mechanistic model of the blood coagulation cascade (BCC), defined according to both mass-action kinetics and Hill functions. coagSODA allows the execution of parallel simulations of the dynamics of the BCC by automatically deriving the system of ordinary differential equations and then exploiting the numerical integration algorithm LSODA. We present the biological results achieved with a massive exploration of perturbed conditions of the BCC, carried out with one-dimensional and bi-dimensional parameter sweep analysis, and show that GPU-accelerated parallel simulations of this model can achieve up to a 181× speedup compared to the corresponding sequential simulations. PMID:25025072
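The combination of mass-action terms, Hill functions, and parameter sweeps mentioned in this abstract can be sketched on a toy problem. The Python example below is not the coagulation model and does not use LSODA: it integrates a single hypothetical species with forward Euler (a deliberately simple stand-in), with Hill-type production and first-order (mass-action) degradation. All names and rate constants are assumptions chosen for illustration.

```python
def hill(x, k_half, n):
    """Hill function: fraction of the maximal response at concentration x."""
    return x**n / (k_half**n + x**n)

def simulate(stimulus, k_prod=2.0, k_deg=1.0, k_half=0.5, n=2,
             dt=0.01, t_end=50.0):
    """Integrate dx/dt = k_prod*hill(stimulus) - k_deg*x with forward Euler."""
    x = 0.0
    for _ in range(int(round(t_end / dt))):
        x += dt * (k_prod * hill(stimulus, k_half, n) - k_deg * x)
    return x  # value of x at t_end, close to the steady state

# One-dimensional parameter sweep over the stimulus level.
# Each run is independent of the others, so the runs could execute in parallel.
sweep = {s: simulate(s) for s in (0.1, 0.5, 1.0, 2.0)}
```

Because every point of the sweep is an independent simulation, this kind of workload maps naturally onto many parallel workers, which is the property the abstract reports exploiting on the GPU.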
Nguyen, Thuy-Diem; Schmidt, Bertil; Zheng, Zejun; Kwoh, Chee-Keong
2015-01-01
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications. PMID:26451819
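The idea of deriving a distance cutoff from an abrupt change in merging heights can be sketched as follows. This Python example uses a deliberately simple largest-gap rule as a stand-in; the abstract does not specify the anomaly detection technique CRiSPy actually uses, and the function name and the sample heights are hypothetical.

```python
def dynamic_cutoff(merge_heights):
    """Place the cut in the middle of the largest jump between merge heights."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    # Pair each gap with the index of its lower height, then take the widest.
    gaps = [(heights[i + 1] - heights[i], i) for i in range(len(heights) - 1)]
    _, i = max(gaps)
    return 0.5 * (heights[i] + heights[i + 1])

# Hypothetical merge heights of a dendrogram over seven reads (six merges).
heights = [0.01, 0.02, 0.03, 0.04, 0.20, 0.25]
cutoff = dynamic_cutoff(heights)

# Every merge above the cutoff is undone, and each undone merge adds a cluster.
n_clusters = 1 + sum(h > cutoff for h in heights)
```

With these sample heights the widest gap lies between 0.04 and 0.20, so the cut falls between the tight early merges and the two late ones.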
Solving systems of linear equations by GPU-based matrix factorization in a Science Ground Segment
NASA Astrophysics Data System (ADS)
Legendre, Maxime; Schmidt, Albrecht; Moussaoui, Saïd; Lammers, Uwe
2013-11-01
Recently, Graphics Cards have been used to offload scientific computations from traditional CPUs for greater efficiency. This paper investigates the adaptation of a real-world linear system solver, which plays a central role in the data processing of the Science Ground Segment of ESA's astrometric Gaia mission. The paper quantifies the resource trade-offs between traditional CPU implementations and modern CUDA based GPU implementations. It also analyses the impact on the pipeline architecture and system development. The investigation starts from both a selected baseline algorithm with a reference implementation and a traditional linear system solver and then explores various modifications to control flow and data layout to achieve higher resource efficiency. It turns out that with the current state of the art, the modifications impact non-technical system attributes. For example, the control flow of the original modified Cholesky transform is modified so that locality of the code and verifiability deteriorate. The maintainability of the system is affected as well. On the system level, users will have to deal with more complex configuration control and testing procedures.
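For orientation, solving a linear system by Cholesky factorization can be sketched in a few lines of Python. This is the textbook variant for a symmetric positive definite matrix, run on the CPU; it is not the modified Cholesky transform or the GPU control flow discussed in the paper, and the function names and the 2×2 test system are hypothetical.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a must be SPD)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_spd(a, b):
    """Solve a x = b by factorizing a, then two triangular solves."""
    n = len(b)
    low = cholesky(a)
    y = [0.0] * n
    for i in range(n):                     # forward substitution: L y = b
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):           # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

x = solve_spd([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0])
```

The inner sums in the factorization depend on previously computed entries of L, which hints at why restructuring the control flow and data layout for a GPU, as the abstract describes, is not trivial.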
Holovideo: Real-time 3D range video encoding and decoding on GPU
NASA Astrophysics Data System (ADS)
Karpinsky, Nikolaus; Zhang, Song
2012-02-01
We present a 3D video-encoding technique called Holovideo that is capable of encoding high-resolution 3D videos into standard 2D videos, and then decoding the 2D videos back into 3D rapidly without significant loss of quality. Due to the nature of the algorithm, 2D video compression such as JPEG encoding with QuickTime Run Length Encoding (QTRLE) can be applied with little quality loss, resulting in an effective way to store 3D video at very small file sizes. We found that under a compression ratio of 134:1, Holovideo to OBJ file format, the loss in 3D geometry quality is negligible. Several sets of 3D videos were captured using a structured light scanner, compressed using the Holovideo codec, and then uncompressed and displayed to demonstrate the effectiveness of the codec. With the use of OpenGL Shaders (GLSL), the 3D video codec can encode and decode in real time. We demonstrated that for a video size of 512×512, the decoding speed is 28 frames per second (FPS) with a laptop computer using an embedded NVIDIA GeForce 9400M graphics processing unit (GPU). Encoding can be done with this same setup at 18 FPS, making this technology suitable for applications such as interactive 3D video games and 3D video conferencing.
GPU-based simulation of the two-dimensional unstable structure of gaseous oblique detonations
Teng, H.H.; Kiyanda, C.B.; Ng, H.D.; Morgan, G.H.; Nikiforakis, N.
2015-03-10
In this paper, the two-dimensional structure of unstable oblique detonations induced by a wedge in a supersonic combustible gas flow is simulated using the reactive Euler equations with a one-step Arrhenius chemistry model. A wide range of activation energies of the combustible mixture is considered. Computations are performed on a graphics processing unit (GPU) to reduce the simulation runtimes. A large computational domain covered by a uniform mesh with high grid resolution is used to properly capture the development of instabilities and the formation of different transverse wave structures. After the initiation point, where the oblique shock transitions into a detonation, an instability begins to manifest and in all cases, the left-running transverse waves first appear, followed by the subsequent emergence of right-running transverse waves forming