On implementation of EM-type algorithms in the stochastic models for a matrix computing on GPU
Gorshenin, Andrey K.
2015-03-10
The paper discusses the main ideas behind implementing EM-type algorithms on graphics processors and their application to probabilistic models based on Cox processes. An example of GPU-adapted MATLAB source code for finite normal mixtures, using matrix formulas for expectation-maximization, is given. Tests of computational efficiency, GPU versus CPU, are illustrated for different sample sizes.
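As a rough illustration of the expectation-maximization update for a finite normal mixture mentioned in this abstract, the following Python sketch performs the E-step (a responsibility matrix) and the M-step (weighted means and variances) for one-dimensional data. It is not the paper's MATLAB code; the function names, the toy data and the lower bound on the standard deviation are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, means, stds):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)))
        for x in data
    )

def em_step(data, weights, means, stds):
    """One EM iteration for a 1-D finite normal mixture."""
    k = len(weights)
    # E-step: responsibility of each component for each observation.
    resp = []
    for x in data:
        row = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(row)
        resp.append([r / total for r in row])
    # M-step: re-estimate weights, means and standard deviations.
    n = len(data)
    new_weights, new_means, new_stds = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_weights.append(nj / n)
        new_means.append(mean)
        new_stds.append(max(math.sqrt(var), 1e-6))  # assumed floor to avoid collapse
    return new_weights, new_means, new_stds

# Toy data with two well-separated groups (hypothetical values).
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, stds)]
for _ in range(20):
    weights, means, stds = em_step(data, weights, means, stds)
    history.append(log_likelihood(data, weights, means, stds))
```

The responsibility matrix is the part that maps naturally onto matrix operations, which is presumably what makes the algorithm a good candidate for GPU execution.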
gpuPOM: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration efforts for climate models port only certain hot spots of the code, and can achieve only limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, and design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We apply several acceleration methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
GPU COMPUTING FOR PARTICLE TRACKING
Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong
2011-03-25
This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges arising from introducing the GPU are also discussed. General-purpose computation on Graphics Processing Units (GPGPU) brings massively parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU because it is embarrassingly parallel, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multiprocessors (MP) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ [2] is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
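The core count and the one-thread-per-particle mapping described in this abstract come down to simple index arithmetic, sketched below in Python. The block size of 256 threads and the particle count of 10000 are assumptions for the example, not values from the paper.

```python
MULTIPROCESSORS = 14  # Tesla C2050, as stated in the abstract
CORES_PER_MP = 32

def total_cores():
    """Total number of cores across all multiprocessors."""
    return MULTIPROCESSORS * CORES_PER_MP

def launch_shape(num_particles, threads_per_block):
    """Number of blocks needed so that every particle gets its own thread."""
    blocks = (num_particles + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

def particle_index(block_idx, thread_idx, threads_per_block):
    """Global particle index handled by a given thread of a given block."""
    return block_idx * threads_per_block + thread_idx

# Hypothetical launch: 10000 particles with 256 threads per block.
blocks, threads = launch_shape(10000, 256)
last_particle = particle_index(blocks - 1, 15, threads)
```

Because the last block is only partly filled, a real kernel would have each thread compare its global index against the particle count before doing any work.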
GPU programming for biomedical imaging
NASA Astrophysics Data System (ADS)
Caucci, Luca; Furenlid, Lars R.
2015-08-01
Scientific computing is rapidly advancing due to the introduction of powerful new computing hardware, such as graphics processing units (GPUs). Affordable thanks to mass production, GPU processors enable the transition to efficient parallel computing by bringing the performance of a supercomputer to a workstation. We elaborate on some of the capabilities and benefits that GPU technology offers to the field of biomedical imaging. As practical examples, we consider a GPU algorithm for the estimation of position of interaction from photomultiplier (PMT) tube data, as well as a GPU implementation of the MLEM algorithm for iterative image reconstruction.
GPU accelerated dislocation dynamics
NASA Astrophysics Data System (ADS)
Ferroni, Francesco; Tarleton, Edmund; Fitzgerald, Steven
2014-09-01
In this paper we analyze the computational bottlenecks in discrete dislocation dynamics modeling (associated with segment-segment interactions as well as the treatment of free surfaces), discuss the parallelization and optimization strategies, and demonstrate the effectiveness of Graphical Processing Unit (GPU) computation in accelerating dislocation dynamics simulations and expanding their scope. Individual algorithmic benchmark tests as well as an example large simulation of a thin film are presented.
NASA Astrophysics Data System (ADS)
Chase, Patrick; Vondran, Gary
2011-01-01
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a $500 NVIDIA GTX-580 GPU is 3x faster than a $1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per-multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations, that are faster than any previously published solution.
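For readers unfamiliar with the technique, the Python sketch below shows tetrahedral interpolation inside a single cell of a 3D lookup table. It is a textbook formulation rather than the authors' tuned GPU kernel, and it interpolates one scalar per corner where a color conversion would interpolate each output channel. The `corners` dictionary layout and the function name are assumptions.

```python
def tetrahedral_interp(corners, fx, fy, fz):
    """Interpolate inside a unit cube using the tetrahedron containing (fx, fy, fz).

    corners maps (i, j, k) with i, j, k in {0, 1} to the value at that cube corner.
    """
    fracs = [fx, fy, fz]
    # Sorting the axes by descending fraction selects one of the six tetrahedra.
    order = sorted(range(3), key=lambda axis: fracs[axis], reverse=True)
    f_sorted = [fracs[axis] for axis in order]
    weights = [
        1.0 - f_sorted[0],
        f_sorted[0] - f_sorted[1],
        f_sorted[1] - f_sorted[2],
        f_sorted[2],
    ]
    # Walk from corner (0, 0, 0) to (1, 1, 1), stepping along the sorted axes.
    vertex = [0, 0, 0]
    result = weights[0] * corners[(0, 0, 0)]
    for step, axis in enumerate(order):
        vertex[axis] = 1
        result += weights[step + 1] * corners[tuple(vertex)]
    return result

# Corner values sampled from a linear function, which the scheme reproduces exactly.
corners = {
    (i, j, k): 1.0 + i + 2.0 * j + 3.0 * k
    for i in (0, 1) for j in (0, 1) for k in (0, 1)
}
value = tetrahedral_interp(corners, 0.2, 0.7, 0.5)
```

Only four of the eight corners are read per lookup, compared with all eight for trilinear interpolation, which is one reason memory access patterns matter so much in the GPU version.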
Astronomy for/with underprivileged children in Limeira
NASA Astrophysics Data System (ADS)
Bretones, P. S.; Oliveira, V. C.
2003-08-01
In 2001, the Instituto Superior de Ciências Aplicadas (ISCA Faculdades de Limeira) began a project through which the Observatório do Morro Azul entered into a partnership with the Centro de Promoção Social Municipal (CEPROSOM), an institution maintained by the Limeira city government to serve underprivileged children and adolescents. CEPROSOM ran two projects, the Projeto Centro de Convivência Infantil (CCI) and the Programa Criança e Adolescente (PCA), which served children and adolescents in Community Centers in various parts of the city. These projects give priority to providing enjoyable activities for the children in order to keep them off the streets. The children thus gained one more type of activity: visits to the observatory. This poster describes the various phases of the project, which involved planning meetings, an astronomy course for the CCI and PCA counselors, activities related to the children's visits to the Observatory, a proposal to build gnomons and sundials at the various Community Centers in Limeira, and publicity for the project in the press. The poster includes discussions of how underprivileged children learn, accounts showing the counselors' views on the relevance of teaching astronomy, accounts from the monitor who received the visitors at the Observatory, and what the number of children served represented for the institution's activities since it began operating and, in particular, in 2001. The results are based on an analysis of accounts by the counselors and the Observatory monitor, visit records, and local press articles. It concludes with an assessment of what the project meant for the participating institutions. For the Observatory in particular, an analysis was made in relation to the other kinds of service it provides, which involve school students and the general public. Also addressed is the question of the Observatory's social commitment in the education of
NASA Astrophysics Data System (ADS)
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA's graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
GPU-accelerated compressive holography.
Endo, Yutaka; Shimobaba, Tomoyoshi; Kakue, Takashi; Ito, Tomoyoshi
2016-04-18
In this paper, we show fast signal reconstruction for compressive holography using a graphics processing unit (GPU). We implemented a fast iterative shrinkage-thresholding algorithm on a GPU to solve the ℓ_{1} and total variation (TV) regularized problems that are typically used in compressive holography. Since the algorithm is highly parallel, GPUs can compute it efficiently by data-parallel computing. For better performance, our implementation exploits the structure of the measurement matrix to compute the matrix multiplications. The results show that GPU-based implementation is about 20 times faster than CPU-based implementation. PMID:27137282
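The fast iterative shrinkage-thresholding algorithm named in this abstract can be sketched in a few lines of Python for the ℓ1-regularized case, minimizing 0.5·||Ax − y||² + λ·||x||₁. This is a generic CPU formulation under stated assumptions, not the authors' GPU code: it uses a small dense matrix stored as nested lists, omits the total variation variant, and expects the caller to supply the Lipschitz constant (the largest eigenvalue of AᵀA). All names are hypothetical.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the l1 norm)."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)

def matvec(a, x):
    """Product A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def matvec_t(a, r):
    """Product A^T r."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]

def fista(a, y, lam, lipschitz, iterations=50):
    """Minimize 0.5*||A x - y||^2 + lam*||x||_1 with FISTA."""
    n = len(a[0])
    x = [0.0] * n
    z = x[:]
    t = 1.0
    for _ in range(iterations):
        residual = [p - q for p, q in zip(matvec(a, z), y)]
        grad = matvec_t(a, residual)
        # Gradient step followed by shrinkage.
        x_new = [
            soft_threshold(zj - gj / lipschitz, lam / lipschitz)
            for zj, gj in zip(z, grad)
        ]
        # Momentum update that distinguishes FISTA from plain ISTA.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = [xn + ((t - 1.0) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x

# With A equal to the identity, the minimizer is the soft-thresholded measurement.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
solution = fista(identity, [3.0, -0.5, 1.5], lam=1.0, lipschitz=1.0)
```

Each iteration is dominated by the two matrix-vector products, which is where exploiting the structure of the measurement matrix would pay off.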
GPU-Powered Coherent Beamforming
NASA Astrophysics Data System (ADS)
Magro, A.; Adami, K. Zarb; Hickish, J.
2015-03-01
Graphics processing unit (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
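Coherent beamforming amounts to a phase-weighted sum over antennas. The Python sketch below forms one narrowband beam for a linear array, assuming a plane-wave signal and antenna positions expressed in wavelengths. The geometry, the angles and the function names are assumptions for illustration and do not describe the BEST-2 array or the authors' kernel.

```python
import cmath
import math

def steering_weights(positions, wavelength, angle_rad):
    """Phase weights that align a plane wave arriving from angle_rad."""
    return [
        cmath.exp(-2j * math.pi * p * math.sin(angle_rad) / wavelength)
        for p in positions
    ]

def beamform(samples, weights):
    """Coherent sum of one complex sample per antenna."""
    return sum(w * s for w, s in zip(weights, samples))

# Hypothetical four-element array with half-wavelength spacing.
positions = [0.0, 0.5, 1.0, 1.5]
wavelength = 1.0
source_angle = math.radians(20.0)

# Unit-amplitude plane wave from the source direction.
signal = [
    cmath.exp(2j * math.pi * p * math.sin(source_angle) / wavelength)
    for p in positions
]

on_source = abs(beamform(signal, steering_weights(positions, wavelength, source_angle)))
off_source = abs(beamform(signal, steering_weights(positions, wavelength, math.radians(-40.0))))
```

Generating many synthesized beams repeats the same sum with a different weight vector per beam, so the work grows with the product of beams, antennas and time samples.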
OV-Wav: a new package for multiscale analysis in astronomy
NASA Astrophysics Data System (ADS)
Pereira, D. N. E.; Rabaça, C. R.
2003-08-01
Wavelets and other forms of multiscale analysis have been widely used in many fields of knowledge and are recognized as superior to more traditional techniques, such as Fourier and Gabor analysis, in certain applications. Although wavelet theory began to be developed almost thirty years ago, its impact on the study of astronomical images was small until quite recently. We present a set of programs developed over the last three years at the Observatório do Valongo/UFRJ that make it possible to apply this powerful tool to common problems in astronomy, such as noise removal, hierarchical source detection, and the modeling of objects with arbitrary brightness profiles under non-ideal conditions. This package, developed to run on the IDL platform, had its first version completed recently and is being made openly available to the scientific community. We also show results of controlled tests to which we subjected the programs, applying them to artificial images, with satisfactory results. Some astrophysical applications were studied with the package on an experimental basis, including the analysis of the diffuse light component in Hickson compact groups of galaxies and the study of substructures of planetary nebulae in multiscale space.
GPU applications for data processing
Vladymyrov, Mykhailo; Aleksandrov, Andrey; Tioukov, Valeri
2015-12-31
Modern experiments that use nuclear photoemulsion presuppose that fast and efficient data acquisition from the emulsion can be performed. New approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use Graphics Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us raise the scanning speed by a factor of nine.
A grid of theoretical profiles for massive stars in transition
NASA Astrophysics Data System (ADS)
Nascimento, C. M. P.; Machado, M. A.
2003-08-01
At the XXVIII Annual Meeting of the Sociedade Astronômica Brasileira (2002) we presented a grid of profiles calculated for the points of the evolutionary track with solar metallicity, Z = 0.02, and the standard mass-loss rate, for stars with initial masses of 25, 40, 60, 85 and 120 solar masses. These profiles were calculated with a numerical code suited to describing the winds of massive objects, assuming spherical symmetry, stationarity and homogeneity. In the present work, we complete the grid with the theoretical profiles for the Z = 0.02 tracks with a mass-loss rate twice the standard one, and for metallicity Z = 0.008. For each point of the three tracks we obtain the theoretical profiles of Hα, Hβ, Hγ and Hδ, and as expected they appear in pure emission, in pure absorption, or as P-Cygni profiles. For very low mass-loss rates (~10^-7) no lines form, which is seen at the first points of all the tracks. In general, for a given point the emission component decreases and the absorption increases from Hα to Hδ. The tracks with Z = 0.02 and the standard mass-loss rate are found to have fewer loops than those with Z = 0.02 and twice the standard rate, and their profiles are, in general, less intense. The Z = 0.008 track shows fewer loops and a larger variation in luminosity, and its profiles are, in some tracks, more intense. We also find that distinct points on the same track show different profiles for similar values of luminosity and effective temperature. A grid of theoretical profiles therefore appears useful for providing preliminary information about the evolutionary stage of a massive star.
Observational constraints on the s-process in barium giant stars
NASA Astrophysics Data System (ADS)
Smiljanic, R. H. S.; Porto de Mello, G. F.; da Silva, L.
2003-08-01
Barium stars are red giants of type GK that show atmospheric excesses of s-process elements. Such excesses are expected in stars in the thermally pulsing AGB phase (TP-AGB). Barium stars, however, are less massive and less luminous than AGB stars, and so could not have enriched themselves. Their enrichment would have originated in a companion star, initially more massive, which evolves through the TP-AGB, enriches itself with s-process elements, and transfers contaminated material to the atmosphere of the present barium star. The companion then evolves into a white dwarf and is no longer directly observed. Barium stars are therefore useful as observational tests for theories of s-process nucleosynthesis, convection and mass loss. Detailed abundance analyses with high-quality data for these objects are still scarce in the literature. In this work we build model atmospheres and, through a differential analysis, determine atmospheric and evolutionary parameters for a sample of ten barium giants and four normal giants. We determine their abundance patterns for Na, Mg, Al, Si, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Sr, Y, Zr, Ba, La, Ce, Nd, Sm, Eu and Gd, concluding that some stars classified in the literature as barium giants are in fact normal giants. We compare two mean abundance patterns, for stars with large excesses and stars with moderate excesses, with theoretical models of s-process enrichment. Both groups of stars are fitted by the same neutron exposure parameters. This result suggests that the occurrence of the barium phenomenon with different intensities is not due to different neutron exposures. We also discuss nucleosynthetic effects linked to the s-process that have been suggested in the literature for the elements Cu, Mn, V and Sc.
NASA Astrophysics Data System (ADS)
Che, Ming-Chao; Liang, Jie
2010-01-01
JPEG XR (formerly Microsoft Windows Media Photo and HD Photo) is the latest image coding standard. By integrating various advanced technologies such as integer hierarchical lapped transform, context adaptive Huffman coding, and high dynamic range coding, it achieves performance competitive with JPEG-2000, but with lower computational complexity and memory requirements. In this paper, the GPU implementation of the JPEG XR codec using NVIDIA CUDA (Compute Unified Device Architecture) technology is investigated. Design considerations to speed up the algorithm are discussed, taking full advantage of the properties of the CUDA framework and JPEG XR. Experimental results are presented to demonstrate the performance of the GPU implementation.
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover the principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing number of datasets from different domains, such as social media, earth observation, and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as a cluster or supercomputer are costly. Over the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to the traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) on each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build a virtual supercomputer that supports CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as graphics-targeted devices, are used to greatly improve rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies. Reference: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
Randomized selection on the GPU
Monroe, Laura Marie; Wendelberger, Joanne R; Michalak, Sarah E
2011-01-13
We implement here a fast and memory-sparing probabilistic top N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic guess-and-check process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
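To make the Las Vegas property concrete (the result is always correct and only the running time is random), here is a Python sketch of a randomized selection of the Nth largest element. It uses a random-pivot quickselect as a stand-in, not the guess-and-check procedure of the paper, and the function name and data are assumptions.

```python
import random

def select_nth_largest(values, n, rng=None):
    """Return the n-th largest element (n = 1 is the maximum).

    Random pivots affect only the running time, never the result.
    """
    if not 1 <= n <= len(values):
        raise ValueError("n must be between 1 and len(values)")
    rng = rng or random.Random()
    candidates = list(values)
    while True:
        pivot = rng.choice(candidates)
        larger = [v for v in candidates if v > pivot]
        equal = sum(1 for v in candidates if v == pivot)
        if n <= len(larger):
            # The answer lies among the elements above the pivot.
            candidates = larger
        elif n <= len(larger) + equal:
            return pivot
        else:
            # Discard the pivot and everything above it, then adjust the rank.
            n -= len(larger) + equal
            candidates = [v for v in candidates if v < pivot]

sample = [7, 2, 9, 4, 9, 1, 5, 8]
third_largest = select_nth_largest(sample, 3, random.Random(1))
```

Each pass discards part of the data, which echoes the abstract's point that randomization reduces the amount of data needing heavy processing.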
BSSDATA - an optimized program for data filtering in solar radio astronomy
NASA Astrophysics Data System (ADS)
Martinon, A. R. F.; Sawant, H. S.; Fernandes, F. C. R.; Stephany, S.; Preto, A. J.; Dobrowolski, K. M.
2003-08-01
Since 1998, the Brazilian Solar Spectroscope (BSS) has been in regular operation at INPE, in São José dos Campos. The BSS is dedicated to observing decimetric solar bursts with high temporal and spectral resolution, with the main purpose of investigating phenomena associated with the energy release of solar flares. Between 1999 and 2002, approximately 340 solar bursts were catalogued and classified into 8 distinct types according to their morphological characteristics. The detailed analysis of each type, or group, of solar bursts must take into account the variation of the quiet-Sun flux (background) with frequency and its variation in time, as well as the complexity of the bursts and fine structures recorded superimposed on the variable background. The BSSData program was developed to carry out this analysis. Written in C++, it consists of several tools that assist in the processing and analysis of the data recorded by the BSS. In this work we address the tools for background-noise filtering. The BSSData noise-filtering routines were tested on the various groups of solar bursts ("dots", "fiber", "lace", "patch", "spikes", "type III" and "zebra"), achieving a good reduction of the background noise and consequently yielding data in which the signal is more homogeneous, highlighting the areas where solar bursts occur and making the determination of each burst's observational parameters more precise. These results will be presented and discussed.
Parallelization of MODFLOW using a GPU library.
Ji, Xiaohui; Li, Dandan; Cheng, Tangpei; Wang, Xu-Sheng; Wang, Qun
2014-01-01
A new method based on a graphics processing unit (GPU) library is proposed in the paper to parallelize MODFLOW. Two programs, GetAb_CG and CG_GPU, have been developed to reorganize the equations in MODFLOW and solve them with the GPU library. Experimental tests using the NVIDIA Tesla C1060 show that a 1.6- to 10.6-fold speedup can be achieved for models with more than 10^5 cells. The efficiency can be further improved by using up-to-date GPU devices. PMID:23937315
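The program name CG_GPU suggests a conjugate gradient solver. Assuming that reading, the Python sketch below shows the basic unpreconditioned method for a symmetric positive definite system A x = b. It is a plain CPU illustration with dense nested lists, not the GPU library routine used in the paper, and the small test system is hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(a, x):
    """Product A x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in a]

def conjugate_gradient(a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = [float(bi) for bi in b]  # residual b - A x for the zero initial guess
    p = r[:]
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        # New search direction, conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Small symmetric positive definite system with exact solution (1/11, 7/11).
matrix = [[4.0, 1.0], [1.0, 3.0]]
rhs = [1.0, 2.0]
solution = conjugate_gradient(matrix, rhs)
```

The matrix-vector product and the inner products dominate each iteration, and both parallelize well, which is why this solver is a common target for GPU libraries.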
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
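One common way to expose the parallelism mentioned in this abstract for the Ising model is a checkerboard decomposition: sites of one color have no neighbors of the same color, so they can all be updated at once. The Python sketch below runs the updates sequentially to show the logic only. It is not taken from the contribution itself, and the lattice size, the inverse temperature and the seed are assumptions.

```python
import math
import random

def energy(spins):
    """Energy of a periodic square lattice with coupling J = 1 and no field."""
    size = len(spins)
    total = 0
    for i in range(size):
        for j in range(size):
            # Count each bond once via the right and down neighbors.
            total -= spins[i][j] * (spins[i][(j + 1) % size] + spins[(i + 1) % size][j])
    return total

def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep, updating the two sublattices in turn.

    The lattice size must be even for the coloring to be consistent.
    """
    size = len(spins)
    for color in (0, 1):
        for i in range(size):
            for j in range(size):
                if (i + j) % 2 != color:
                    continue
                neighbors = (
                    spins[(i - 1) % size][j] + spins[(i + 1) % size][j]
                    + spins[i][(j - 1) % size] + spins[i][(j + 1) % size]
                )
                delta = 2 * spins[i][j] * neighbors  # energy change of a flip
                if delta <= 0 or rng.random() < math.exp(-beta * delta):
                    spins[i][j] = -spins[i][j]

# Hypothetical run: a 4 x 4 lattice started fully ordered at very low temperature.
rng = random.Random(0)
lattice = [[1] * 4 for _ in range(4)]
ordered_energy = energy(lattice)
for _ in range(5):
    checkerboard_sweep(lattice, beta=10.0, rng=rng)
```

On a GPU, each inner double loop for a fixed color would become one kernel launch with one thread per site of that color.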
GPU Accelerated Vector Median Filter
NASA Technical Reports Server (NTRS)
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in terms of distance. General-purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which to the best of our knowledge has never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement in the performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimization of the GPU algorithm.
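The pairwise comparison cost described in this abstract is easiest to see in code. The Python sketch below is a straightforward CPU version of the vector median filter: for each window it returns the pixel whose summed Euclidean distance to all other pixels in the window is smallest. The border handling (clamping coordinates to the image edge), the window radius and the test image are assumptions, and this is not the authors' CUDA implementation.

```python
import math

def vector_median(vectors):
    """Return the vector with the smallest summed distance to all the others."""
    best, best_cost = None, math.inf
    for candidate in vectors:
        cost = sum(math.dist(candidate, other) for other in vectors)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best

def vector_median_filter(image, radius=1):
    """Apply the filter with a (2*radius + 1)-wide square window."""
    height, width = len(image), len(image[0])
    output = []
    for row in range(height):
        out_row = []
        for col in range(width):
            # Gather the window, clamping coordinates at the image border.
            window = [
                image[min(max(row + dr, 0), height - 1)][min(max(col + dc, 0), width - 1)]
                for dr in range(-radius, radius + 1)
                for dc in range(-radius, radius + 1)
            ]
            out_row.append(vector_median(window))
        output.append(out_row)
    return output

# A uniform 3 x 3 color image with one impulse-noise pixel in the center.
clean = (10, 20, 30)
noisy = [[clean] * 3 for _ in range(3)]
noisy[1][1] = (255, 0, 255)
filtered = vector_median_filter(noisy)
```

Every output pixel is computed independently of the others, so one GPU thread per pixel is the natural mapping.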
SIFT implementation based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Chao; Geng, Ze-xun; Wei, Xiao-feng; Shen, Chen
2013-08-01
Image matching is one of the core research topics of digital photogrammetry and computer vision. The SIFT (Scale-Invariant Feature Transform) algorithm is a feature-matching algorithm based on local invariant features, proposed by Lowe in 1999. SIFT features are invariant to image rotation and scaling, and partially invariant to changes in 3D camera viewpoint and illumination. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. The algorithm is therefore widely used in image matching and in 3D reconstruction from stereo images. Traditional implementations and optimizations of SIFT generally target the CPU. Because of the large number of extracted features (even a few objects can yield a large number of SIFT features), the high dimensionality of the feature vector (usually 128 dimensions), and the complexity of the algorithm, SIFT runs slowly on the CPU and struggles to meet real-time requirements. The Programmable Graphics Processing Unit (PGPU) is commonly used in current computer graphics as a dedicated device for image processing. Development experience in recent years shows that a high-performance GPU can achieve 10 times the single-precision floating-point performance of a high-performance desktop CPU of the same period, while the GPU's memory bandwidth is up to five times that of a desktop platform of the same period. For the same computing power, the cost and power consumption of a GPU should be lower than those of a CPU-based system. At the same time, because graphics rendering and image processing are parallel in nature, GPU-accelerated image processing becomes an efficient solution for algorithms with real-time requirements. In this paper, we implement the algorithm in the OpenGL shading language and compare it with the results obtained on the CPU
The experience of GPU calculations at Lunarc
NASA Astrophysics Data System (ADS)
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever-increasing demand for computational speed and the use of ever larger datasets, multi-GPU installations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually led to the procurement of the GPU resource Timaeus, which is a four-node, eight-GPU cluster with two NVIDIA M2050 GPU cards per node. The resource is housed within the larger cluster Platon and shares disk, network and system resources with that cluster. The inauguration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised a two-day workshop on GPU computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Luleå in fields ranging from astrophysics to molecular chemistry. We are investigating the use of GPUs with commercial software packages and user-supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU computing. It is clear that the combination of GPUs and CPUs is becoming an important part of high performance computing. Here we describe what has been done at Lunarc regarding GPU computations, and how we will continue to investigate new and upcoming multi-GPU servers and how they can be utilized in our environment.
Enhancing professionalism at GPU Nuclear
Coe, R.P.
1992-01-01
Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing professionalism at its Oyster Creek and Three Mile Island nuclear generating stations. The program was also to include its corporate headquarters in Parsippany, New Jersey. The overall program was to take several directions, including on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for senior reactor operators, and expanded teamwork and leadership training for control room crew. The largest portion of this initiative was the development and delivery of professionalism training to the nearly 2,000 people at both nuclear generating sites.
GPU Developments for General Circulation Models
NASA Astrophysics Data System (ADS)
Appleyard, Jeremy; Posey, Stan; Ponder, Carl; Eaton, Joe
2014-05-01
Current trends in high performance computing (HPC) are moving towards the use of graphics processing units (GPUs) to achieve speedups through the extraction of fine-grain parallelism of application software. GPUs have been developed exclusively for computational tasks as massively-parallel co-processors to the CPU, and during 2013 an extensive set of new HPC architectural features were developed in a 4th generation of NVIDIA GPUs that provide further opportunities for GPU acceleration of general circulation models used in climate science and numerical weather prediction. Today computational efficiency and simulation turnaround time continue to be important factors behind scientific decisions to develop models at higher resolutions and deploy increased use of ensembles. This presentation will examine the current state of GPU parallel developments for stencil based numerical operations typical of dynamical cores, and introduce new GPU-based implicit iterative schemes with GPU parallel preconditioning and linear solvers based on ILU, Krylov methods, and multigrid. Several GCMs show substantial gain in parallel efficiency from second-level fine-grain parallelism under first-level distributed memory parallel through a hybrid parallel implementation. Examples are provided relevant to science-scale HPC practice of CPU-GPU system configurations based on model resolution requirements of a particular simulation. Performance results compare use of the latest conventional CPUs with and without GPU acceleration. Finally a forward looking discussion is provided on the roadmap of GPU hardware, software, tools, and programmability for GCM development.
GPU-based Multilevel Clustering.
Chiosa, Iurie; Kolb, Andreas
2010-04-01
The processing power of parallel co-processors like the Graphics Processing Unit (GPU) is dramatically increasing. However, up until now only a few approaches have been presented to utilize this kind of hardware for mesh clustering purposes. In this paper we introduce a Multilevel clustering technique designed as a parallel algorithm and solely implemented on the GPU. Our formulation uses the spatial coherence present in the cluster optimization and hierarchical cluster merging to significantly reduce the number of comparisons in both parts. Our approach provides a fast, high quality and complete clustering analysis. Furthermore, based on the original concept we present a generalization of the method to data clustering. All advantages of the mesh-based techniques smoothly carry over to the generalized clustering approach. Additionally, this approach solves the problem of the missing topological information inherent to general data clustering and leads to a Local Neighbors k-means algorithm. We evaluate both techniques by applying them to Centroidal Voronoi Diagram (CVD) based clustering. Compared to classical approaches, our techniques generate results with at least the same clustering quality. Our technique proves to scale very well, currently being limited only by the available amount of graphics memory. PMID:20421676
GPU architecture usage for efficient image scaling
NASA Astrophysics Data System (ADS)
Skakov, P.
2013-05-01
Specifics of graphics processing unit (GPU) architecture are considered. Opportunities for relevant optimization of image processing algorithms are presented, such as use of the texture filtering block. The accuracy of image scaling and driver-dependent usage specifics are noted.
CULA: hybrid GPU accelerated linear algebra routines
NASA Astrophysics Data System (ADS)
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N^3) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2009-01-01
Accelerating hardware devices represent a novel promise for improving the performance for many problem domains, but it is not clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on the use of atomic instructions that have recently become available in NVIDIA devices.
Cholla: 3D GPU-based hydrodynamics code for astrophysical simulation
NASA Astrophysics Data System (ADS)
Schneider, Evan E.; Robertson, Brant E.
2016-07-01
Cholla (Computational Hydrodynamics On ParaLLel Architectures) models the Euler equations on a static mesh and evolves the fluid properties of thousands of cells simultaneously using GPUs. It can update over ten million cells per GPU-second while using an exact Riemann solver and PPM reconstruction, allowing computation of astrophysical simulations with physically interesting grid resolutions (>256^3) on a single device; calculations can be extended onto multiple devices with nearly ideal scaling beyond 64 GPUs.
GPU-accelerated computation of electron transfer.
Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco
2012-11-01
Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. PMID:22847673
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-01
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using compute unified device architecture-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using an NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings. PMID:22892645
Parallel LU Factorization on GPU cluster
D'Azevedo, Ed F; Hill, Judith C
2012-01-01
This paper describes our progress in developing software for performing parallel LU factorization of a large dense matrix on a GPU cluster. Three approaches, with increasing software complexity, are considered: (i) a naive 'thunking' approach that links the existing parallel ScaLAPACK software library with cuBLAS through a software emulation layer; (ii) a more intrusive magmaBLAS implementation integrated into the LU solver in the High-Performance Linpack software; and (iii) a left-looking out-of-core algorithm for solving problems that are larger than the available memory on GPU devices. A comparison of the performance gains versus the current ScaLAPACK PZGETRF is provided.
Colloquium: Large scale simulations on GPU clusters
NASA Astrophysics Data System (ADS)
Bernaschi, Massimo; Bisson, Mauro; Fatica, Massimiliano
2015-06-01
Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve an excellent efficiency for applications in statistical mechanics, particle dynamics and networks analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may also be applied to other problems like the solution of Partial Differential Equations.
Fast quantum Monte Carlo on a GPU
NASA Astrophysics Data System (ADS)
Lutsyshyn, Y.
2015-02-01
We present a scheme for the parallelization of quantum Monte Carlo method on graphical processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the Kepler architecture K20 GPU. Special optimization was developed for the Kepler cards, including placement of data structures in the register space of the Kepler GPUs. Kepler-specific optimization is discussed.
NASA Astrophysics Data System (ADS)
Souza, T. R.; Baptista, R.
2003-08-01
The secondary stars in cataclysmic variables (CVs) and low-mass X-ray binaries (LMXBs) are crucial for understanding the origin, evolution and behavior of these interacting binaries. They are magnetically active stars subjected to extreme environmental conditions [e.g., they are very close to a hot, irradiating source; they rotate extremely rapidly and have a distorted shape; they are losing mass at rates of 10^-8 to 10^-10 M☉/yr] that contribute to making their properties distinct from those of main-sequence stars of the same mass. On the other hand, the irradiation pattern on the face of the secondary provides information about the geometry of the accretion structures around the primary star. Obtaining images of the surface of these stars is therefore of great astrophysical interest. Roche tomography uses the variations in the emission/absorption line profiles of the secondary star as a function of orbital phase to map the brightness distribution on its surface. In this work we present the initial results of the development of a program for mapping the brightness distribution on the surface of the secondary stars in CVs and LMXBs with astro-tomography techniques. We currently have in operation a code that simulates the variations in the line profiles caused by the Doppler effect resulting from the combined rotation and translation of a Roche-lobe-shaped star around the center of mass of the binary, as a function of the brightness distribution on the surface of this star. The code also produces the light curve resulting from the changes in aspect of the star as a function of orbital phase (ellipsoidal variations).
Locality-Driven Dynamic GPU Cache Bypassing
Li, Chao; Song, Shuaiwen; Dai, Hongwen; Sidelnik, A.; Hari, Siva; Zhou, Huiyang
2015-06-07
This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. Based on the reuse characteristics of GPU workloads, we propose a design that integrates such efficient locality filtering capability into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
GPU Computing in Space Weather Modeling
NASA Astrophysics Data System (ADS)
Feng, X.; Zhong, D.; Xiang, C.; Zhang, Y.
2013-04-01
Space weather refers to conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and that affect human life or health. In order to make real-time or faster-than-real-time numerical predictions of adverse space weather events and their influence on the geospace environment, high-performance computational models are required. The main objective of this article is to explore the application of programmable graphics processing units (GPUs) to numerical space weather modeling for the study of the solar wind background, which is a crucial part of numerical space weather modeling. GPU programming is realized for our Solar-Interplanetary-CESE MHD model (SIP-CESE MHD model) by numerically studying the solar corona/interplanetary solar wind. The global solar wind structures are obtained by the established GPU model with the magnetic field synoptic data as input. The simulated global structures for Carrington rotation 2060 agree well with solar observations and solar wind measurements from spacecraft near the Earth. The model's implementation of adaptive mesh refinement (AMR) and the message passing interface (MPI) enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. Our initial tests with available hardware show speedups of roughly 5x compared to the traditional software implementation. This work presents a novel application of GPUs to the study of space weather.
Hyperspectral image feature extraction accelerated by GPU
NASA Astrophysics Data System (ADS)
Qu, HaiCheng; Zhang, Ye; Lin, Zhouhan; Chen, Hao
2012-10-01
The PCA (principal components analysis) algorithm is the most basic method of dimension reduction for high-dimensional data, and it plays a significant role in hyperspectral data compression, decorrelation, denoising and feature extraction. With the development of imaging technology, the number of spectral bands in a hyperspectral image keeps growing, and the data cube has become larger in recent years. As a consequence, dimension reduction is increasingly time-consuming. Fortunately, GPU-based high-performance computing has opened up a novel approach for hyperspectral data processing. This paper concerns the two main processes in hyperspectral image feature extraction: (1) calculation of the transformation matrix; (2) transformation in the spectral dimension. These two processes are computationally intensive and data-intensive, respectively. Using GPU parallel computing technology, an algorithm containing a PCA transformation based on eigenvalue decomposition (EVD) and feature matching identification is implemented, with the aim of exploring the characteristics of GPU parallel computing and the prospects of GPU application in hyperspectral image processing by analysing thread invocation and the speedup of the algorithm. Finally, the experimental results show that the algorithm reaches a 12x speedup in total, with certain steps reaching speedups of up to 270 times.
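The two processes named in this abstract (building the transformation matrix, then transforming along the spectral dimension) can be illustrated with a minimal CPU sketch. It is not the authors' GPU code: it swaps the full eigenvalue decomposition for power iteration, which recovers only the first principal component, and the function names and the tiny four-pixel, three-band data set are hypothetical.

```python
import math

def covariance_matrix(pixels):
    """Return (band covariance matrix, mean-centred pixels) for a list of spectra."""
    n, bands = len(pixels), len(pixels[0])
    means = [sum(p[b] for p in pixels) / n for b in range(bands)]
    centered = [[p[b] - means[b] for b in range(bands)] for p in pixels]
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(bands)] for a in range(bands)]
    return cov, centered

def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector.
    (Assumption: stands in for the full EVD mentioned in the abstract.)"""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(size)) for r in range(size)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigval = sum(vec[r] * sum(matrix[r][c] * vec[c] for c in range(size))
                 for r in range(size))
    return eigval, vec

# Hypothetical data: 4 pixels, 3 spectral bands.
pixels = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6], [4.0, 8.0, 0.5]]
cov, centered = covariance_matrix(pixels)          # process (1): transformation matrix
eigval, component = dominant_eigenpair(cov)
scores = [sum(x * w for x, w in zip(row, component)) for row in centered]  # process (2)
```

The projection step is independent per pixel, which is the kind of data-parallel work the abstract describes as data-intensive.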
GPU-based fast gamma index calculation.
Gu, Xuejun; Jia, Xun; Jiang, Steve B
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of γ-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computationally intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on one single NVIDIA Tesla C1060 card, achieving 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on CPU is proportional to the summation of γ^n over all voxels, while that on GPU is affected by γ^n distributions and is approximately proportional to the γ^n summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference and distance-to-agreement criteria also have an impact on γ-index calculation time. PMID:21317484
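The exhaustive search described above can be sketched for a 1D dose profile as follows. This is a brute-force illustration only, not the geometric method or the radial pre-sorting the authors implement; the function name, the sample profiles and the 0.03 dose / 3 mm criteria are assumptions chosen for the example.

```python
import math

def gamma_index_1d(ref_dose, eval_dose, spacing_mm, dose_tol, dist_tol_mm):
    """Brute-force gamma index on a 1D grid.

    For each reference point, search every evaluated point and keep the
    smallest combined distance / dose-difference metric.
    """
    gammas = []
    for i, d_ref in enumerate(ref_dose):
        best_sq = math.inf
        for j, d_eval in enumerate(eval_dose):
            dist = (j - i) * spacing_mm          # spatial separation
            diff = d_eval - d_ref                # dose difference
            g_sq = (dist / dist_tol_mm) ** 2 + (diff / dose_tol) ** 2
            best_sq = min(best_sq, g_sq)
        gammas.append(math.sqrt(best_sq))
    return gammas

# Hypothetical profiles on a 1 mm grid; the second evaluation is shifted by 1 mm.
reference = [0.0, 0.5, 1.0, 0.5, 0.0]
identical = gamma_index_1d(reference, reference, 1.0, 0.03, 3.0)
shifted = gamma_index_1d(reference, [0.0, 0.0, 0.5, 1.0, 0.5], 1.0, 0.03, 3.0)
pass_rate = sum(g <= 1.0 for g in shifted) / len(shifted)  # fraction with gamma <= 1
```

The outer loop over reference points is independent per point, which is what makes the calculation a natural candidate for one-thread-per-voxel execution.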
Accelerated GPU based SPECT Monte Carlo simulations
NASA Astrophysics Data System (ADS)
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency
GPU-based fast gamma index calculation
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Jia, Xun; Jiang, Steve B.
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of γ-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computationally intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on one single NVIDIA Tesla C1060 card, achieving 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on CPU is proportional to the summation of γ^n over all voxels, while that on GPU is affected by γ^n distributions and is approximately proportional to the γ^n summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference and distance-to-agreement criteria also have an impact on γ-index calculation time.
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
GPU Implementation of Preconditioning Method for Low-Speed Flows
NASA Astrophysics Data System (ADS)
Zhang, Jiale; Chen, Hongquan
2016-06-01
An improved preconditioning method for low-Mach-number flows is implemented on a GPU platform. The improved preconditioning method employs the fluctuation of the fluid variables to weaken the effect of truncation error on accuracy. The GPU parallel computing platform is used to accelerate the calculations. Details of both the improved preconditioning method and the GPU implementation technology are described in this paper. A set of typical low-speed flow cases is then simulated for both validation and performance analysis of the resulting GPU solver. Numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that the GPU desktop can serve as a cost-effective parallel computing platform to accelerate CFD simulations for low-speed flows substantially.
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison
Ma, Chao; Wang, Lirong; Xie, Xiang-Qun
2012-01-01
Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times superior to the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than existing GPU-accelerated sparse vector algorithm, when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
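For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number set in either one. A minimal CPU sketch of the all-vs-all calculation and nearest neighbor search follows; it is not the authors' CUDA algorithm, and the 8-bit fingerprints and function names are hypothetical stand-ins for real fingerprints such as Unity.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both
    total = bin(fp_a | fp_b).count("1")    # bits set in either
    return common / total if total else 0.0

def nearest_neighbors(queries, library):
    """For each query, return (index, score) of its most similar library entry."""
    matches = []
    for query in queries:
        scores = [tanimoto(query, fp) for fp in library]   # one row of the matrix
        best = max(range(len(library)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Hypothetical 8-bit fingerprints.
library = [0b10110010, 0b01101100, 0b11111000]
queries = [0b10110000, 0b00001100]
matches = nearest_neighbors(queries, library)
```

Every entry of the similarity matrix depends only on one query and one library fingerprint, so the rows can be computed independently, and the bit counts map onto hardware population-count instructions where those are available.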
Blind detection of giant pulses: GPU implementation
NASA Astrophysics Data System (ADS)
Ait-Allal, Dalal; Weber, Rodolphe; Dumez-Viou, Cédric; Cognard, Ismael; Theureau, Gilles
2012-01-01
Radio astronomical pulsar observations require specific instrumentation and dedicated signal processing to cope with the dispersion caused by the interstellar medium. Moreover, the quality of observations can be limited by radio frequency interference (RFI) generated by telecommunications activity. This article presents the innovative pulsar instrumentation based on graphical processing units (GPU) which has been designed at the Nançay Radio Astronomical Observatory. In addition, for giant pulse search, we propose a new approach which combines a hardware-efficient search method and some RFI mitigation capabilities. Although this approach is less sensitive than the classical approach, its advantage is that no a priori information on the pulsar parameters is required. The validation of a GPU implementation is under way.
Solving global optimization problems on GPU cluster
NASA Astrophysics Data System (ADS)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by means of reducing to data-independent subproblems with smaller dimension solved in parallel. The new element implemented in the research consists in using several graphic accelerators at different computing nodes. The paper also includes results of solving problems of well-known multiextremal test class GKLS on Lobachevsky supercomputer using tens of thousands of GPU cores.
GPU-based video motion magnification
NASA Astrophysics Data System (ADS)
Domżał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people to see subtle changes in the surrounding world that are otherwise not visible. With a modified version of the algorithm, VMM is also capable of hiding them. It is possible to magnify motion related to the breathing of patients in hospital in order to observe it, or to suppress it and extract other information, for example blood flow, from the stabilized image sequence. In both cases we would like to perform calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In the article we suggest that the VMM algorithm can be parallelized (each thread processes one pixel), and in order to prove that we implemented the algorithm on a GPU using CUDA technology. The CPU is used only to grab, write and display frames and to schedule work for the GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. In this work we present an approach that achieves a significant speedup over existing methods and allows VMM to process video in real time. This solution can be used as preprocessing for other algorithms in more complex systems or can find application wherever real-time motion magnification would be useful. It is worth mentioning that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer images are traditionally decompressed on the CPU. However, this is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processing Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: the first is the serial part, the second is the task-parallelism part, and the last is the data-parallelism part, which includes inverse quantization, inverse discrete wavelet transform (IDWT) and image post-processing. To reduce the execution time, the task-parallelism part is optimized with OpenMP techniques. The data-parallelism part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization and texture memory optimization. In particular, the IDWT can be sped up significantly by rewriting the 2D (two-dimensional) serial IDWT into a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallelism part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the CPU serial method.
Non-rigid multi-modal registration on the GPU
NASA Astrophysics Data System (ADS)
Vetter, Christoph; Guetter, Christoph; Xu, Chenyang; Westermann, Rüdiger
2007-03-01
Non-rigid multi-modal registration of images/volumes is becoming increasingly necessary in many medical settings. While efficient registration algorithms have been published, the speed of the solutions is a problem in clinical applications. Harnessing the computational power of graphics processing unit (GPU) for general purpose computations has become increasingly popular in order to speed up algorithms further, but the algorithms have to be adapted to the data-parallel, streaming model of the GPU. This paper describes the implementation of a non-rigid, multi-modal registration using mutual information and the Kullback-Leibler divergence between observed and learned joint intensity distributions. The entire registration process is implemented on the GPU, including a GPU-friendly computation of two-dimensional histograms using vertex texture fetches as well as an implementation of recursive Gaussian filtering on the GPU. Since the computation is performed on the GPU, interactive visualization of the registration process can be done without bus transfer between main memory and video memory. This allows the user to observe the registration process and to evaluate the result more easily. Two hybrid approaches distributing the computation between the GPU and CPU are discussed. The first approach uses the CPU for lower resolutions and the GPU for higher resolutions, the second approach uses the GPU to compute a first approximation to the registration that is used as starting point for registration on the CPU using double-precision. The results of the CPU implementation are compared to the different approaches using the GPU regarding speed as well as image quality. The GPU performs up to 5 times faster per iteration than the CPU implementation.
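The similarity measure at the core of this approach, mutual information computed from a two-dimensional joint intensity histogram, can be sketched on the CPU as follows. This is only an illustration under simplifying assumptions (images flattened to lists of intensities already quantized to integer bin indices, hypothetical function name); it does not reproduce the vertex-texture-fetch histogram or the Kullback-Leibler term of the paper.

```python
import math

def mutual_information(image_a, image_b, bins):
    """Mutual information (in nats) of two equally sized, pre-quantized images."""
    n = len(image_a)
    # Joint histogram: joint[i][j] counts pixels with intensity i in A and j in B.
    joint = [[0] * bins for _ in range(bins)]
    for a, b in zip(image_a, image_b):
        joint[a][b] += 1
    # Marginal distributions are the row and column sums of the joint histogram.
    p_a = [sum(row) / n for row in joint]
    p_b = [sum(joint[i][j] for i in range(bins)) / n for j in range(bins)]
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            if joint[i][j]:
                p_ij = joint[i][j] / n
                mi += p_ij * math.log(p_ij / (p_a[i] * p_b[j]))
    return mi

# Hypothetical 8-pixel images with 4 intensity levels.
modality_a = [0, 0, 1, 1, 2, 2, 3, 3]
modality_b = [3 - v for v in modality_a]       # inverted contrast, same structure
aligned = mutual_information(modality_a, modality_b, 4)
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

The inverted-contrast pair still scores the maximum possible value (log 4), which is why mutual information suits multi-modal data where corresponding tissues have different intensities.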
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Błażewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napierała, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shape optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting, and realistic visualization. CFD problems are very complex and need a lot of computational power to obtain results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphics cards with the CUDA environment because of its ease of programming and good computational performance. We used a finite difference discretization of the Navier-Stokes equations, where fluid is propagated over an Eulerian grid. In this model, the behavior of the fluid inside a cell depends only on the properties of local, surrounding cells; therefore it is well suited to the GPU-based architecture. In this paper we demonstrate how to use the computing power of GPUs efficiently for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPUs. Finally, we discuss various challenges of the multi-GPU implementation using the example of matrix multiplication.
Evaluating the power of GPU acceleration for IDW interpolation algorithm.
Mei, Gang
2014-01-01
We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm: the tiled version, which takes advantage of shared memory, and the CDP version, which is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for the IDW interpolation algorithm by comparing the performance of a CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental results show that the tiled version achieves speedups of 120x and 670x over the CPU version when the power parameter p is set to 2 and 3.0, respectively. In addition, compared to the naive GPU implementation, the tiled version is about two times faster. However, the CDP version is 4.8x ∼ 6.0x slower than the naive GPU version, and therefore does not have any potential advantages in practical applications. PMID:24707195
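For readers unfamiliar with the method, a minimal serial sketch of standard IDW interpolation follows. It shows only the weighting formula (weights proportional to 1/d^p) that the GPU versions parallelize, one query point per thread. The function name `idw_interpolate` and the sample data are illustrative assumptions, not code from the paper.

```python
import math

def idw_interpolate(known, query, power=2.0):
    """Interpolate a value at `query` (x, y) from `known` points (x, y, z).

    Each known point contributes with weight 1 / distance**power.
    """
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z in known:
        d = math.hypot(qx - x, qy - y)
        if d == 0.0:
            # The query coincides with a data point: return its value directly.
            return z
        w = 1.0 / d ** power
        weighted_sum += w * z
        weight_total += w
    return weighted_sum / weight_total

samples = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
midpoint_value = idw_interpolate(samples, (1.0, 0.0))  # equidistant, so the plain mean
```

Because every query point loops over all data points independently, the work maps naturally onto one GPU thread per query, with the data points staged through shared memory in a tiled scheme.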
GPU-based High-Performance Computing for Radiation Therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-01-01
Recent developments in radiotherapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid development. A large number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPUs in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPUs with other platforms will also be presented. PMID:24486639
GPU-based ultrafast IMRT plan optimization.
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purpose. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm² beamlet size and 2.5 × 2.5 × 2.5 mm³ voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy. PMID:19826201
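To make the optimization scheme named above concrete, here is a small sketch of gradient projection with an Armijo backtracking rule. It is applied to a toy non-negative least-squares problem, minimize 0.5 * ||Ax - b||^2 subject to x >= 0, standing in for beamlet intensities that cannot be negative. The objective, the matrix, and all function names are assumptions chosen for illustration; the paper's actual penalty model is not given in the abstract.

```python
def residual(A, b, x):
    """Return A x - b for a dense matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) - bi for row, bi in zip(A, b)]

def objective(A, b, x):
    return 0.5 * sum(r * r for r in residual(A, b, x))

def gradient(A, b, x):
    """Gradient of the objective: A^T (A x - b)."""
    r = residual(A, b, x)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def project(x):
    """Projection onto the feasible set x >= 0."""
    return [max(0.0, v) for v in x]

def gradient_projection(A, b, x0, iters=200, step0=1.0, beta=0.5, sigma=1e-4):
    x = project(x0)
    for _ in range(iters):
        g = gradient(A, b, x)
        f = objective(A, b, x)
        step = step0
        while True:
            x_new = project([xi - step * gi for xi, gi in zip(x, g)])
            # Armijo condition along the projection arc:
            # f(x_new) <= f(x) - sigma * g . (x - x_new)
            decrease = sum(gi * (xi - ni) for gi, xi, ni in zip(g, x, x_new))
            if objective(A, b, x_new) <= f - sigma * decrease or step < 1e-12:
                break
            step *= beta  # shrink the step and try again
        x = x_new
    return x

# The unconstrained minimizer is (1, -1); the constraint clips it to (1, 0).
solution = gradient_projection([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], [0.0, 0.0])
```

The matrix-vector products in `residual` and `gradient` dominate the cost, and they are the parts that a GPU implementation would parallelize.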
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
CFD Computations on Multi-GPU Configurations.
NASA Astrophysics Data System (ADS)
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations, often delivering a speed-up factor of 3 to 5 over conventional CPUs. In recent times, most PCs are supplied with the option of installing multiple GPUs on a single motherboard, thereby providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a setup that is configured to run two GPUs in parallel, and discuss its performance details.
GPU-completeness: theory and implications
NASA Astrophysics Data System (ADS)
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithms as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate this algorithmic space and the imaging algorithms space. This concept is based upon our experience in the print production area, where GPUs (Graphics Processing Units) have shown a substantial cost/performance advantage within the context of HP-delivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs are establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation informs the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrates the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers. PMID:25202164
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of the MrBayes (MC)³ algorithm on the Compute Unified Device Architecture (CUDA). By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all the previous (MC)³ algorithms and scales well to large GPU clusters. PMID:23493260
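As background for the (MC)³ scheme mentioned above, the sketch below shows the state-swap acceptance rule used in Metropolis-coupled MCMC in general. Chain i samples the posterior raised to a power beta_i, and a proposed swap between chains i and j is accepted with probability min(1, exp((beta_i - beta_j) * (logp_j - logp_i))). The heating schedule beta = 1 / (1 + T * i) and all function names are assumptions for illustration. This is not the a(MC)³ GPU code.

```python
import math

def heat(chain_index, temperature=0.1):
    """Power applied to the posterior of a chain; index 0 is the cold chain.

    The incremental schedule 1 / (1 + T * i) is assumed here for illustration.
    """
    return 1.0 / (1.0 + temperature * chain_index)

def swap_probability(logp_i, logp_j, beta_i, beta_j):
    """Acceptance probability for exchanging the states of chains i and j.

    logp_i and logp_j are the unheated log posterior values of the two states.
    """
    log_ratio = (beta_i - beta_j) * (logp_j - logp_i)
    return math.exp(min(0.0, log_ratio))

# A cold chain (beta = 1) always accepts a better state offered by a heated chain,
p_up = swap_probability(-10.0, -5.0, 1.0, 0.5)
# while giving up its better state for a worse one is accepted only sometimes.
p_down = swap_probability(-5.0, -10.0, 1.0, 0.5)
```

Between swap attempts the chains evolve independently, which is one source of the parallelism that multi-GPU implementations can exploit.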
Parallel hyperspectral compressive sensing method on GPU
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. The bandwidth of the connection between the satellite/airborne platform and the ground station is known to be limited, so an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results obtained using synthetic and real hyperspectral datasets on two different NVIDIA GPU architectures, GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.
A GPU accelerated PDF transparency engine
NASA Astrophysics Data System (ADS)
Recker, John; Lin, I.-Jong; Tastl, Ingeborg
2011-01-01
As commercial printing presses become faster, cheaper and more efficient, so too must the Raster Image Processors (RIP) that prepare data for them to print. Digital press RIPs, however, have been challenged to on the one hand meet the ever increasing print performance of the latest digital presses, and on the other hand process increasingly complex documents with transparent layers and embedded ICC profiles. This paper explores the challenges encountered when implementing a GPU accelerated driver for the open source Ghostscript Adobe PostScript and PDF language interpreter targeted at accelerating PDF transparency for high speed commercial presses. It further describes our solution, including an image memory manager for tiling input and output images and documents, a PDF compatible multiple image layer blending engine, and a GPU accelerated ICC v4 compatible color transformation engine. The result, we believe, is the foundation for a scalable, efficient, distributed RIP system that can meet current and future RIP requirements for a wide range of commercial digital presses.
Synthetic aperture elastography: a GPU based approach
NASA Astrophysics Data System (ADS)
Verma, Prashant; Doyley, Marvin M.
2014-03-01
Synthetic aperture (SA) ultrasound imaging system produces highly accurate axial and lateral displacement estimates; however, low frame rates and large data volumes can hamper its clinical use. This paper describes a real-time SA imaging based ultrasound elastography system that we have recently developed to overcome this limitation. In this system, we implemented both beamforming and 2D cross-correlation echo tracking on Nvidia GTX 480 graphics processing unit (GPU). We used one thread per pixel for beamforming; whereas, one block per pixel was used for echo tracking. We compared the quality of elastograms computed with our real-time system relative to those computed using our standard single threaded elastographic imaging methodology. In all studies, we used conventional measures of image quality such as elastographic signal to noise ratio (SNRe). Specifically, SNRe of axial and lateral strain elastograms computed with real-time system were 36 dB and 23 dB, respectively, which was numerically equal to those computed with our standard approach. We achieved a frame rate of 6 frames per second using our GPU based approach for 16 transmits and kernel size of 60 × 60 pixels, which is 400 times faster than that achieved using our standard protocol.
Parallelization and checkpointing of GPU applications through program transformation
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
GPU-Accelerated Denoising in 3D (GD3D)
Energy Science and Technology Software Center (ESTSC)
2013-10-01
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. This software addresses two facets of this promising application: what tuning is necessary to achieve optimal performance on a modern GPU, and what parameters yield the best denoising results in practice? To answer the first question, the software performs an autotuning step to empirically determine optimal memory blocking on the GPU. To answer the second, it performs a sweep of algorithm parameters to determine the combination that best reduces the mean squared error relative to a noiseless reference image.
NASA Astrophysics Data System (ADS)
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speed up the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of its fragmented data exchange. Our approach makes 3D decomposition practical on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS of a common MPI-only implementation with 2D decomposition. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs.
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. Computational performance is an important issue when analyzing such huge data sets. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biological data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method improves on the performance of BLASTP on a single GPU and also achieves high availability and fault tolerance. PMID:24955362
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model
NASA Astrophysics Data System (ADS)
Takaishi, Tetsuya
2015-01-01
The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time of the HMC algorithm on a GPU (GTX 760) and a CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
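The core of the HMC algorithm referred to above is a leapfrog integration of Hamiltonian dynamics followed by a Metropolis accept/reject test. The sketch below shows that structure for a one-dimensional standard normal target, with potential U(q) = q²/2. The target, the step size, and the function names are assumptions made for illustration; the RSV posterior and the CUDA Fortran code are not reproduced.

```python
import math
import random

def leapfrog(q, p, grad_u, step, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme."""
    p = p - 0.5 * step * grad_u(q)          # initial half step for momentum
    for k in range(n_steps):
        q = q + step * p                    # full step for position
        if k != n_steps - 1:
            p = p - step * grad_u(q)        # full step for momentum
    p = p - 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p

def hmc_step(q, u, grad_u, step, n_steps, rng):
    """One HMC update: draw a momentum, integrate, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, grad_u, step, n_steps)
    h_old = u(q) + 0.5 * p0 * p0
    h_new = u(q_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

def potential(q):
    return 0.5 * q * q       # negative log density of a standard normal

def grad_potential(q):
    return q

rng = random.Random(0)
q = 0.0
chain = []
for _ in range(1000):
    q = hmc_step(q, potential, grad_potential, 0.1, 10, rng)
    chain.append(q)
```

For a model with many latent variables, the gradient evaluations inside `leapfrog` act on all variables at once, which is the part that lends itself to GPU parallelization.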
Fast CGH computation using S-LUT on GPU.
Pan, Yuechao; Xu, Xuewu; Solanki, Sanjeev; Liang, Xinan; Tanjung, Ridwan Bin Adrian; Tan, Chiwei; Chong, Tow-Chong
2009-10-12
In computation of full-parallax computer-generated hologram (CGH), balance between speed and memory usage is always the core of algorithm development. To solve the speed problem of coherent ray trace (CRT) algorithm and memory problem of look-up table (LUT) algorithm without sacrificing reconstructed object quality, we develop a novel algorithm with split look-up tables (S-LUT) and implement it on graphics processing unit (GPU). Our results show that S-LUT on GPU has the fastest speed among all the algorithms investigated in this paper, while still maintaining low memory usage. We also demonstrate high quality objects reconstructed from CGHs computed with S-LUT on GPU. The GPU implementation of our new algorithm may enable real-time and interactive holographic 3D display in the future. PMID:20372585
GPU-assisted computation of centroidal Voronoi tessellation.
Rong, Guodong; Liu, Yang; Wang, Wenping; Yin, Xiaotian; Gu, Xianfeng David; Guo, Xiaohu
2011-03-01
Centroidal Voronoi tessellations (CVT) are widely used in computational science and engineering. The most commonly used method is Lloyd's method, and the L-BFGS method has recently been shown to be faster than Lloyd's method for computing the CVT. However, these methods run on the CPU and are still too slow for many practical applications. We present techniques to implement these methods on the GPU for computing the CVT on 2D planes and on surfaces, and demonstrate significant speedup of these GPU-based methods over their CPU counterparts. For CVT computation on a surface, we use a geometry image stored in the GPU to represent the surface for computing the Voronoi diagram on it. In our implementation a new technique is proposed for parallel regional reduction on the GPU for evaluating integrals over Voronoi cells. PMID:21233516
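Lloyd's method, named in the abstract above, alternates between assigning points to their nearest generator and moving each generator to the centroid of its cell. The sketch below is a CPU-only illustration under stated assumptions: the uniform density on the unit square is approximated by a regular grid of sample points, and the function names and starting generators are hypothetical. It is not the paper's GPU implementation.

```python
def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest(point, generators):
    """Index of the generator whose Voronoi cell contains the point."""
    return min(range(len(generators)),
               key=lambda k: squared_distance(point, generators[k]))


def cvt_energy(samples, generators):
    """Discrete CVT energy: sum of squared distances to the nearest generator."""
    return sum(squared_distance(s, generators[nearest(s, generators)])
               for s in samples)


def lloyd_step(samples, generators):
    """One Lloyd iteration: move every generator to the centroid of its cell."""
    sums = [[0.0, 0.0, 0] for _ in generators]
    for s in samples:                      # independent per sample
        k = nearest(s, generators)
        sums[k][0] += s[0]
        sums[k][1] += s[1]
        sums[k][2] += 1
    # An empty cell keeps its old generator.
    return [(sx / n, sy / n) if n else g
            for g, (sx, sy, n) in zip(generators, sums)]


# Regular 20 x 20 grid of samples standing in for a uniform density (assumption).
samples = [((i + 0.5) / 20, (j + 0.5) / 20) for i in range(20) for j in range(20)]
generators = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.3), (0.9, 0.8)]

energies = [cvt_energy(samples, generators)]
for _ in range(30):
    generators = lloyd_step(samples, generators)
    energies.append(cvt_energy(samples, generators))

print("initial energy:", energies[0], "final energy:", energies[-1])
```

The per-sample nearest-generator search and the per-cell sums are the steps that correspond, loosely, to the Voronoi diagram computation and the regional reduction that the abstract describes moving to the GPU.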
Computing prestack Kirchhoff time migration on general purpose GPU
NASA Astrophysics Data System (ADS)
Shi, Xiaohua; Li, Chuang; Wang, Shihu; Wang, Xu
2011-10-01
This paper introduces how to optimize a practical prestack Kirchhoff time migration program by the Compute Unified Device Architecture (CUDA) on a general purpose GPU (GPGPU). A few useful optimization methods on GPGPU are demonstrated, such as how to increase the kernel thread numbers on GPU cores, and how to utilize the memory streams to overlap GPU kernel execution time, etc. The floating-point errors on CUDA and NVidia's GPUs are discussed in detail. Some effective methods that can be used to reduce the floating-point errors are introduced. The images generated by the practical prestack Kirchhoff time migration programs for the same real-world seismic data inputs on CPU and GPU are demonstrated. The final GPGPU approach on NVidia GTX 260 is more than 17 times faster than its original CPU version on Intel's P4 3.0G.
GPU-based calculations in digital holography
NASA Astrophysics Data System (ADS)
Madrigal, R.; Acebal, P.; Blaya, S.; Carretero, L.; Fimia, A.; Serrano, F.
2013-05-01
In this work we apply GPUs (Graphical Processing Units) with the CUDA environment to scientific calculations, specifically to computationally expensive problems in the field of digital holography. To this end, we have studied three typical problems in digital holography: Fourier transforms, Fresnel reconstruction of the hologram, and the calculation of the vectorial diffraction integral. In all cases the runtime at different image sizes and the corresponding accuracy were compared to those obtained by traditional calculation systems. The programs have been run on a computer with a latest-generation graphics card, the Nvidia GTX 680, which is optimized for integer calculations. As a result, a large reduction in runtime has been obtained: specifically, 15-fold shorter times for Fresnel approximation calculations and 600-fold shorter times for the vectorial diffraction integral. These initial results open the possibility of applying such calculations in real-time digital holography.
GPU-accelerated micromagnetic simulations using cloud computing
NASA Astrophysics Data System (ADS)
Jermain, C. L.; Rowlands, G. E.; Buhrman, R. A.; Ralph, D. C.
2016-03-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
STEM image simulation with hybrid CPU/GPU programming.
Yao, Y; Ge, B H; Shen, X; Wang, Y G; Yu, R C
2016-07-01
STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. PMID:27093687
NASA Astrophysics Data System (ADS)
Tavares, E. T., Jr.; Klafke, J. C.
2003-08-01
This work sets out to recover an experience that took place at the Planetário de São Paulo in the 1960s. In 1962, Mr. Acácio, then 37 years old and visually impaired since the age of 27, began attending the classes given by Prof. Aristóteles Orsini to members of the Planetário staff. Mr. Acácio was the only disabled person in the class and, although he had basic and relatively advanced knowledge of mathematics, he had difficulty understanding and following the lectures, as well as in later study. To help him overcome these problems, Prof. Orsini requested the construction of mechanical models that, through the sense of touch, would allow him to follow the classes and transpose the model into a mental "construct". This practice proved so effective that it greatly facilitated the subject's learning of the material. Mr. Acácio went on to join the teaching staff of the Planetário/Escola Municipal de Astrofísica, and was responsible for the "Introduction to Astronomy" course for several years. Moreover, the experience was so successful that some of the models had their constituent elements painted distinctively for use in the regular courses of the Planetário, becoming an integral part of the institution's set of teaching resources. It is with this effectiveness in mind, both in its original objective (enabling a visually impaired person to learn) and in its subsidiary one (a systematic teaching resource for the institution), that we decided to recover this experience. Building on it, we believe it would be extremely productive, in educational terms, to improve the original models, now recovered and restored, and to create others that could be used to teach this science to the visually impaired.
gEMfitter: a highly parallel FFT-based 3D density fitting tool with GPU texture memory acceleration.
Hoang, Thai V; Cavin, Xavier; Ritchie, David W
2013-11-01
Fitting high resolution protein structures into low resolution cryo-electron microscopy (cryo-EM) density maps is an important technique for modeling the atomic structures of very large macromolecular assemblies. This article presents "gEMfitter", a highly parallel fast Fourier transform (FFT) EM density fitting program which can exploit the special hardware properties of modern graphics processor units (GPUs) to accelerate both the translational and rotational parts of the correlation search. In particular, by using the GPU's special texture memory hardware to rotate 3D voxel grids, the cost of rotating large 3D density maps is almost completely eliminated. Compared to performing 3D correlations on one core of a contemporary central processor unit (CPU), running gEMfitter on a modern GPU gives up to 26-fold speed-up. Furthermore, using our parallel processing framework, this speed-up increases linearly with the number of CPUs or GPUs used. Thus, it is now possible to use routinely more robust but more expensive 3D correlation techniques. When tested on low resolution experimental cryo-EM data for the GroEL-GroES complex, we demonstrate the satisfactory fitting results that may be achieved by using a locally normalised cross-correlation with a Laplacian pre-filter, while still being up to three orders of magnitude faster than the well-known COLORES program. PMID:24060989
GPU-based Parallel Application Design for Emerging Mobile Devices
NASA Astrophysics Data System (ADS)
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
Advantages of GPU technology in DFT calculations of intercalated graphene
NASA Astrophysics Data System (ADS)
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; Kowalski, Karol; Agrawal, Gagan
2013-03-01
Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU (instead of one core per node) and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the implementation behavior on future GPU systems.
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures. PMID:25866499
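The projection-slice theorem that FVR relies on can be checked numerically in a lower-dimensional analogue. The sketch below uses a small 2D image and a naive DFT, both chosen for illustration only (FVR itself works on a 3D volume and a 2D slice). It verifies that the 1D transform of a projection equals the zero-frequency row of the 2D transform. The names `dft1` and `dft2` are hypothetical helpers, not part of the paper's pipeline.

```python
import cmath


def dft1(x):
    """Naive 1D discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dft2(img):
    """Naive 2D DFT: transform every row, then every column."""
    n_rows, n_cols = len(img), len(img[0])
    rows = [dft1(r) for r in img]
    out = [[0j] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        col = dft1([rows[r][c] for r in range(n_rows)])
        for r in range(n_rows):
            out[r][c] = col[r]
    return out


# Small test image (arbitrary values chosen for illustration).
image = [[1, 2, 0, 1],
         [0, 3, 1, 0],
         [2, 1, 1, 4],
         [0, 0, 5, 1]]

# Project along the row axis: sum each column to get a 1D profile.
projection = [sum(image[r][c] for r in range(len(image)))
              for c in range(len(image[0]))]

slice_from_projection = dft1(projection)
central_slice = dft2(image)[0]            # row frequency zero

max_difference = max(abs(a - b)
                     for a, b in zip(slice_from_projection, central_slice))
print("largest difference between the two spectra:", max_difference)
```

Because one slice of the precomputed spectrum replaces a sum over the whole volume, each new view costs a slice extraction plus an inverse 2D transform, which is where the lower complexity quoted in the abstract comes from.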
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This poses a major challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available because of their cost. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
Finite Difference Elastic Wave Field Simulation On GPU
NASA Astrophysics Data System (ADS)
Hu, Y.; Zhang, W.
2011-12-01
Numerical modeling of seismic wave propagation is a basic and important part of investigating the Earth's structure and earthquake phenomena. Among various numerical methods, the finite-difference method is considered one of the most efficient tools for wave field simulation. However, as the scale of computation grows, computing power has become a bottleneck. With the development of hardware in recent years, the GPU has shown powerful computational ability and bright application prospects in scientific computing, as many GPU-based works demonstrate. So far, however, the GPU has not been widely used in wave field simulation. In this work, we present forward finite difference simulation of acoustic and elastic seismic wave propagation in heterogeneous media on NVIDIA graphics cards with the CUDA programming language. We also implement perfectly matched layers on the graphics cards to efficiently absorb outgoing waves on the fictitious edges of the grid. Comparison of the simulations with results from a CPU platform shows reliable accuracy and remarkable efficiency. This work shows that the GPU can be an effective platform for wave field simulation, and that it can also be used as a practical tool for real-time strong ground motion simulation.
A survey of CPU-GPU heterogeneous computing techniques
Mittal, Sparsh; Vetter, Jeffrey S.
2015-07-04
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
Performing efficient NURBS modeling operations on the GPU.
Krishnamurthy, Adarsh; Khardekar, Rahul; McMains, Sara; Haller, Kirk; Elber, Gershon
2009-01-01
We present algorithms for evaluating and performing modeling operations on NURBS surfaces using the programmable fragment processor on the Graphics Processing Unit (GPU). We extend our GPU-based NURBS evaluator that evaluates NURBS surfaces to compute exact normals for either standard or rational B-spline surfaces for use in rendering and geometric modeling. We build on these calculations in our new GPU algorithms to perform standard modeling operations such as inverse evaluations, ray intersections, and surface-surface intersections on the GPU. Our modeling algorithms run in real time, enabling the user to sketch on the actual surface to create new features. In addition, the designer can edit the surface by interactively trimming it without the need for retessellation. Our GPU-accelerated algorithm to perform surface-surface intersection operations with NURBS surfaces can output intersection curves in the model space as well as in the parametric spaces of both the intersecting surfaces at interactive rates. We also extend our surface-surface intersection algorithm to evaluate self-intersections in NURBS surfaces. PMID:19423879
Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms converge slowly and are therefore difficult to run on a PC. To deal with this issue, we use a parallel GPU to implement a widely used compressed sensing algorithm, the Linear Bregman algorithm. The linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm involves only vector and matrix multiplication and a thresholding operation, and is simpler and more efficient to program. We use C as the development language and adopt CUDA (Compute Unified Device Architecture) as the parallel computing architecture. In this paper, we compare the parallel Bregman algorithm with the traditional CPU implementation of the Bregman algorithm. In addition, we compare the parallel Bregman algorithm with other CS reconstruction algorithms, such as the OMP and TwIST algorithms. The results show that the parallel Bregman algorithm needs less time than these two algorithms, and is thus more convenient for real-time object reconstruction, which matters given the fast-growing demand for information technology.
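The abstract above notes that the linear Bregman algorithm needs only matrix-vector products and a thresholding step. A minimal CPU sketch of the commonly published form of the iteration follows: v <- v + A^T (b - A u), then u <- delta * shrink(v, mu). Whether the paper uses exactly this variant is not stated, so treat the update, the parameter values `mu` and `delta`, and the tiny test system as assumptions.

```python
def shrink(x, mu):
    """Soft thresholding: move x toward zero by mu and clamp at zero."""
    if x > mu:
        return x - mu
    if x < -mu:
        return x + mu
    return 0.0


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a, y):
    """Compute A^T y."""
    return [sum(a[i][j] * y[i] for i in range(len(a)))
            for j in range(len(a[0]))]


def linearized_bregman(a, b, mu, delta, n_iter):
    """Iterate v <- v + A^T (b - A u), u <- delta * shrink(v, mu)."""
    n = len(a[0])
    u = [0.0] * n
    v = [0.0] * n
    for _ in range(n_iter):
        residual = [bi - ai for bi, ai in zip(b, matvec(a, u))]
        v = [vj + gj for vj, gj in zip(v, rmatvec(a, residual))]
        u = [delta * shrink(vj, mu) for vj in v]
    return u


# Tiny underdetermined system (2 equations, 3 unknowns), for illustration only.
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
b = [2.0, 0.0]

u = linearized_bregman(A, b, mu=1.0, delta=0.5, n_iter=500)
residual_norm = sum((bi - ai) ** 2 for bi, ai in zip(b, matvec(A, u))) ** 0.5
print("solution estimate:", u)
print("residual norm:", residual_norm)
```

Each iteration is two matrix-vector products plus an elementwise threshold, so every output component can be computed by its own thread, which is presumably why the method ports to CUDA easily. The step size `delta` is kept below 1/||A A^T|| here, a condition usually quoted for convergence.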
GISAXS simulation and analysis on GPU clusters
NASA Astrophysics Data System (ADS)
Chourou, Slim; Sarje, Abhinav; Li, Xiaoye; Chan, Elaine; Hexemer, Alexander
2012-02-01
We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility makes it easy to tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.
IMPAIR: massively parallel deconvolution on the GPU
NASA Astrophysics Data System (ADS)
Sherry, Michael; Shearer, Andy
2013-02-01
The IMPAIR software is a high throughput image deconvolution tool for processing large out-of-core datasets of images, varying from large images with spatially varying PSFs to large numbers of images with spatially invariant PSFs. IMPAIR implements a parallel version of the tried and tested Richardson-Lucy deconvolution algorithm regularised via a custom wavelet thresholding library. It exploits the inherently parallel nature of the convolution operation to achieve quality results on consumer grade hardware: through the NVIDIA Tesla GPU implementation, the multi-core OpenMP implementation, and the cluster computing MPI implementation of the software. IMPAIR aims to address the problem of parallel processing in both top-down and bottom-up approaches: by managing the input data at the image level, and by managing the execution at the instruction level. These combined techniques will lead to a scalable solution with minimal resource consumption and maximal load balancing. IMPAIR is being developed as both a stand-alone tool for image processing, and as a library which can be embedded into non-parallel code to transparently provide parallel high throughput deconvolution.
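The Richardson-Lucy iteration at the core of the abstract above can be written in a few lines for a 1D signal. This sketch is an illustration only: it uses circular boundaries and a fixed three-point PSF, and it leaves out the wavelet regularisation and the out-of-core handling that IMPAIR adds. All names and test data are assumptions.

```python
def convolve_circular(signal, kernel):
    """Circular convolution with a centred, odd-length kernel."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[k] * signal[(i - (k - half)) % n]
                for k in range(len(kernel)))
            for i in range(n)]


def correlate_circular(signal, kernel):
    """Circular correlation, i.e. convolution with the flipped kernel."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[k] * signal[(i + (k - half)) % n]
                for k in range(len(kernel)))
            for i in range(n)]


def richardson_lucy(data, psf, n_iter):
    """Multiplicative Richardson-Lucy update, starting from the observed data."""
    estimate = list(data)
    for _ in range(n_iter):
        blurred = convolve_circular(estimate, psf)
        ratio = [d / m for d, m in zip(data, blurred)]
        correction = correlate_circular(ratio, psf)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


# Synthetic test: flat background with one bright source, blurred by the PSF.
psf = [0.25, 0.5, 0.25]                   # normalised so that it sums to 1
truth = [1.0] * 16
truth[8] = 11.0
data = convolve_circular(truth, psf)

estimate = richardson_lucy(data, psf, 50)
print("blurred peak:", max(data), "restored peak:", max(estimate))
```

Every output sample of the two convolutions depends only on a small neighbourhood of the input, which is the inherently parallel structure the abstract refers to.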
GPU Lossless Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA. The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO-42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
Adaptive mesh fluid simulations on GPU
NASA Astrophysics Data System (ADS)
Wang, Peng; Abel, Tom; Kaehler, Ralf
2010-10-01
We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.
gPGA: GPU Accelerated Population Genetics Analyses
Zhou, Chunbao; Lang, Xianyu; Wang, Yangang; Zhu, Chaodong
2015-01-01
Background: The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. But the computational burden of the IM program has placed limits on its application. Methodology: With its strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on Compute Unified Device Architecture (CUDA), which we call gPGA. Conclusions: Compared with the IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA. PMID:26248314
GPU's for event reconstruction in the FairRoot framework
NASA Astrophysics Data System (ADS)
Al-Turany, M.; Uhlig, F.; Karabowicz, R.
2010-04-01
FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processor units (GPUs) for event reconstruction in FairRoot will be presented. The fact that the CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler makes it possible to mix GPU code with general-purpose code for the host CPU. On this basis, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
Research on GPU Acceleration for Monte Carlo Criticality Calculation
NASA Astrophysics Data System (ADS)
Xu, Qi; Yu, Ganglin; Wang, Kan
2014-06-01
The Monte Carlo neutron transport method can be naturally parallelized on multi-core architectures due to the independence between particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular form of parallelism in the field of scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The "neutron transport step" is introduced to increase the GPU thread occupancy. In order to test the sensitivity to the MC code's complexity, a 1D one-group code and a 3D multi-group general purpose code are each ported to GPUs, and the acceleration effects are compared. The numerical experiments show a considerable acceleration effect from the "neutron transport step" strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs.
GPU Implementation of a Viscous Flow Solver on Unstructured Grids
NASA Astrophysics Data System (ADS)
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularity in scientific computing over the past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculation, so a recent GPU card, whose peak computing performance and memory bandwidth are much better than those of a contemporary high-end CPU, is preferable. We herein focus on the detailed implementation of our GPU-targeting Reynolds-averaged Navier-Stokes equations solver based on the finite-volume method. The solver employs a vertex-centered scheme on unstructured grids so that it can handle complex topologies. Multiple optimizations are carried out to improve memory access performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using an explicit Runge-Kutta scheme. The GPU-accelerated solver in this paper is demonstrated to have competitive advantages over the CPU-targeting one.
Numerical cosmology on the GPU with Enzo and Ramses
NASA Astrophysics Data System (ADS)
Gheller, C.; Wang, P.; Vazza, F.; Teyssier, R.
2015-09-01
A number of scientific numerical codes can currently exploit GPUs with remarkable performance. In astrophysics, Enzo and Ramses are prime examples of such applications. The two codes have been ported to GPUs adopting different strategies and programming models, Enzo adopting CUDA and Ramses using OpenACC. We describe here the different solutions used for the GPU implementation of both cases. Performance benchmarks will be presented for Ramses. The results of the usage of the more mature GPU version of Enzo, adopted for a scientific project within the CHRONOS programme, will be summarised.
Accelerating Pseudo-Random Number Generator for MCNP on GPU
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu
2010-09-01
Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) must have a long period, high quality, and a flexible jump capability, and must be fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPUs) using the CUDA programming model. Results show that speedups of 3.80 to 8.10 times are achieved compared with 4- to 6-core CPUs, and more than 679.18 million double-precision random numbers can be generated per second on the GPU.
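The "flexible jump" requirement can be illustrated with a linear congruential generator (LCG), for which the state n steps ahead can be computed in O(log n) operations so that each parallel thread starts at its own offset in the sequence. This is a hedged sketch: the 64-bit constants below are illustrative and are not MCNP's parameters, and the function names are invented for the example.

```python
# Illustrative LCG constants (assumption; not the MCNP generator's parameters).
A = 6364136223846793005
C = 1442695040888963407
M = 2 ** 64

def lcg_next(state, a=A, c=C, m=M):
    """Advance the generator by one step: x -> (a*x + c) mod m."""
    return (a * state + c) % m

def lcg_jump(state, n, a=A, c=C, m=M):
    """Advance the generator by n steps using repeated squaring of the affine map."""
    acc_mult, acc_plus = 1, 0      # accumulated map, starts as the identity
    cur_mult, cur_plus = a, c      # map for 2**k steps, starts with one step
    while n > 0:
        if n & 1:
            # compose: apply the current 2**k-step map after the accumulated one
            acc_mult = (acc_mult * cur_mult) % m
            acc_plus = (acc_plus * cur_mult + cur_plus) % m
        # square the current map: 2**k steps -> 2**(k+1) steps
        cur_plus = ((cur_mult + 1) * cur_plus) % m
        cur_mult = (cur_mult * cur_mult) % m
        n >>= 1
    return (acc_mult * state + acc_plus) % m

seed = 12345
stepped = seed
for _ in range(1000):
    stepped = lcg_next(stepped)
jumped = lcg_jump(seed, 1000)
```

In a parallel setting, thread t could call `lcg_jump(seed, t * stride)` once to obtain a private starting state, where `stride` is a hypothetical per-thread sequence length.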
Multi-GPU kinetic solvers using MPI and CUDA
NASA Astrophysics Data System (ADS)
Zabelok, Sergey; Arslanbekov, Robert; Kolobov, Vladimir
2014-12-01
This paper describes recent progress towards porting a Unified Flow Solver (UFS) to heterogeneous parallel computing. The main challenge of porting UFS to graphics processing units (GPUs) comes from the dynamically adapted mesh, which causes irregular data access. We describe the implementation of CUDA kernels for three modules in UFS: the direct Boltzmann solver using the discrete velocity method (DVM), the DSMC module, and the Lattice Boltzmann Method (LBM) solver, all using an octree Cartesian mesh with adaptive mesh refinement (AMR). Double-digit speedup on a single GPU and good scaling for multi-GPU have been demonstrated.
Full Stokes glacier model on GPU
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2015-04-01
Two different approaches are commonly used in glacier ice flow modeling: models based on asymptotic approximations of ice physics and full Stokes models. Lower order models are computationally lighter but reach their limits in regions of complex flow, while full Stokes models are more exact but computationally expensive. To overcome this constraint, we investigate the potential of GPU acceleration in glacier modeling. The goal of this preliminary research is to develop a three-dimensional full Stokes numerical model and apply it to glacier flow. We numerically solve the nonlinear Stokes momentum balance equations together with the incompressibility equation. Strong nonlinearities for the ice rheology are also taken into account. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme. We have ported it to C-CUDA to run it on GPUs. Our model is benchmarked against other full Stokes solutions for all diagnostic ISMIP-HOM experiments (Pattyn et al., 2008). The preliminary results show good agreement with the other models. The major advantages of our programming approach are simplicity and a speed-up of order 10-100 times in comparison to the serial CPU version of the code. Future work will include some real world applications and we will implement the free surface evolution capabilities. References: [1] F. Pattyn, L. Perichon, A. Aschwanden, B. Breuer, D.B. Smedt, O. Gagliardini, G.H. Gudmundsson, R.C.A. Hindmarsh, A. Hubbard, J.V. Johnson, T. Kleiner, Y. Konovalov, C. Martin, A.J. Payne, D. Pollard, S. Price, M. Ruckamp, F. Saito, S. Sugiyama, and T. Zwinger, Benchmark experiments for higher-order and full-Stokes ice sheet models (ISMIP-HOM), The Cryosphere, 2 (2008), 95-108.
Accelerating DNA analysis applications on GPU clusters
Tumeo, Antonino; Villa, Oreste
2010-06-13
DNA analysis is an emerging application of high performance bioinformatics. Modern sequencing machinery is able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges for current high performance software and hardware implementations. An adequate mapping of the algorithm onto the target architecture, coping with the limits of the underlying hardware, is required to reach the desired high throughputs. Load balancing also plays a crucial role when considering the limited bandwidth among the nodes of these systems. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with GPUs. We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
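For readers unfamiliar with the algorithm named in this abstract, a compact serial Aho-Corasick matcher is sketched below. It is a textbook-style illustration, not the partitioned GPU/MPI implementation the paper describes; the function names and the sample DNA strings are assumptions made for the example.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie, failure links and output lists for a set of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal so shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

automaton = build_automaton(["ACG", "CGT", "GT"])
hits = search("ACGTACG", automaton)
```

The work per input character depends on how many failure links are followed and how many matches are reported, which is one source of the performance variability the abstract mentions.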
POM.gpu-v1.0: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.
2015-09-01
Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
A survey of GPU-based medical image computing techniques.
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming; Wang, Defeng
2012-09-01
Medical imaging currently plays a crucial role across clinical applications, from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets that must be processed in practical clinical applications. With the rapidly improving performance of graphics processors, improved programming support, and an excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for beginners and researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine. PMID:23256080
GPU-based Volume Rendering for Medical Image Visualization.
Heng, Yang; Gu, Lixu
2005-01-01
Despite the rapid advancement of medical image visualization and augmented virtual reality applications, the low performance of the volume rendering algorithm is still a bottleneck. To make better use of well-developed hardware resources, a novel graphics processing unit (GPU)-based volume ray-casting algorithm is proposed in this paper. Running on a normal PC, it achieves interactive rates while keeping the same image quality as the traditional volume rendering algorithm. Recently, GPU-accelerated direct volume rendering has positioned itself as an efficient tool for the display and visual analysis of volume data. However, for large medical image data it often shows low efficiency because of the large amount of memory required. Furthermore, it has the drawback of writing color buffers multiple times per frame. The proposed algorithm improves the situation by implementing the ray casting operation completely on the GPU. It needs only one slice plane from the CPU and one 3D texture to store data, while the GPU calculates the two end points of each ray and carries out the color blending operation in its pixel programs. As a result, both the rendering speed and the memory consumption are improved, and the algorithm can handle most medical image data on normal PCs at interactive speed. PMID:17281405
Computing 2D constrained Delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges. PMID:23492377
GPU-accelerated denoising of 3D magnetic resonance images
Howison, Mark; Wes Bethel, E.
2014-05-29
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
Optimizing a mobile robot control system using GPU acceleration
NASA Astrophysics Data System (ADS)
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
QYMSYM: A GPU-accelerated hybrid symplectic integrator
NASA Astrophysics Data System (ADS)
Moore, Alexander; Quillen, Alice C.
2012-10-01
QYMSYM is a GPU accelerated 2nd order hybrid symplectic integrator that identifies close approaches between particles and switches from symplectic to Hermite algorithms for particles that require higher resolution integrations. This is a parallel code running with CUDA on a video card that puts the many processors on board to work while taking advantage of fast shared memory.
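As background for the term "2nd order symplectic integrator", the sketch below shows a generic kick-drift-kick leapfrog step for a test particle orbiting a fixed central mass. It is an assumption-laden illustration (units with GM = 1, a single particle, invented function names) and does not reproduce QYMSYM's hybrid scheme or its switch to a Hermite integrator for close approaches.

```python
import math

def accel(x, y, gm=1.0):
    """Gravitational acceleration from a point mass at the origin."""
    r3 = (x * x + y * y) ** 1.5
    return -gm * x / r3, -gm * y / r3

def leapfrog_step(state, dt):
    """One kick-drift-kick step; second-order accurate and symplectic."""
    x, y, vx, vy = state
    ax, ay = accel(x, y)
    vx += 0.5 * dt * ax          # half kick
    vy += 0.5 * dt * ay
    x += dt * vx                 # full drift
    y += dt * vy
    ax, ay = accel(x, y)
    vx += 0.5 * dt * ax          # half kick
    vy += 0.5 * dt * ay
    return x, y, vx, vy

def energy(state, gm=1.0):
    """Specific orbital energy (kinetic plus potential)."""
    x, y, vx, vy = state
    return 0.5 * (vx * vx + vy * vy) - gm / math.hypot(x, y)

state = (1.0, 0.0, 0.0, 1.0)     # circular orbit of radius 1
e0 = energy(state)
for _ in range(10000):
    state = leapfrog_step(state, 0.01)
e1 = energy(state)
```

The energy error of such a scheme stays bounded over long integrations rather than drifting, which is the usual reason for choosing symplectic methods in orbital dynamics.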
Geological Visualization System with GPU-Based Interpolation
NASA Astrophysics Data System (ADS)
Huang, L.; Chen, K.; Lai, Y.; Chang, P.; Song, S.
2011-12-01
A large body of research uses parallel processing on GPUs to accelerate computation. In Near Surface Geology, efficient interpolation is critical for proper interpretation of measured data. Additionally, the appropriate interpolation method for generating proper results depends on factors such as the density of the measured locations and the estimation model. Therefore, a fast interpolation process is needed to efficiently find a proper interpolation algorithm for a set of collected data. However, a general CPU framework has to process each computation sequentially and is not efficient enough to handle the large number of interpolations generally needed in Near Surface Geology. On careful inspection of the interpolation process, the computation for each grid point is independent of all other computations. Therefore, the GPU parallel framework should be an efficient technology to accelerate the interpolation process, which is critical in Near Surface Geology. In this paper we thus design a geological visualization system whose core includes a set of interpolation algorithms, including Nearest Neighbor, Inverse Distance and Kriging. All these interpolation algorithms are implemented using both the CPU framework and the GPU framework. The comparison between the CPU and GPU implementations in terms of precision and processing speed shows that parallel computation can accelerate the interpolation process, and also demonstrates the possibility of using a GPU-equipped personal computer to replace an expensive workstation. Immediate update at the measurement site is the dream of geologists. In the future, the parallel and remote computation ability of the cloud will be explored to make mobile computation at the measurement site possible.
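The observation that each grid point is computed independently can be made concrete with inverse distance interpolation, one of the methods listed in this abstract. The sketch below is a serial illustration under assumed conventions (2D points, integer grid coordinates, a distance power of 2, invented function names); it is not the system's actual code.

```python
def idw(px, py, samples, power=2.0):
    """Inverse-distance-weighted estimate at (px, py).

    samples is a list of (x, y, value) measurements.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        d2 = (px - sx) ** 2 + (py - sy) ** 2
        if d2 == 0.0:
            return value              # exactly on a measurement location
        w = d2 ** (-power / 2.0)      # weight = 1 / distance**power
        num += w * value
        den += w
    return num / den

def interpolate_grid(nx, ny, samples):
    """Evaluate every grid point; no point depends on any other point."""
    return [[idw(i, j, samples) for i in range(nx)] for j in range(ny)]

samples = [(0.0, 0.0, 10.0), (4.0, 0.0, 20.0)]
grid = interpolate_grid(5, 1, samples)
```

Because the body of `interpolate_grid` has no dependence between iterations, each (i, j) evaluation could be assigned to its own GPU thread.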
GPU implementations for fast factorizations of STAP covariance matrices
NASA Astrophysics Data System (ADS)
Roeder, Michael; Davis, Nolan; Furtek, Jeremy; Braunreiter, Dennis; Healy, Dennis
2008-08-01
One of the main goals of the STAP-BOY program has been the implementation of a space-time adaptive processing (STAP) algorithm on graphics processing units (GPUs) with the goal of reducing the processing time. Within the context of GPU implementation, we have further developed algorithms that exploit data redundancy inherent in particular STAP applications. Integration of these algorithms with GPU architecture is of primary importance for fast algorithmic processing times. STAP algorithms involve solving a linear system in which the transformation matrix is a covariance matrix. A standard method involves estimating a covariance matrix from a data matrix, computing its Cholesky factors by one of several methods, and then solving the system by substitution. Some STAP applications have redundancy in successive data matrices from which the covariance matrices are formed. For STAP applications in which a data matrix is updated with the addition of a new data row at the bottom and the elimination of the oldest data in the top of the matrix, a sequence of data matrices have multiple rows in common. Two methods have been developed for exploiting this type of data redundancy when computing Cholesky factors. These two methods are referred to as 1) Fast QR factorizations of successive data matrices 2) Fast Cholesky factorizations of successive covariance matrices. We have developed GPU implementations of these two methods. We show that these two algorithms exhibit reduced computational complexity when compared to benchmark algorithms that do not exploit data redundancy. More importantly, we show that when these algorithmic improvements are optimized for the GPU architecture, the processing times of a GPU implementation of these matrix factorization algorithms may be greatly improved.
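The "standard method" described in this abstract (estimate a covariance matrix from a data matrix, compute its Cholesky factor, then solve by substitution) can be sketched as follows. This is a simplified real-valued illustration with invented names and made-up data; STAP data are generally complex-valued, and the two fast methods that exploit row redundancy are not shown.

```python
import math

def covariance(data):
    """Unnormalized sample covariance R = X^T X for a real data matrix X (rows = snapshots)."""
    n = len(data[0])
    return [[sum(row[i] * row[j] for row in data) for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def solve_cholesky(l, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                       # forward: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # backward: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

data = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
r = covariance(data)
b = [1.0, 0.0, 2.0]
x = solve_cholesky(cholesky(r), b)
```

When a new data row replaces the oldest one, recomputing `covariance` and `cholesky` from scratch repeats most of the previous work, which is the redundancy the two fast methods are designed to exploit.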
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848
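The Broyden approach mentioned in this abstract replaces a full Jacobian recomputation with a rank-one correction. A minimal sketch follows, assuming the classical ("good") Broyden formula J_new = J + ((df - J dx) dx^T) / (dx^T dx); whether the paper uses exactly this variant is an assumption, and the function name and numbers are illustrative.

```python
def broyden_update(jac, dx, df):
    """Rank-one Broyden update of a Jacobian estimate.

    jac: m x n list of lists, dx: parameter change (length n),
    df: observed change in the forward-model output (length m).
    """
    m, n = len(jac), len(dx)
    j_dx = [sum(jac[i][k] * dx[k] for k in range(n)) for i in range(m)]
    norm2 = sum(v * v for v in dx)
    return [
        [jac[i][k] + (df[i] - j_dx[i]) * dx[k] / norm2 for k in range(n)]
        for i in range(m)
    ]

jac = [[1.0, 0.0], [0.0, 1.0]]
dx = [1.0, 2.0]
df = [3.0, 5.0]
new_jac = broyden_update(jac, dx, df)
predicted = [sum(new_jac[i][k] * dx[k] for k in range(2)) for i in range(2)]
```

The updated matrix reproduces the observed change along the step direction while leaving its action on directions orthogonal to `dx` unchanged, and the update is built from matrix-vector and outer products that parallelize well.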
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
Jia, X; Folkerts, M; Shi, F; Yan, H; Yan, Y; Jiang, S; Sivagnanam, S; Majumdar, A
2014-06-01
Purpose: The Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, repeated effort is often spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purposes. After the code has finished running, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is found to be fully functioning. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary. Program title: SWsolver. Catalogue identifier: AEGY_v1_0. Program summary URL
[Design of a volume-rendering toolkit using GPU-based ray-casting].
Liu, Wen-Qing; Chen, Chun-Xiao; Lu, Li-Na
2009-09-01
This paper presents an approach to GPU-based ray-casting on a shader model 3.0 compatible graphics card. In addition, a software toolkit is designed using the proposed algorithm to take full advantage of the GPU by extending VTK. Experimental results show remarkable speedups with the GPU-based algorithm, and high-quality renderings can be achieved at interactive framerates above 60 fps. The toolkit provides a high level of usability and extensibility. PMID:20073244
Fast computer simulation of reconstructed image from rainbow hologram based on GPU
NASA Astrophysics Data System (ADS)
Shuming, Jiao; Yoshikawa, Hiroshi
2015-10-01
A fast computer simulation solution for rainbow hologram reconstruction based on GPU is proposed. In the commonly used segment Fourier transform method for rainbow hologram reconstruction, the computation of 2D Fourier transform on each hologram segment is very time consuming. GPU-based parallel computing can be applied to improve the computing speed. Compared with CPU computing, simulation results indicate that our proposed GPU computing can effectively reduce the computation time by as much as eight times.
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
NASA Astrophysics Data System (ADS)
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate the dendritic growth with motion of solids in a dilute binary alloy melt. The GPU accelerated PFLB method has been implemented using CUDA C. The equiaxed dendritic growth in a shear flow and settling condition have been simulated by the developed GPU code. It has been confirmed that the PFLB simulations were efficiently accelerated by introducing the GPU computation. The characteristic dendrite morphologies which depend on the melt flow and the motion of the dendrite could also be confirmed by the simulations.
GPU-based 3D lower tree wavelet video encoder
NASA Astrophysics Data System (ADS)
Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Drummond, Leroy Anthony; Migallón, Hector
2013-12-01
The 3D-DWT is a mathematical tool of increasing importance in applications that require efficient processing of huge amounts of volumetric information. Other applications, such as professional video editing, video surveillance, multi-spectral satellite imaging, and HQ video delivery, would rather use 3D-DWT encoders to reconstruct a frame as fast as possible. In this article, we introduce a fast GPU-based encoder which uses the 3D-DWT transform and lower trees. We also present an exhaustive analysis of the use of GPU memory. Our proposal shows a good trade-off between R/D, coding delay (as fast as MPEG-2 for high definition) and memory requirements (up to 6 times less memory than x264).
Implementing the projected spatial rich features on a GPU
NASA Astrophysics Data System (ADS)
Ker, Andrew D.
2014-02-01
The Projected Spatial Rich Model (PSRM) generates powerful steganalysis features, but requires the calculation of tens of thousands of convolutions with image noise residuals. This makes it very slow: the reference implementation takes an impractical 20-30 minutes per 1 megapixel (Mpix) image. We present a case study which first tweaks the definition of the PSRM features, to make them more efficient, and then optimizes an implementation on GPU hardware which exploits their parallelism (whilst avoiding the worst of their sequentiality). Some nonstandard CUDA techniques are used. Even with only a single GPU, the time for feature calculation is reduced by three orders of magnitude, and the detection power is reduced only marginally.
Implementation and Optimization of Image Processing Algorithms on Embedded GPU
NASA Astrophysics Data System (ADS)
Singhal, Nitin; Yoo, Jin Woo; Choi, Ho Yeol; Park, In Kyu
In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on embedded GPU using OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantage when compared to embedded CPU. Additionally, we propose techniques to achieve increased performance with optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of the execution time and speed-up achieved in comparison with the implementation on embedded CPU.
Rapid Parallel Calculation of shell Element Based On GPU
NASA Astrophysics Data System (ADS)
Wang, Jian Hua; Li, Guang Yao; Li, Sheng; Li, Guang Yao
2010-06-01
Long computing times have been a bottleneck for the application of the finite element method (FEM). In this paper, an effective method to speed up FEM calculation using a modern graphics processing unit and programmable rendering tools is put forward. The method devises a representation of the element information suited to the features of the GPU, converts all element calculations into a rendering process, carries out the internal-force calculation for all elements, and overcomes the low level of parallelism of earlier implementations running on a single computer. The study shows that this method can greatly improve efficiency and shorten calculation time. Simulation results for an elasticity problem with a large number of elements in sheet metal show that the GPU parallel simulation is faster than the CPU one. The approach is useful and efficient for solving engineering problems.
GPU-specific reformulations of image compression algorithms
NASA Astrophysics Data System (ADS)
Matela, Jiří; Holub, Petr; Jirman, Martin; Šrom, Martin
2012-10-01
Image compression has a number of applications in various fields, where processing throughput and/or latency is a crucial attribute and the main limitation of state-of-the-art implementations of compression algorithms. At the same time contemporary GPU platforms provide tremendous processing power but they call for specific algorithm design. We discuss key components of successful design of compression algorithms for GPUs and demonstrate this on JPEG and JPEG2000 implementations, each of which contains several types of algorithms requiring different approaches to efficient parallelization for GPUs. Performance evaluation of the optimized JPEG and JPEG2000 chain is used to demonstrate the importance of various aspects of GPU programming, especially with respect to real-time applications.
Interactive brain shift compensation using GPU based programming
NASA Astrophysics Data System (ADS)
van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
2009-02-01
Processing large image files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphics processing units (GPUs) has increased significantly. With the pixel shader model 4.0, the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU-based programming enables real-time processing of high-definition image datasets, and various applications can be developed in medicine, optics and image sciences.
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
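For orientation, the quantity being computed can be sketched in a few lines. The following is a minimal illustration of the standard FTLE definition, sigma = ln(sqrt(lambda_max(C))) / |T| with C = F^T F the Cauchy-Green tensor of the flow-map gradient F. It is not the OpenCL implementation described in the abstract: the analytic saddle flow, the function names `flow_map` and `ftle`, and the step `h` are assumptions made for this sketch, and no mesh-to-particle interpolation is involved.

```python
import math

def flow_map(x, y, T):
    """Closed-form flow map of the linear saddle dx/dt = x, dy/dt = -y (assumed test flow)."""
    return x * math.exp(T), y * math.exp(-T)

def ftle(x, y, T, h=1e-3):
    """FTLE at (x, y) over integration time T, from central differences of the flow map."""
    xr, yr = flow_map(x + h, y, T)
    xl, yl = flow_map(x - h, y, T)
    xu, yu = flow_map(x, y + h, T)
    xd, yd = flow_map(x, y - h, T)
    # Deformation gradient F of the flow map.
    f11, f12 = (xr - xl) / (2 * h), (xu - xd) / (2 * h)
    f21, f22 = (yr - yl) / (2 * h), (yu - yd) / (2 * h)
    # Cauchy-Green tensor C = F^T F (symmetric 2x2).
    c11 = f11 * f11 + f21 * f21
    c12 = f11 * f12 + f21 * f22
    c22 = f12 * f12 + f22 * f22
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    mean = 0.5 * (c11 + c22)
    lam_max = mean + math.sqrt((0.5 * (c11 - c22)) ** 2 + c12 * c12)
    return math.log(math.sqrt(lam_max)) / abs(T)

# For this saddle flow the stretching rate is 1 everywhere, so sigma is close to 1.
sigma = ftle(0.3, -0.2, T=2.0)
```

Each grid point is evaluated independently of the others, which is what makes the computation data-parallel; in a real flow the four neighbouring trajectories would come from resampling the velocity field rather than from a closed-form map.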
GPU acceleration of time-domain fluorescence lifetime imaging
NASA Astrophysics Data System (ADS)
Wu, Gang; Nowotny, Thomas; Chen, Yu; Li, David Day-Uei
2016-01-01
Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a graphic processing unit (GPU) based FLIM analysis tool suitable for high-speed, flexible time-domain FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. We demonstrate how to implement and optimize FLIM algorithms on GPUs for both iterative and noniterative FLIM analysis algorithms. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision, the GPU analysis can be up to 24-fold faster than its CPU-OpenMP counterpart. This means that even for high-precision but time-consuming iterative FLIM algorithms, GPUs enable fast or even real-time analysis.
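As an illustration of what a noniterative lifetime estimator can look like, the sketch below uses two-gate rapid lifetime determination (RLD) for a mono-exponential decay, where tau = dt / ln(D0 / D1) for two contiguous gates of width dt. The abstract does not say which algorithms were implemented, so the choice of RLD, the helper names `gate_counts` and `rld_lifetime`, and the noise-free synthetic data are all assumptions.

```python
import math

def gate_counts(amplitude, tau, dt):
    """Integrate A*exp(-t/tau) over the gates [0, dt] and [dt, 2*dt] (noise-free, synthetic)."""
    d0 = amplitude * tau * (1.0 - math.exp(-dt / tau))
    d1 = d0 * math.exp(-dt / tau)
    return d0, d1

def rld_lifetime(d0, d1, dt):
    """Two-gate rapid lifetime determination: closed form, no iteration."""
    return dt / math.log(d0 / d1)

# A tiny synthetic "image" of true lifetimes in nanoseconds (hypothetical values).
true_tau = [[2.0, 2.5], [3.0, 1.5]]
gate_width = 2.0

# Every pixel is processed independently, which is what maps well onto GPU threads.
estimated_tau = [
    [rld_lifetime(*gate_counts(1000.0, tau, gate_width), gate_width) for tau in row]
    for row in true_tau
]
```

Iterative estimators, such as least-squares fitting, repeat a per-pixel update many times and so benefit from the same per-pixel independence at a higher arithmetic cost.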
GPU Based Software Correlators - Perspectives for VLBI2010
NASA Technical Reports Server (NTRS)
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
GPU-Accelerated Molecular Modeling Coming Of Age
Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.
2010-01-01
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
Harnessing your GPU for interactive immersive oceanographic modeling
NASA Astrophysics Data System (ADS)
Hermann, A. J.; Moore, C. W.
2011-12-01
We report on recent success using GPU for interactive Lagrangian (fish) and Eulerian (tsunami) modeling of marine systems. Lagrangian analyses based on numerical float tracks are a fundamental tool in hydrodynamic and marine biological modeling. In particular, spatially-explicit individual-based models (IBMs) can be used to explore how changes in coastal circulation affect fish recruitment, and 3D viewing of the results leads to new insights regarding the effects of behavior on spatial path. One limit to the usefulness of this modeling approach is the latency between submission of a run and examination of the results, especially when a large (i.e. statistically meaningful) number of individuals are being tracked through finely resolved current and scalar fields. Since float tracking is an inherently parallel problem, the hundreds of cores available in modern graphics cards (GPU) can readily increase the performance of suitably adapted code by two orders of magnitude at low cost. This offers a way forward to achieve interactive submission/examination of IBMs (and float tracks in general), even on a laptop computer. Latency is an even larger issue in tsunami forecasting, where there is a need to run simple deep-ocean shallow water wave models in real time, particularly during an event when tsunamigenic earthquake events occur outside known fault zones. This problem, too, lends itself to dramatic speedup via GPU, given a suitable parallel algorithm for the shallow water solver. Here we demonstrate successful model speedup using GPU-adapted code for: 1) a spatially explicit IBM prototype, based on pre-stored circulation model output for the Bering Sea; 2) real-time runs of tsunami propagation. In both cases, results will be presented using the stereo-immersive capabilities of the graphics card, for 3D animation.
Quantifying NUMA and Contention Effects in Multi-GPU Systems
Spafford, Kyle L; Meredith, Jeremy S; Vetter, Jeffrey S
2011-01-01
As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies these effects and presents some guidance on programming strategies to maximize performance in multi-GPU environments.
A GPU-COMPUTING APPROACH TO SOLAR STOKES PROFILE INVERSION
Harker, Brian J.; Mighell, Kenneth J.
2012-09-20
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS, employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disk maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel GA with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disk vector magnetograms derived by this method are shown using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT.
GMH: A Message Passing Toolkit for GPU Clusters
Jie Chen, W. Watson, Weizhen Mao
2011-01-01
Driven by the market demand for high-definition 3D graphics, commodity graphics processing units (GPUs) have evolved into highly parallel, multi-threaded, many-core processors, which are ideal for data parallel computing. Many applications have been ported to run on a single GPU with tremendous speedups using general C-style programming languages such as CUDA. However, large applications require multiple GPUs and demand explicit message passing. This paper presents a message passing toolkit, called GMH (GPU Message Handler), on NVIDIA GPUs. This toolkit utilizes a data-parallel thread group as a way to map multiple GPUs on a single host to an MPI rank, and introduces a notion of virtual GPUs as a way to bind a thread to a GPU automatically. This toolkit provides high performance MPI style point-to-point and collective communication, but more importantly, facilitates event-driven APIs to allow an application to be managed and executed by the toolkit at runtime.
Dynamic Load Balancing on Single- and Multi-GPU Systems
Chen, Long; Villa, Oreste; Krishnamoorthy, Sriram; Gao, Guang R.
2010-04-19
The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques supported and employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in existing APIs such as NVIDIA’s CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
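The benefit of pulling tasks from a shared queue over a fixed up-front split can be seen in a small simulation. This is only a CPU-side sketch of the general idea, not the paper's GPU task queue: the function names, the greedy queue policy, and the task costs are assumptions chosen for illustration.

```python
import heapq

def static_makespan(task_costs, n_workers):
    """Split tasks into contiguous equal-sized blocks, one block per worker."""
    block = -(-len(task_costs) // n_workers)  # ceiling division
    return max(sum(task_costs[i:i + block])
               for i in range(0, len(task_costs), block))

def dynamic_makespan(task_costs, n_workers):
    """Each worker pulls the next task from a shared queue as soon as it is idle."""
    free_at = [0] * n_workers            # min-heap of times at which workers become idle
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)   # earliest idle worker takes the task
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Unbalanced workload: two expensive tasks followed by four cheap ones.
costs = [9, 7, 1, 1, 1, 1]
static_time = static_makespan(costs, 2)    # worker 0 gets 9 + 7 + 1, worker 1 gets 1 + 1 + 1
dynamic_time = dynamic_makespan(costs, 2)  # cheap tasks fill in behind the expensive ones
```

With these costs the static split finishes at time 17 and the queue-based schedule at time 10; the finer the task granularity, the closer a dynamic schedule can get to an even division of the total work.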
Bin recycling strategy for improving the histogram precision on GPU
NASA Astrophysics Data System (ADS)
Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.
2016-07-01
A histogram is an easily comprehensible way to present data and analyses. In the current scientific context, with access to large volumes of data, the processing time for building histograms has increased dramatically. For this reason, parallel construction is necessary to alleviate the impact of the processing time on analysis activities. In this scenario, GPU computing is becoming widely used to reduce the processing time of histogram construction to affordable levels. Along with the increase in processing time, implementations come under stress with respect to bin-count accuracy. Accuracy issues arising from the particularities of the implementations are not usually taken into consideration when building histograms with very large data sets. In this work, a bin recycling strategy to create an accuracy-aware implementation for building histograms on the GPU is presented. To evaluate the approach, the strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in cosmology for the study of the large-scale structure of the Universe. As a consequence of the study, a high-accuracy implementation for histogram construction on the GPU is proposed.
SAR wind retrieval: from Singlecore to Multicore and GPU computing
NASA Astrophysics Data System (ADS)
Myasoedov, Alexander; Monzikova, Anna
The large spatial coverage and high resolution of spaceborne synthetic aperture radars (SAR) offer a unique opportunity to derive mesoscale wind fields over the ocean surface, providing high-resolution wind fields near the shore. On the other hand, because of the large size of SAR images, processing them can be a headache in operational tasks or long-period statistical analysis. Algorithms for satellite image processing often offer many possibilities for parallelism (e.g., pixel-by-pixel processing), which makes them good candidates for execution on high-performance parallel computing hardware such as multicore CPUs and modern graphics processing units (GPUs). In this study we implement different SAR wind speed retrieval algorithms (e.g. CMOD4, CMOD5) for single-core and multicore systems, including GPUs. For this purpose both serial and parallelized versions of the CMOD algorithms were written in Matlab, Python, CPython and PyOpenCL. We apply these algorithms to an Envisat ASAR image and compare the results obtained with the different versions of the algorithms executed on both an Intel CPU and a Tesla GPU. As a result of our experiments we not only show a speedup of up to 400 times for the GPU compared to the CPU, but also try to give some advice on how much time and effort was spent writing the same algorithm in different programming languages. We hope that our experience will help other scientists get the most out of GPU/multicore computing.
GPU-based cone-beam reconstruction using wavelet denoising
NASA Astrophysics Data System (ADS)
Jin, Kyungchan; Park, Jungbyung; Park, Jongchul
2012-03-01
The scattering noise artifacts that result from low-dose projections in repeated cone-beam CT (CBCT) scans decrease image quality and lessen the accuracy of diagnosis. To improve the image quality of low-dose CT imaging, statistical filtering is more effective for noise reduction. However, applying image filtering and enhancement throughout the entire reconstruction process can be challenging because of the computing performance it requires. The general reconstruction algorithm for CBCT data is filtered back-projection, which for a volume of 512×512×512 takes up to a few minutes on a standard system. To speed up reconstruction, the massively parallel architecture of current graphics processing units (GPUs) is a suitable platform for accelerating the mathematical calculation. In this paper, we focus on accelerating wavelet denoising and Feldkamp-Davis-Kress (FDK) back-projection using parallel processing on the GPU, and implement CBCT reconstruction on the compute unified device architecture (CUDA) platform. Finally, we evaluate our implementation on clinical tooth data sets. The resulting implementation of wavelet denoising is able to process a 1024×1024 image within 2 ms, excluding the data loading process, and our GPU-based CBCT implementation reconstructs a 512×512×512 volume from 400 projections in less than 1 minute.
Fast, parallel implementation of particle filtering on the GPU architecture
NASA Astrophysics Data System (ADS)
Gelencsér-Horváth, Anna; Tornai, Gábor János; Horváth, András; Cserey, György
2013-12-01
In this paper, we introduce a modified cellular particle filter (CPF) which we mapped on a graphics processing unit (GPU) architecture. We developed this filter adaptation using a state-of-the-art CPF technique. Mapping this filter realization on a highly parallel architecture entailed a shift in the logical representation of the particles. In this process, the original two-dimensional organization is reordered as a one-dimensional ring topology. We proposed a proof-of-concept measurement on two models with an NVIDIA Fermi architecture GPU. This design achieved a 411-μs kernel time per state and a 77-ms global running time for all states for 16,384 particles with a 256 neighbourhood size on a sequence of 24 states for a bearing-only tracking model. For a commonly used benchmark model at the same configuration, we achieved a 266-μs kernel time per state and a 124-ms global running time for all 100 states. Kernel time includes random number generation on the GPU with curand. These results attest to the effective and fast use of the particle filter in high-dimensional, real-time applications.
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU
Arefan, D.; Talebpour, A.; Ahmadinejhad, N.; Kamali Asl, A.
2015-01-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If the image reconstruction time is on the order of seconds, tomosynthesis systems can be used to perform tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for tomosynthesis mammography systems using the graphics processing unit (GPU). First, projections of tomosynthesis mammography were simulated. To produce the tomosynthesis projections, a 3D breast phantom was designed from empirical data. It is based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the graphics processing unit (GPU). The image reconstruction time was measured for both versions (CPU and GPU). PMID:26171373
GPU Lossless Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
Implementation of GPU-Accelerated Back Projection for EPR imaging
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2016-01-01
Electron paramagnetic resonance (EPR) Imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration, however it is not flexible (requires equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a GPU (Graphics Processing Units) having 1152 cores to accelerate the process. The GPU implementation results in acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time sensitive applications. PMID:26410654
GPU accelerated processing of astronomical high frame-rate videosequences
NASA Astrophysics Data System (ADS)
Vítek, Stanislav; Švihlík, Jan; Krasula, Lukáš; Fliegel, Karel; Páta, Petr
2015-09-01
Astronomical instruments located around the world are producing an incredibly large amount of possibly interesting scientific data. Astronomical research is expanding into large and highly sensitive telescopes. The total volume of data per night of operation also increases with the quality and resolution of state-of-the-art CCD/CMOS detectors. Since many ground-based astronomical experiments are placed in remote locations with limited access to the Internet, it is necessary to solve the problem of data storage. This mostly means that current data acquisition, processing and analysis algorithms require review. Decisions about the importance of the data have to be taken in a very short time. This work deals with GPU-accelerated processing of high frame-rate astronomical video sequences, mostly originating from the experiment MAIA (Meteor Automatic Imager and Analyser), an instrument primarily focused on observing faint meteoric events with a high time resolution. The instrument, with a price below 2000 euro, consists of an image intensifier and a gigabit Ethernet camera running at 61 fps. With a resolution better than VGA, the system produces up to 2 TB of scientifically valuable video data per night. The main goal of the paper is not to optimize any GPU algorithm, but to propose and evaluate parallel GPU algorithms able to process huge amounts of video sequences in order to delete all uninteresting data.
Algebraic computations in seismology on GPU-clusters
NASA Astrophysics Data System (ADS)
Meskaranian, Mahjoobeh; Sadeghi, Hossein; Mohammadzaheri, Afsaneh; Toutounian, Faezeh; Navazandeh, Mahdi
2013-04-01
Recent advances in high-performance computing have allowed scientists to increase the speed of scientific computations. One of these advances is the graphics processing unit (GPU), a many-core, multithreaded processor for high-performance computing. Algorithms that can be expressed as data-parallel computations, such as matrix processing, in which a single instruction is executed on multiple data (SIMD), are especially suitable for running on a GPU. We present algorithms for the LSQR (Paige and Saunders, 1982) and LSMR (Fong and Saunders, 2011) methods, executable on GPUs. LSQR and LSMR are iterative methods for solving least squares problems that are usually used for solving inverse problems. These methods are based on Golub and Kahan's bidiagonalization process. LSQR and LSMR give reliable results especially when problems involve large, sparse, ill-conditioned matrices; such matrices can be found in seismic tomography. The most time-consuming operation in these methods is the sparse matrix-vector multiplication (SpMV). For efficient matrix storage as well as SpMV, we use a Compressed Sparse Row (vector) format (Bell and Garland, 2008), which dedicates one warp (32 threads) to each row. The model resolution matrix illustrates how well the estimated model parameters fit the true model parameters. Although some researchers have tried to approximate a generalized inverse for the LSQR method, this method does not explicitly compute the generalized inverse. Therefore it cannot be directly used to calculate the resolution matrix. However, following Yao et al. 2001, it is possible to determine the resolution matrix by running LSQR independently N times. Therefore, we can utilize the Map-Reduce idea in our algorithm for computation of the model resolution matrix on GPU-clusters. The Map-Reduce paradigm was popularized in 2004 by Google's researchers Dean and Ghemawat. Our algorithm is based on the Map-Reduce of Mohammadzaheri, et al. 2012, which consists of two main functions: Map and
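The warp-per-row layout of the CSR (vector) SpMV mentioned in the abstract above can be emulated on the CPU to show how the work is divided. The sketch below is a pure-Python stand-in, not the authors' CUDA kernel: the function name `spmv_csr_vector`, the explicit `partial` array standing in for per-lane registers or shared memory, and the small test matrix are assumptions for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the abstract

def spmv_csr_vector(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form, one emulated warp per row."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):                 # on a GPU: one warp per row, rows in parallel
        start, end = row_ptr[row], row_ptr[row + 1]
        partial = [0.0] * WARP_SIZE           # one partial sum per lane
        for lane in range(WARP_SIZE):         # on a GPU: the 32 lanes run in lockstep
            # Lane k handles nonzeros start+k, start+k+32, ... so that neighbouring
            # lanes read neighbouring entries of values and col_idx.
            for k in range(start + lane, end, WARP_SIZE):
                partial[lane] += values[k] * x[col_idx[k]]
        # Tree reduction of the 32 partial sums down to lane 0.
        stride = WARP_SIZE // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y[row] = partial[0]
    return y

# 3x3 example matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr_vector(row_ptr, col_idx, values, x)
```

Assigning a whole warp to each row pays off when rows hold many nonzeros; for rows with only a handful of entries most lanes sit idle, which is the usual trade-off against a one-thread-per-row layout.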
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
GPU-based Integration with Application in Sensitivity Analysis
NASA Astrophysics Data System (ADS)
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies), whose aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of Environmental modelling and Environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned, qualitatively or quantitatively, to different sources of variation in the input of a model. One of the important classes of methods for Sensitivity Analysis is Monte Carlo based, first proposed by Sobol, and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 10^9 equations at every time-step, and the number of time-steps can be more than a million), which motivates the grid implementation. The MCSAES grid implementation scheme includes two main tasks: (i) Grid implementation of the DEM, (ii) Grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of the Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but also for parallel implementations. Good scrambling is
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU-enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU-enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs to rapidly calculate diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08 ms. The transfer time from CPU to GPU memory is currently a limiting factor in GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach can still lead to processing that is ~3400 times faster than other GPU-based approaches. PMID:24298424
Dynamic shader generation for GPU-based multi-volume ray casting.
Rössler, Friedemann; Botchen, Ralf P; Ertl, Thomas
2008-01-01
Real-time performance for rendering multiple intersecting volumetric objects requires the speed and flexibility of modern GPUs. This requirement has restricted programming of the necessary shaders to GPU experts only. A visualization system that dynamically generates GPU shaders for multi-volume ray casting from a user-definable abstract render graph overcomes this limitation. PMID:18753036
NASA Astrophysics Data System (ADS)
Mu, Dawei; Chen, Po; Wang, Liqiang
2013-02-01
We have successfully ported an arbitrary high-order discontinuous Galerkin (ADER-DG) method for solving the three-dimensional elastic seismic wave equation on unstructured tetrahedral meshes to an Nvidia Tesla C2075 GPU using the Nvidia CUDA programming model. On average our implementation obtained a speedup factor of about 24.3 for the single-precision version of our GPU code and a speedup factor of about 12.8 for the double-precision version of our GPU code when compared with the double precision serial CPU code running on one Intel Xeon W5880 core. When compared with the parallel CPU code running on two, four and eight cores, the speedup factor of our single-precision GPU code is around 12.9, 6.8 and 3.6, respectively. In this article, we give a brief summary of the ADER-DG method, a short introduction to the CUDA programming model and a description of our CUDA implementation and optimization of the ADER-DG method on the GPU. To our knowledge, this is the first study that explores the potential of accelerating the ADER-DG method for seismic wave-propagation simulations using a GPU.
GPU Accelerated Numerical Simulation of Viscous Flow Down a Slope
NASA Astrophysics Data System (ADS)
Gygax, Remo; Räss, Ludovic; Omlin, Samuel; Podladchikov, Yuri; Jaboyedoff, Michel
2014-05-01
Numerical simulations are an effective tool in natural risk analysis. They are useful to determine the propagation and the runout distance of gravity-driven movements such as debris flows or landslides. To evaluate these processes, an approach combining analogue laboratory experiments with a GPU-accelerated numerical simulation of the flow of a viscous liquid down an inclined slope is considered. The physical processes underlying large gravity-driven flows share certain aspects with the propagation of debris mass in a rockslide and the spreading of water waves. Several studies have shown that the numerical implementation of the physical processes of viscous flow produces a good fit with laboratory observations, both quantitatively and qualitatively. Because this process is so well explored, we can concentrate on its numerical transcription and the application of the code in a GPU-accelerated environment to obtain a 3D simulation. The objective of providing a high-resolution numerical solution by NVIDIA-CUDA GPU parallel processing is to increase the speed of the simulation and the accuracy of the prediction. The main goal is to write code that is easily adaptable and as short as possible on the widely used platform MATLAB, which will be translated to C-CUDA to achieve higher resolution and processing speed while running on an NVIDIA graphics card cluster. The numerical model, based on the finite difference scheme, is compared to analogue laboratory experiments. In this way our numerical model parameters are adjusted to reproduce the effective movements observed by high-speed camera acquisitions during the laboratory experiments.
Font rendering on a GPU-based raster image processor
NASA Astrophysics Data System (ADS)
Recker, John L.; Beretta, Giordano B.; Lin, I.-Jong
2010-01-01
Historically, in the 35 years of digital printing research, raster image processing has always lagged behind marking engine technology, i.e., we have never been able to deliver rendered digital pages as fast as digital print engines can consume them. This trend has resulted in products based on throttled digital printers or expensive raster image processors (RIP) with hardware acceleration. The current trend in computer software architecture is to leverage graphic processing units (GPU) for computing tasks whenever appropriate. We discuss the issues for rendering fonts on such an architecture and present an implementation.
Engineering a fully GPU-accelerated H.264 encoder
NASA Astrophysics Data System (ADS)
Li, Bowei; Deng, Yangdong S.
2013-07-01
H.264/AVC is the most popular video coding standard and plays an essential role in today's Internet-based content-delivery businesses. H.264's encoding process is highly computationally expensive due to the integration of complex video coding techniques. As a result, transcoding has become a bottleneck of content-hosting services. Recently, general-purpose computing on graphics processing units (GPUs) has been rapidly rising as a popular computing model to expedite time-consuming applications. In this paper, we propose a fully GPU-accelerated H.264 encoder. Experimental results show that a 100% speed-up ratio can be achieved.
GPU-accelerated interactive visualization and planning of neurosurgical interventions.
Rincón-Nigro, Mario; Navkar, Nikhil V; Tsekos, Nikolaos V; Zhigang Deng
2014-01-01
Advances in computational methods and hardware platforms provide efficient processing of medical-imaging datasets for surgical planning. For neurosurgical interventions employing a straight access path, planning entails selecting a path from the scalp to the target area that poses minimal risk to the patient. A proposed GPU-accelerated method enables interactive quantitative estimation of the risk for a particular path. It exploits spatial acceleration data structures and efficient implementation of algorithms on GPUs. In evaluations of its computational efficiency and scalability, it achieved interactive rates even for high-resolution meshes. A user study and feedback from neurosurgeons identified the method's potential benefits for preoperative planning and intraoperative replanning. PMID:24808165
The BRUSH algorithm for two-electron integrals on GPU
NASA Astrophysics Data System (ADS)
Rák, Ádám; Cserey, György
2015-02-01
This Letter presents a new algorithmic method developed to evaluate two-electron repulsion integrals based on contracted Gaussian basis functions in a parallel way. This new algorithm scheme provides distinct SIMD (single instruction multiple data) optimized paths that symbolically transform integral parameters into target integral algorithms. Our measurements indicate that the method gives a significant improvement over the CPU-friendly PRISM algorithm. The benchmark tests (evaluation of more than 10^8 integrals using the STO-3G basis set) of our GPU (NVIDIA GTX 780) implementation showed up to 750-fold speedup compared to a single core of an Athlon II X4 635 CPU.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications
2012-01-01
Background: Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. Results: In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with special methods for CPU and GPU architecture. Conclusions: A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications. PMID:22369626
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-01-01
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time image processing. Therefore, high performance computing (HPC) methods, especially GPU-based methods, have been presented to accelerate SAR imaging. In the classical GPU-based imaging algorithm, the GPU is employed to accelerate image processing by massively parallel computing, and the CPU is only used to perform auxiliary work such as data input/output (IO). However, the computing capability of the CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPUs/GPUs is proposed to achieve real-time SAR imaging. Through the proposed task partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. For CPU parallel imaging, the advanced vector extension (AVX) method is introduced for the first time into the multi-core CPU parallel method for higher efficiency. For GPU parallel imaging, not only are the bottlenecks of memory limitation and frequent data transfers broken, but several optimization strategies are also applied, such as streaming and parallel pipelining. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method improves the efficiency of SAR imaging by 270 times over a single-core CPU and achieves real-time imaging, in that the imaging rate exceeds the raw data generation rate. PMID:27070606
Accelerating universal Kriging interpolation algorithm using CUDA-enabled GPU
NASA Astrophysics Data System (ADS)
Cheng, Tangpei
2013-04-01
Kriging algorithms are a group of important interpolation methods, which are very useful in many geological applications. However, the algorithm based on traditional general-purpose processors can be computationally expensive, especially when the problem scale expands. Inspired by the current trend in graphics processing technology, we proposed an efficient parallel scheme to accelerate the universal Kriging algorithm on the NVIDIA CUDA platform. Some high-performance mathematical functions have been introduced to calculate the compute-intensive steps in the Kriging algorithm, such as matrix-vector multiplication and matrix-matrix multiplication. To further optimize performance, we reduced the memory transfer overhead by restructuring the time-consuming loops specifically for execution on the GPU. In the numerical experiment, we compared the performance of different multi-core CPU and GPU implementations when interpolating a geological site. The improved CUDA implementation shows a nearly 18× speedup with respect to the sequential program, is 6.32 times faster than the OpenMP-based version running on an Intel Xeon E5320 quad-core CPU, and scales well with the size of the system.
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of an unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory. The results presented herein indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
Real-time ultrasound simulation using the GPU.
Gjerald, Sjur Urdson; Brekken, Reidar; Hergum, Torbjørn; D'hooge, Jan
2012-05-01
Ultrasound simulators can be used for training ultrasound image acquisition and interpretation. In such simulators, synthetic ultrasound images must be generated in real time. Anatomy can be modeled by computed tomography (CT). Shadows can be calculated by combining reflection coefficients and depth dependent, exponential attenuation. To include speckle, a pre-calculated texture map is typically added. Dynamic objects must be simulated separately. We propose to increase the speckle realism and allow for dynamic objects by using a physical model of the underlying scattering process. The model is based on convolution of the point spread function (PSF) of the ultrasound scanner with a scatterer distribution. The challenge is that the typical field-of-view contains millions of scatterers which must be selected by a virtual probe from an even larger body of scatterers. The main idea of this paper is to select and sample scatterers in parallel on the graphic processing unit (GPU). The method was used to image a cyst phantom and a movable needle. Speckle images were produced in real time (more than 10 frames per second) on a standard GPU. The ultrasound images were visually similar to images calculated by a reference method. PMID:22622973
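The convolution model named in the abstract above can be illustrated with a minimal one-dimensional sketch. It is not the authors' code: the Gaussian-windowed cosine used as the point spread function, the function names `gaussian_pulse_psf` and `convolve_same`, and all parameter values are assumptions chosen only to show the idea. Each output sample is computed independently of the others, which is the property that makes this kind of computation a natural fit for parallel execution.

```python
import math


def gaussian_pulse_psf(half_width, sigma, cycles_per_sample):
    """Hypothetical 1D axial PSF: a Gaussian-windowed cosine pulse."""
    return [
        math.exp(-0.5 * (k / sigma) ** 2)
        * math.cos(2.0 * math.pi * cycles_per_sample * k)
        for k in range(-half_width, half_width + 1)
    ]


def convolve_same(signal, kernel):
    """Direct convolution; the output has the same length as `signal`."""
    half = len(kernel) // 2
    out = [0.0] * len(signal)
    for i in range(len(signal)):  # every output sample is independent
        acc = 0.0
        for j, weight in enumerate(kernel):
            src = i - (j - half)  # signal index paired with kernel offset j - half
            if 0 <= src < len(signal):
                acc += weight * signal[src]
        out[i] = acc
    return out


# Sparse scatterer distribution along one scan line (illustrative values).
scatterers = [0.0] * 64
scatterers[20] = 1.0
scatterers[40] = -0.5

psf = gaussian_pulse_psf(half_width=8, sigma=3.0, cycles_per_sample=0.2)
rf_line = convolve_same(scatterers, psf)
```

Because the two scatterers are farther apart than the PSF support, each one simply reproduces a scaled copy of the pulse around its own position.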
Fast-coding robust motion estimation model in a GPU
NASA Astrophysics Data System (ADS)
García, Carlos; Botella, Guillermo; de Sande, Francisco; Prieto-Matias, Manuel
2015-02-01
Nowadays vision systems are used for countless purposes. Motion estimation, in turn, is a discipline that allows relevant information to be extracted, such as pattern segmentation, 3D structure, or object tracking. However, the real-time requirements of most applications have limited its adoption, since high-performance systems are needed to meet response times. With the emergence of highly parallel devices known as accelerators, this gap has narrowed. Two extreme endpoints in the spectrum of the most common accelerators are Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), which usually offer higher performance rates than general-purpose processors. Moreover, the use of GPUs as accelerators requires the efficient exploitation of any parallelism in the target application. This task is not easy because performance rates are affected by many aspects that programmers must overcome. In this paper, we evaluate the OpenACC standard, a directive-based programming model that eases porting code to a GPU, in the context of a motion estimation application. The results confirm that this programming paradigm is suitable for such image processing applications, achieving a very satisfactory acceleration in convolution-based problems such as the well-known Lucas & Kanade method.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the more recent techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategic technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, and therefore their execution has been a major challenge. Despite the huge computational cost, molecular dynamics has been implemented using traditional computers with a central processing unit (CPU). Graphics processing unit (GPU) technology was first designed to improve video games, by rapidly creating images in a frame buffer for display on screens. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology for performing a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe the new methodologies developed initially for video games and how they are now applied in MD simulations. PMID:27525251
Large Data Visualization on Distributed Memory Multi-GPU Clusters
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data that is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory and, more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
GPU-accelerated visualization of protein dynamics in ribbon mode
NASA Astrophysics Data System (ADS)
Wahle, Manuel; Birmanns, Stefan
2011-01-01
Proteins are biomolecules present in living organisms and essential for carrying out vital functions. Inherent to their functioning is folding into different spatial conformations, and to understand these processes, it is crucial to visually explore the structural changes. In recent years, significant advancements in experimental techniques and novel algorithms for post-processing of protein data have routinely revealed static and dynamic structures of increasing sizes. In turn, interactive visualization of the systems and their transitions became more challenging. Therefore, much research on the efficient display of protein dynamics has been done, with a focus on space-filling models, but for the important class of abstract ribbon or cartoon representations, only a few methods exist for efficient rendering. Yet these models are of high interest to scientists, as they provide a compact and concise description of the structure elements along the protein main chain. In this work, a method was developed to speed up ribbon and cartoon visualizations. Separating the calculation of geometry into two phases allows computational work to be offloaded from the CPU to the GPU. The first phase consists of computing a smooth curve along the protein's main chain on the CPU. In the second phase, conducted independently by the GPU, vertices along that curve are moved to set up the final geometrical representation of the molecule.
Application of GPU processing for Brownian particle simulation
NASA Astrophysics Data System (ADS)
Cheng, Way Lee; Sheharyar, Ali; Sadr, Reza; Bouhali, Othmane
2015-01-01
Reports on the anomalous thermal-fluid properties of nanofluids (dilute suspension of nano-particles in a base fluid) have been the subject of attention for 15 years. The underlying physics that govern nanofluid behavior, however, is not fully understood and is a subject of much dispute. The interactions between the suspended particles and the base fluid have been cited as a major contributor to the improvement in heat transfer reported in the literature. Numerical simulations are instrumental in studying the behavior of nanofluids. However, such simulations can be computationally intensive due to the small dimensions and complexity of these problems. In this study, a simplified computational approach for isothermal nanofluid simulations was applied, and simulations were conducted using both conventional CPU and parallel GPU implementations. The GPU implementations significantly improved the computational performance, in terms of the simulation time, by a factor of 1000-2500. The results of this investigation show that, as the computational load increases, the simulation efficiency approaches a constant. At a very high computational load, the amount of improvement may even decrease due to limited system memory.
Efficient GPU Acceleration for Integrating Large Thermonuclear Networks in Astrophysics
NASA Astrophysics Data System (ADS)
Guidry, Mike
2016-02-01
We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. We take as representative test cases Type Ia supernova explosions with extremely stiff thermonuclear reaction networks having 150-365 isotopic species and 1600-4400 reactions, assumed coupled to hydrodynamics using operator splitting. In such examples we demonstrate the capability to integrate independent thermonuclear networks from ~250-500 hydro zones (assumed to be deployed on CPU cores) in parallel on a single GPU in the same wall clock time that standard implicit methods can integrate the network for a single zone. This two or more orders of magnitude increase in efficiency for solving systems of realistic thermonuclear networks coupled to fluid dynamics implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications I will discuss our ongoing deployment of these new methods for Type Ia supernova explosions in astrophysics and for simulation of the complex atmospheric chemistry entering into weather and climate problems.
Accelerating sub-pixel marker segmentation using GPU
NASA Astrophysics Data System (ADS)
Handel, Holger
2009-02-01
Sub-pixel accurate marker segmentation is an important task for many computer vision systems. The 3D-positions of markers are used in control loops to determine the position of machine tools or robot end-effectors. Accurate segmentation of the marker position in the image plane is crucial for accurate reconstruction. Many subpixel segmentation algorithms are computationally intensive, especially when the number of markers increases. Modern graphics hardware with its massively parallel architecture provides a powerful tool for many image segmentation tasks. Especially, the time consuming sub-pixel refinement steps in marker segmentation can benefit from the recent progress. This article presents an implementation of a sub-pixel marker segmentation framework using the GPU to accelerate the processing time. The image segmentation chain consists of two stages. The first is a pre-processing stage which segments the initial position of the marker with pixel accuracy, the second stage refines the initial marker position to sub-pixel accuracy. Both stages are implemented as shader programs on the GPU. The flexible architecture allows it to combine different pre-processing and sub-pixel refinement algorithms. Experimental results show that significant speed-up can be achieved compared to CPU implementations, especially when the number of markers increases.
Electromagnetic metamaterial simulations using a GPU-accelerated FDTD method
NASA Astrophysics Data System (ADS)
Seok, Myung-Su; Lee, Min-Gon; Yoo, SeokJae; Park, Q.-Han
2015-12-01
Metamaterials composed of artificial subwavelength structures exhibit extraordinary properties that cannot be found in nature. Designing artificial structures having exceptional properties plays a pivotal role in current metamaterial research. We present a new numerical simulation scheme for metamaterial research. The scheme is based on a graphic processing unit (GPU)-accelerated finite-difference time-domain (FDTD) method. The FDTD computation can be significantly accelerated when GPUs are used instead of only central processing units (CPUs). We explain how the fast FDTD simulation of large-scale metamaterials can be achieved through communication optimization in a heterogeneous CPU/GPU-based computer cluster. Our method also includes various advanced FDTD techniques: the non-uniform grid technique, the total-field/scattered-field (TFSF) technique, the auxiliary field technique for dispersive materials, the running discrete Fourier transform, and the complex structure setting. We demonstrate the power of our new FDTD simulation scheme by simulating the negative refraction of light in a coaxial waveguide metamaterial.
High-throughput GPU-based LDPC decoding
NASA Astrophysics Data System (ADS)
Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin
2010-08-01
Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, Wi-Fi and 10GBASE-T. The need for reliable and flexible communication links across a wide variety of communication standards and configurations has inspired demand for high-performance, flexible computing. Accordingly, finding a fast and reconfigurable development platform for designing a high-throughput LDPC decoder has become important, especially for rapidly changing communication standards and configurations. In this paper, a new graphics processing unit (GPU) LDPC decoding platform with asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved a 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
Commodity CPU-GPU System for Low-Cost, High-Performance Computing
NASA Astrophysics Data System (ADS)
Wang, S.; Zhang, S.; Weiss, R. M.; Barnett, G. A.; Yuen, D. A.
2009-12-01
We have put together a desktop computer system for under 2.5 K dollars from commodity components that consist of one quad-core CPU (Intel Core 2 Quad Q6600 Kentsfield 2.4GHz) and two high end GPUs (nVidia's GeForce GTX 295 and Tesla C1060). A 1200 watt power supply is required. On this commodity system, we have constructed an easy-to-use hybrid computing environment, in which Message Passing Interface (MPI) is used for managing the working loads, for transferring the data among different GPU devices, and for minimizing the need of CPU’s memory. The test runs using the MAGMA (Matrix Algebra on GPU and Multicore Architectures) library show that the speed ups for double precision calculations can be greater than 10 (GPU vs. CPU) and they are bigger (> 20) for single precision calculations. In addition we have enabled the combination of Matlab with CUDA for interactive visualization through MPI, i.e., two GPU devices are used for simulation and one GPU device is used for visualizing the computing results as the simulation goes. Our experience with this commodity system has shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on CPUs. With NVIDIA CEO Jen-Hsun Huang's claim that over the next 6 years GPU processing power will increase by 570x compared to the 3x for CPUs, future low-cost commodity computers such as ours may be a remedy for the long wait queues of the world's supercomputers, especially for small- and mid-scale computation. Our goal here is to explore the limits and capabilities of this emerging technology and to get ourselves ready to run large-scale simulations on the next generation of computing environment, which we believe will hybridize CPU and GPU architectures.
Targeting Atmospheric Simulation Algorithms for Large Distributed Memory GPU Accelerated Computers
Norman, Matthew R
2013-01-01
Computing platforms are increasingly moving to accelerated architectures, and here we deal particularly with GPUs. In [15], a method was developed for atmospheric simulation to improve efficiency on large distributed memory machines by reducing communication demand and increasing the time step. Here, we improve upon this method to further target GPU accelerated platforms by reducing GPU memory accesses, removing a synchronization point, and better clustering computations. The modification ran over two times faster in some cases even though more computations were required, demonstrating the merit of improving memory handling on the GPU. Furthermore, we discover that the modification also has a near 100% hit rate in fast on-chip L1 cache and discuss the reasons for this. In concluding, we remark on further potential improvements to GPU efficiency.
Papadopoulos, Agathoklis; Kostoglou, Kyriaki; Mitsis, Georgios D; Theocharides, Theocharis
2015-01-01
The use of a GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in biomedical engineering and biology-related applications has shown promising results. GPU acceleration can be used to speed up computation-intensive models, such as the mathematical modeling of biological systems, which often requires the use of nonlinear modeling approaches with a large number of free parameters. In this context, we developed a CUDA-enabled version of a model that implements a nonlinear identification approach combining basis expansions and polynomial-type networks, termed Laguerre-Volterra networks, which can be used in diverse biological applications. The proposed software implementation uses the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the aforementioned modeling approach to execute the calculations on the GPU card of the host computer system. The initial results of the GPU-based model presented in this work show performance improvements over the original MATLAB model. PMID:26736993
Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines
Jia, Weile; University of Chinese Academy of Sciences, Beijing ; Fu, Jiyun; University of Chinese Academy of Sciences, Beijing ; Cao, Zongyan; Wang, Long; Chi, Xuebin; Gao, Weiguo; MOE Key Laboratory of Computational Physical Sciences, Fudan University, Shanghai ; Wang, Lin-Wang
2013-10-15
Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most widely used method for material simulations, but its absolute speed has stagnated due to the inability to use large-scale CPU-based computers. By a drastic redesign of the algorithm, and by moving all the major computation parts onto the GPU, we have reached a speed of 12 s per molecular dynamics (MD) step for a 512 atom system using 256 GPU cards. This is about 20 times faster than the CPU version of the code regardless of the number of CPU cores used. Our tests and analysis on different GPU platforms and configurations shed light on the optimal GPU deployments for PWP-DFT calculations. An 1800 step MD simulation is used to study the liquid phase properties of GaInP.
Numerically Tracking Contact Discontinuities with an Introduction for GPU Programming
Davis, Sean L
2012-08-17
We review some of the classic numerical techniques used to analyze contact discontinuities and compare their effectiveness. Several finite difference methods (the Lax-Wendroff method, a Multidimensional Positive Definite Advection Transport Algorithm (MPDATA) method and a Monotone Upstream Scheme for Conservation Laws (MUSCL) scheme with an Artificial Compression Method (ACM)) as well as the finite element Streamlined Upwind Petrov-Galerkin (SUPG) method were considered. These methods were applied to solve the 2D advection equation. Based on our results we concluded that the MUSCL scheme produces the sharpest interfaces but can inappropriately steepen the solution. The SUPG method seems to represent a good balance between stability and interface sharpness without any inappropriate steepening. However, for solutions with discontinuities, the MUSCL scheme is superior. In addition, a preliminary implementation in a GPU program is discussed.
Singular value decomposition for collaborative filtering on a GPU
NASA Astrophysics Data System (ADS)
Kato, Kimikazu; Hosino, Tikara
2010-06-01
Collaborative filtering predicts customers' unknown preferences from known preferences. In computing collaborative filtering, a singular value decomposition (SVD) is needed to reduce the size of a large-scale matrix so that the burden of the next computation phase is decreased. In this application, SVD means a roughly approximated factorization of a given matrix into smaller matrices. Webb (a.k.a. Simon Funk) presented an effective algorithm to compute the SVD as an approach to an open competition called the "Netflix Prize". The algorithm uses an iterative method so that the approximation error improves in each step of the iteration. We give a GPU version of Webb's algorithm. Our algorithm is implemented in CUDA and is shown to be efficient by an experiment.
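The iterative factorization described above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the authors' CUDA code: it updates all latent features together by stochastic gradient descent (the original Funk procedure is usually described as training one feature at a time), and the names funk_sgd and rmse, the learning rate, the regularization constant, and the toy ratings are all hypothetical.

```python
import random


def funk_sgd(ratings, n_users, n_items, rank=2, lr=0.01, reg=0.02,
             epochs=300, seed=0):
    """Fit factor matrices P (users) and Q (items) to known ratings.

    ratings is a list of (user, item, rating) triples; only these known
    entries drive the updates, so unknown preferences stay free.
    """
    rng = random.Random(seed)
    # Small positive initial factors (an assumption made for this sketch).
    P = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction error on one known rating.
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step with L2 regularization on both factors.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Toy data: 4 users, 4 items, 10 known ratings (hypothetical values).
toy = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 2, 1),
       (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4), (3, 3, 4)]
P0, Q0 = funk_sgd(toy, 4, 4, epochs=0)   # untrained factors
P1, Q1 = funk_sgd(toy, 4, 4)             # trained factors
```

Comparing rmse(toy, P0, Q0) with rmse(toy, P1, Q1) shows the approximation error shrinking over the iterations. Each update touches only one row of P and one row of Q, which is one reason this kind of loop is a candidate for parallel execution.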
GPU-computing in econophysics and statistical physics
NASA Astrophysics Data System (ADS)
Preis, T.
2011-03-01
A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples. In particular, computationally expensive analyses employed in a financial market context are coded on a graphics card architecture, which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
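To illustrate why the Ising model maps well onto parallel hardware, the sketch below performs Metropolis sweeps on a 2D lattice in checkerboard order. The abstract does not say which update scheme was used, so the checkerboard ordering is an assumption, as are the function name checkerboard_sweep and its parameters; the code is sequential Python, not GPU code.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng, J=1.0):
    """One Metropolis sweep over an n x n periodic lattice (n even).

    Sites are visited in two passes by checkerboard color. Sites of one
    color have neighbors only of the other color, so within a pass every
    update is independent of the others and could run in parallel.
    """
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                neighbors = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                             + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                # Energy change if spin (i, j) is flipped.
                dE = 2.0 * J * spins[i][j] * neighbors
                if dE <= 0 or rng.random() < math.exp(-beta * dE):
                    spins[i][j] = -spins[i][j]


rng = random.Random(1)

# Very low temperature (large beta): an ordered lattice stays ordered.
cold = [[1] * 4 for _ in range(4)]
for _ in range(5):
    checkerboard_sweep(cold, beta=100.0, rng=rng)

# Infinite temperature (beta = 0): every proposed flip is accepted.
hot = [[1] * 4 for _ in range(4)]
checkerboard_sweep(hot, beta=0.0, rng=rng)
```

The two limiting cases are deterministic, which makes them convenient sanity checks: cold remains all +1, and hot becomes all -1 after a single sweep.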
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike W.
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.
2007-01-08
The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on an accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
Explicit integration with GPU acceleration for large kinetic networks
NASA Astrophysics Data System (ADS)
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike
2015-12-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
GPU-accelerated Monte Carlo convolution∕superposition implementation for dose calculation
Zhou, Bo; Yu, Cedric X.; Chen, Danny Z.; Hu, X. Sharon
2010-01-01
Purpose: Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution∕superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution∕superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Methods: Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors’ GPU-based implementation on an Nvidia GTX260 card is compared to a multithreaded software implementation on a quad-core system. Results: A speedup in the range of 6.7–11.4× is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the accuracy of the authors’ GPU-based implementation is in good agreement with the results from the quad-core CPU implementation. Conclusions: This work shows that GPU is a feasible and cost-efficient solution compared to other alternatives such as using cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. But there are also inherent limitations of using GPU for accelerating MC-type applications, which are also analyzed in detail in this article. PMID:21158271
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing
NASA Astrophysics Data System (ADS)
Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.
2014-12-01
Since the Great East Japan Earthquake of March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous as high-accuracy observation networks have progressed; many parameters (e.g., positional information, origin time, magnitude, etc.) must be handled to display the seismic information efficiently. High-speed processing of data and image information is therefore necessary to handle enormous amounts of seismic data. Recently, the GPU (Graphics Processing Unit) has been used as an acceleration tool for data processing and calculation in various fields of study. This movement is called GPGPU (General-Purpose computing on GPUs). In the last few years, GPU performance has kept improving rapidly. GPU computing provides a high-performance computing environment at a lower cost than before. Moreover, using the GPU is advantageous for visualizing processed data, because the GPU was originally an architecture for graphics processing. In GPU computing, the processed data is always stored in the video memory. Therefore, we can write drawing information directly to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize a full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer, which enables high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and the feasibility of a high-speed visualization system for hypocenter data.
2010-01-01
Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a starting point for modelers to
Efficient simulation of diffusion-based choice RT models on CPU and GPU.
Verdonck, Stijn; Meers, Kristof; Tuerlinckx, Francis
2016-03-01
In this paper, we present software for the efficient simulation of a broad class of linear and nonlinear diffusion models for choice RT, using either CPU or graphical processing unit (GPU) technology. The software is readily accessible from the popular scripting languages MATLAB and R (both 64-bit). The speed obtained on a single high-end GPU is comparable to that of a small CPU cluster, bringing standard statistical inference of complex diffusion models to the desktop platform. PMID:25761391
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, the resolution of hyperspectral images has improved greatly, and the data size of hyperspectral images is growing accordingly. To reduce their storage and transmission cost, lossless compression of hyperspectral images has become an important research topic. In recent years, a large number of algorithms have been proposed to reduce the redundancy between different spectra. Among them, the most classical and extensible algorithm is the Clustered Differential Pulse Code Modulation (C-DPCM) algorithm. This algorithm contains three parts: first, it clusters all spectral lines and trains linear predictors for each band; second, it uses these predictors to predict pixels and obtains the residual image by subtracting the predicted image from the original image; finally, it encodes the residual image. However, the process of calculating predictors is time-consuming. To improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has developed greatly. The capacity of GPUs improves rapidly through increases in the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to compute the predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally obtain a decent speedup.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. PMID:25805449
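The sliding-window computation at the heart of this kind of analysis can be sketched as below. The abstract does not specify the connectivity measure, so Pearson correlation is an assumption here, as are the function names pearson and sliding_window_correlation and the toy time-courses; the code is a sequential reference, not the OpenMP or CUDA implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def sliding_window_correlation(x, y, window, step=1):
    """Correlation of x and y inside each window along the time axis.

    Every window is independent of the others, so windows (and region
    pairs) are natural units of parallel work.
    """
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


# Two toy time-courses: y tracks x at first, then moves against it.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 1, 2, 3, 3, 2, 1, 0]
dfc = sliding_window_correlation(x, y, window=4)
```

With these toy inputs, dfc has five entries that move from full positive correlation in the first window to full negative correlation in the last, which is the kind of time-varying coupling the method is meant to expose.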
Three Dimensional TEM Forward Modeling Using FDTD Accelerated by GPU
NASA Astrophysics Data System (ADS)
Li, Z.; Huang, Q.
2015-12-01
Three-dimensional inversion of transient electromagnetic (TEM) data is still challenging. The inversion speed depends mostly on the forward modeling. The finite-difference time-domain (FDTD) method is one of the popular forward modeling schemes. In the explicit form, which is based on the Du Fort-Frankel scheme, the time step is constrained by the quasi-static approximation. Often an upward-continuation boundary condition (UCBC) is applied at the earth-air surface to avoid time stepping in the model air. However, UCBC is not suitable for models with topography and has a low parallel efficiency. Modeling without UCBC may require a much smaller time step because of the resistive nature of the air and the quasi-static constraint, which may also lower the efficiency greatly. Our recent research shows that the time step in the model air does not need to be constrained by the quasi-static approximation, which brings the time step without UCBC much closer to that with UCBC. The parallel performance of FDTD is then largely released. On a computer with a 4-core CPU, this newly developed method is clearly faster than the method using UCBC. Moreover, without UCBC, this method can be easily accelerated by a Graphics Processing Unit (GPU). On a computer with a 4790K CPU at 4.4 GHz and a GTX 970 GPU, the CUDA-accelerated version is almost 10 times as fast as the CPU-only version. For a model with a grid size of 140×140×130, if the conductivity of the model earth is 0.02 S/m and the minimal space interval is 15 m, it takes only 80 seconds to evolve the field from excitation to 0.032 s.
High quality GPU rendering with displaced pixel shading
NASA Astrophysics Data System (ADS)
Zhang, Hui; Choi, Jae
2006-03-01
Direct volume rendering via consumer PC hardware has become an efficient tool for volume visualization. In particular, the volumetric ray casting, the most frequently used volume rendering technique, can be implemented by the shading language integrated with graphical processing units (GPU). However, to produce high-quality images offered by GPU-based volume rendering, a higher sampling rate is usually required. In this paper, we present an algorithm to generate high quality images with a small number of slices by utilizing displaced pixel shading technique. Instead of sampling points along a ray with the regular interval, the actual surface location is calculated by the linear interpolation between the outer and inner points, and this location is used as the displaced pixel for the iso-surface illumination. Multi-pass and early Z-culling techniques are applied to improve the rendering speed. The first pass simply locates and stores the exact surface depth of each ray using a few pixel instructions; then, the second pass uses instructions to shade the surface at the previous position. A new 3D edge detector from our previous research is integrated to provide more realistic rendering results compared with the widely used gradient normal estimator. To implement our algorithm, we have made a program named DirectView based on DirectX 9.0c and Microsoft High Level Shading Language (HLSL) for volume rendering. We tested two data sets and discovered that our algorithm can generate smoother and more accurate shading images with a small number of intermediate slices.
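The surface-location step described above (interpolating between the last sample outside the surface and the first sample inside it) can be sketched as follows. This is an illustrative CPU version in Python, not the authors' HLSL shader; the function names surface_depth and first_crossing, the uniform sampling step, and the assumption that values rise from outside to inside are all hypothetical.

```python
def surface_depth(t_outer, v_outer, t_inner, v_inner, iso):
    """Ray parameter at which the sampled value crosses the iso-value.

    t_outer/t_inner are ray parameters of the two bracketing samples and
    v_outer/v_inner the volume values there; the crossing is estimated by
    linear interpolation between them.
    """
    if v_inner == v_outer:
        return t_outer  # degenerate bracket: no slope to interpolate on
    w = (iso - v_outer) / (v_inner - v_outer)
    return t_outer + w * (t_inner - t_outer)


def first_crossing(samples, step, iso):
    """Scan coarse samples along a ray and refine the first crossing.

    samples[k] is the value at ray parameter k * step. Returns the
    interpolated depth, or None if the ray never reaches the iso-value.
    """
    for k in range(1, len(samples)):
        if samples[k - 1] < iso <= samples[k]:
            return surface_depth((k - 1) * step, samples[k - 1],
                                 k * step, samples[k], iso)
    return None


# Coarse samples along one ray; the iso-value 0.5 lies between 0.2 and 0.6.
depth = first_crossing([0.0, 0.2, 0.6, 0.9], step=1.0, iso=0.5)
miss = first_crossing([0.0, 0.1], step=1.0, iso=0.5)
```

Here depth lands three quarters of the way between the second and third samples, so the shaded position is more precise than the coarse sampling interval, which is the reason a small number of slices can still give a smooth surface.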
High energy electromagnetic particle transportation on the GPU
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-01-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum
2015-01-15
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Suchard, Marc A.; Wang, Quanli; Chan, Cliburn; Frelinger, Jacob; Cron, Andrew; West, Mike
2010-01-01
This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies generating many, very large datasets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithm and software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. PMID:20877443
Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)
NASA Astrophysics Data System (ADS)
Hagiwara, K.; Kanzaki, J.; Okamura, N.; Rainwater, D.; Stelzer, T.
2010-11-01
We use a graphics processing unit (GPU) for fast calculations of helicity amplitudes of quark and gluon scattering processes in massless QCD. New HEGET (HELAS Evaluation with GPU Enhanced Technology) codes for gluon self-interactions are introduced, and a C++ program to convert the MadGraph generated FORTRAN codes into HEGET codes in CUDA (a C-platform for general purpose computing on GPU) is created. Because of the proliferation of the number of Feynman diagrams and the number of independent color amplitudes, the maximum number of final state jets we can evaluate on a GPU is limited to 4 for pure gluon processes (gg→4g), or 5 for processes with one or more quark lines such as qq̄→5g and qq→qq+3g. Compared with the usual CPU-based programs, we obtain 60-100 times better performance on the GPU, except for 5-jet production processes and the gg→4g processes for which the GPU gain over the CPU is about 20.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate and to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Real-time GPU surface curvature estimation on deforming meshes and volumetric data sets.
Griffin, Wesley; Wang, Yu; Berrios, David; Olano, Marc
2012-10-01
Surface curvature is used in a number of areas in computer graphics, including texture synthesis and shape representation, mesh simplification, surface modeling, and nonphotorealistic line drawing. Most real-time applications must estimate curvature on a triangular mesh. This estimation has been limited to CPU algorithms, forcing object geometry to reside in main memory. However, as more computational work is done directly on the GPU, it is increasingly common for object geometry to exist only in GPU memory. Examples include vertex skinned animations and isosurfaces from GPU-based surface reconstruction algorithms. For static models, curvature can be precomputed and CPU algorithms are a reasonable choice. For deforming models where the geometry only resides on the GPU, transferring the deformed mesh back to the CPU limits performance. We introduce a GPU algorithm for estimating curvature in real time on arbitrary triangular meshes. We demonstrate our algorithm with curvature-based NPR feature lines and a curvature-based approximation for an ambient occlusion. We show curvature computation on volumetric data sets with a GPU isosurface extraction algorithm and vertex-skinned animations. We present a graphics pipeline and CUDA implementation. Our curvature estimation is up to ~18x faster than a multithreaded CPU benchmark. PMID:22508906
Compilation of atomic and molecular data from the UV to the near-IR for use in spectral synthesis
NASA Astrophysics Data System (ADS)
Coelho, P.; Barbuy, B.; Melendez, J.; Allen, D. M.; Castilho, B.
2003-08-01
Synthetic spectra are useful in a wide variety of applications, from abundance analysis in high-resolution stellar spectra to the study of stellar populations in integrated spectra. The reliability of a synthetic spectrum depends on the adopted model atmosphere, on the line-formation code, and on the quality of the atomic and molecular data, which are decisive in computing the opacities of the photosphere. Our group at the Astronomy Department of IAG has used synthetic spectra for more than 15 years, in applications aimed mainly at the abundance analysis of G, K, and M stars and old stellar populations. Over this time, the line lists have been built and continuously updated, and some recent additions can be cited: Castilho (1999, atoms and molecules in the UV), Schiavon (1998, TiO molecular bands), and Melendez (2001, atoms and molecules in the near-IR). In order to compute a grid of spectra from the UV to the near-IR for use in the study of old stellar populations, it became necessary to compile and homogenize the various lists into a single atomic list and a single molecular list. In this process, the new compiled list was cross-matched against other databases (NIST, Kurucz Database, O'Brian et al. 1991) to update the parameters that characterize the atomic transition (wavelength, log gf, and excitation potential). In addition, the C6 interaction constants were calculated following the theory of Anstee & O'Mara (1995) and subsequent papers. The CH and CN molecular bands were recomputed with the LIFBASE program (Luque & Crosley 1999). This poster details the procedures described above, the comparisons between spectra computed with the new lists and observed high-resolution spectra of the Sun and Arcturus, and an analysis of the impact of using different model atmospheres on the synthetic spectrum.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, the GPU's relatively small memory cannot hold the large dose-deposition coefficient (DDC) matrix that arises in cases with, e.g., a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors' group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors' method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multiple GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of the PP and MP problems are implemented on the CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is
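The storage scheme described above (a COO matrix on the host, split by beam angle into per-device compressed sparse row submatrices) can be sketched as follows. This is an illustrative host-side sketch only: the abstract does not say how entries are assigned to beam angles, so the `group_of_col` mapping from beamlet column to beam-angle group is a hypothetical stand-in, and plain Python lists stand in for device memory.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets into CSR arrays (row_ptr, col_idx, data)."""
    # Count entries per row, then prefix-sum into row pointers.
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]

    col_idx = [0] * len(vals)
    data = [0.0] * len(vals)
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr is not modified
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        col_idx[k] = c
        data[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, data


def split_by_column_group(n_rows, rows, cols, vals, group_of_col, n_groups):
    """Split a COO matrix into n_groups CSR submatrices by column group.

    group_of_col[c] is the (assumed) beam-angle group of beamlet column c.
    Column indices are kept global so results can be recombined directly.
    """
    parts = [([], [], []) for _ in range(n_groups)]
    for r, c, v in zip(rows, cols, vals):
        part_rows, part_cols, part_vals = parts[group_of_col[c]]
        part_rows.append(r)
        part_cols.append(c)
        part_vals.append(v)
    return [coo_to_csr(n_rows, *part) for part in parts]
```

Each returned triple would be the payload copied to one GPU; with four beam-angle groups this yields the four submatrices mentioned in the abstract.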
NASA Astrophysics Data System (ADS)
Su, Lin; Du, Xining; Liu, Tianyu; Xu, X. George
2014-06-01
An electron-photon coupled Monte Carlo code ARCHER -
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
Xiao, K; Chen, D. Z; Hu, X. S; Zhou, B
2014-06-01
Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which lead to several memory efficiency issues on the GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition accumulates dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer in GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on the CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Compared with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF
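The three-step partition described above can be illustrated with a small serial sketch. It only mirrors the data flow (append location-tagged records to a buffer, hand the buffer over, scatter into the volume on the CPU); there is no GPU, DMA transfer, or stream pipelining here, and the function names and the `(voxel, dose)` record layout are assumptions made for the example.

```python
def write_deposits_to_buffer(deposits_per_ray):
    """Step 1 stand-in: each ray appends (voxel_index, dose) records.

    On a GPU each thread would write to its own contiguous slots, so the
    writes are sequential and need no atomics; here a flat list plays
    the role of that buffer.
    """
    buffer = []
    for ray_deposits in deposits_per_ray:
        buffer.extend(ray_deposits)
    return buffer


def build_dose_volume(buffer, n_voxels):
    """Step 3: scatter-accumulate the buffered records on the CPU."""
    volume = [0.0] * n_voxels
    for voxel_index, dose in buffer:
        volume[voxel_index] += dose
    return volume


def deposit(deposits_per_ray, n_voxels):
    """Run steps 1 and 3; step 2 (the device-to-host copy) is implicit."""
    return build_dose_volume(write_deposits_to_buffer(deposits_per_ray), n_voxels)
```

The point of the design is that the random-access accumulation, which is the expensive pattern on the GPU, is moved to the CPU, while the GPU only performs ordered writes.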
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is most noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that the spin-level implementation is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance scales well under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
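One adaptation step of the mid-point insertion strategy described above might look like the following sketch. It assumes that an exchange rate has already been measured for every adjacent temperature pair and that a fixed number `k` of new temperatures is inserted per step; both the function name and that policy are assumptions for illustration, not details given in the abstract.

```python
def insert_at_worst_gaps(temps, rates, k):
    """Insert mid-point temperatures at the k gaps with the lowest exchange rate.

    temps is an ascending list of replica temperatures and rates[i] is the
    measured exchange rate between temps[i] and temps[i + 1]. Returns a new
    ascending list; rates must be re-measured before the next step.
    """
    if len(rates) != len(temps) - 1:
        raise ValueError("need exactly one rate per adjacent temperature pair")
    # Indices of the k most compromised gaps.
    worst = sorted(range(len(rates)), key=lambda i: rates[i])[:k]
    new_temps = list(temps)
    for i in worst:
        new_temps.append(0.5 * (temps[i] + temps[i + 1]))
    return sorted(new_temps)
```

Repeating this step after each measurement round refines the temperature set only where exchanges are failing, instead of refining it uniformly.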
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
Dong, Tingzing Tim; Tomov, Stanimire Z; Luszczek, Piotr R; Dongarra, Jack J
2015-01-01
As modern hardware keeps evolving, an increasingly effective approach to developing energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to the hybrid CPU-GPU algorithms that rely heavily on using the multicore CPU for specific parts of the workload. But for a system to benefit fully from the GPU's significantly higher energy efficiency, avoiding the use of the multicore CPU must be a primary design goal, so the system can rely more heavily on the more efficient GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, and the use of profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5x speedup on the K40 GPU.
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME). PMID:23264758
3D Laplace-domain full waveform inversion using a single GPU card
NASA Astrophysics Data System (ADS)
Shin, Jungkyun; Ha, Wansoo; Jun, Hyunggu; Min, Dong-Joo; Shin, Changsoo
2014-06-01
The Laplace-domain full waveform inversion is an efficient long-wavelength velocity estimation method for seismic datasets lacking low-frequency components. However, to invert a 3D velocity model, a large cluster of CPU cores has commonly been required to overcome the extremely long computing time caused by a large impedance matrix and a large number of source positions. In this study, a workstation with a single GPU card (NVIDIA GTX 580) is successfully used for the 3D Laplace-domain full waveform inversion rather than a large cluster of CPU cores. To exploit a GPU for our inversion algorithm, the routine for the iterative matrix solver is ported to the CUDA programming language for the forward and backward modeling parts with minimal modification of the remaining parts, which were originally written in Fortran 90. Using a uniformly structured grid set, nonzero values in the sparse impedance matrix can be arranged according to certain rules, which efficiently parallelize the preconditioned conjugate gradient method across the threads of the GPU card. We perform a numerical experiment to verify the accuracy of the floating point operations performed by a GPU to calculate the Laplace-domain wavefield. We also measure the efficiencies of the original CPU and modified GPU programs using a cluster of CPU cores and a workstation with a GPU card, respectively. Through the analysis, the parallelized inversion code for a GPU achieves a speedup of 14.7-24.6x compared to a CPU-based serial code, depending on the degrees of freedom of the impedance matrix. Finally, the practicality of the proposed algorithm is examined by inverting a 3D long-wavelength velocity model using wide azimuth real datasets in 3.7 days.
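The preconditioned conjugate gradient method mentioned above can be sketched serially as follows. The abstract does not name the preconditioner, so a simple Jacobi (diagonal) preconditioner is assumed here, along with a symmetric positive definite matrix stored densely as nested lists; a GPU version would instead distribute the matrix-vector product and the vector updates across threads.

```python
def matvec(A, x):
    """Dense matrix-vector product; the step a GPU port would parallelize."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Uses conjugate gradients with an (assumed) Jacobi preconditioner,
    i.e. M^-1 = diag(A)^-1.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Every operation inside the loop is either a matrix-vector product, a dot product, or an element-wise update, which is why a regular sparsity pattern maps well onto GPU threads.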
NASA Astrophysics Data System (ADS)
Ammazzalorso, F.; Bednarz, T.; Jelen, U.
2014-03-01
We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
Comparison of CPU and GPU based coding on low-complexity algorithms for display signals
NASA Astrophysics Data System (ADS)
Richter, Thomas; Simon, Sven
2013-09-01
Graphics Processing Units (GPUs) are freely programmable massively parallel general purpose processing units and thus offer the opportunity to off-load heavy computations from the CPU to the GPU. One application for GPU programming is image compression, where the massively parallel nature of GPUs promises high speed benefits. This article analyzes the predicaments of data-parallel image coding using the example of two high-throughput coding algorithms. The codecs discussed here were designed to answer a call from the Video Electronics Standards Association (VESA), and require only minimal buffering at the encoder and decoder side while avoiding any pixel-based feedback loops limiting the operating frequency of hardware implementations. Comparing CPU and GPU implementations of the codecs shows that GPU-based codecs are usually not considerably faster, or perform only with less than ideal rate-distortion performance. Analyzing the details of this result provides theoretical evidence that for any coding engine either parts of the entropy coding and bit-stream build-up must remain serial, or rate-distortion penalties must be paid when offloading all computations to the GPU.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu
2011-07-01
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now offers great capability for scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S_n) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for the vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, programming effort, and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
The convection and magnetic field of the Earth's outer core are expected to have vast length scales. Resolving these flows requires high-performance computing; in geodynamo simulations that use the spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters on the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To optimize further, we investigate three different algorithms of the SHT using GPUs. The first precomputes the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. Thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU-enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
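The division of labor described in this abstract (a cheap CPU pre-filter, then full matching only for suspicious packets) can be sketched as follows. This is a minimal illustration, not the authors' HPMA: the two-byte prefix filter, the function names, and the use of a plain substring search in place of GPU-side Aho-Corasick matching are all assumptions made for the example.

```python
def build_prefilter(signatures, n=2):
    """Collect the leading n bytes of every signature (assumed filter design)."""
    return {sig[:n] for sig in signatures}


def maybe_malicious(payload, prefixes, n=2):
    """Cheap CPU check: does any n-byte window match a signature prefix?"""
    return any(payload[i:i + n] in prefixes
               for i in range(len(payload) - n + 1))


def full_match(payload, signatures):
    """Stand-in for the expensive GPU stage: exact substring search."""
    return [sig for sig in signatures if sig in payload]


def inspect(packets, signatures):
    """Run the pre-filter on every packet; fully match only the suspicious ones."""
    prefixes = build_prefilter(signatures)
    results = {}
    for idx, payload in enumerate(packets):
        if maybe_malicious(payload, prefixes):
            results[idx] = full_match(payload, signatures)
        else:
            results[idx] = []  # filtered out on the CPU, never sent on
    return results
```

The pre-filter may pass harmless packets (false positives), but it never drops a packet that contains a signature, provided every signature is at least `n` bytes long.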
Single-pass GPU-raycasting for structured adaptive mesh refinement data
NASA Astrophysics Data System (ADS)
Kaehler, Ralf; Abel, Tom
2013-01-01
Structured Adaptive Mesh Refinement (SAMR) is a popular numerical technique to study processes with high spatial and temporal dynamic range. It reduces computational requirements by adapting the lattice on which the underlying differential equations are solved to most efficiently represent the solution. Particularly in astrophysics and cosmology, such simulations can now capture spatial scales ten orders of magnitude apart and more. The irregular locations and extents of the refined regions in the SAMR scheme, and the fact that different resolution levels partially overlap, pose a challenge for GPU-based direct volume rendering methods. kD-trees have proven to be advantageous to subdivide the data domain into non-overlapping blocks of equally sized cells, optimal for the texture units of current graphics hardware, but previous GPU-supported raycasting approaches for SAMR data using this data structure required a separate rendering pass for each node, preventing the application of many advanced lighting schemes that require simultaneous access to more than one block of cells. In this paper we present the first single-pass GPU-raycasting algorithm for SAMR data that is based on a kD-tree. The tree is efficiently encoded by a set of 3D textures, which allows complete rays to be sampled adaptively entirely on the GPU without any CPU interaction. We discuss two different data storage strategies to access the grid data on the GPU and apply them to several datasets to demonstrate the benefits of the proposed method.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on a modern Graphics Processing Unit (GPU). The presented algorithm improves our previous GPU-accelerated AIDW algorithm by adopting a fast k-nearest neighbors (kNN) search. AIDW must find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of two stages: kNN search and weighted interpolation. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm. PMID:27610308
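The two stages named in this abstract, kNN search followed by weighted interpolation, can be illustrated with a serial sketch. Two simplifications are assumed here and are not taken from the paper: the neighbor search is brute force rather than grid based, and the power parameter is a fixed argument rather than being adapted per point as AIDW does. The function name `idw_knn` is hypothetical.

```python
import math


def idw_knn(points, values, query, k=3, power=2.0):
    """Inverse-distance-weighted estimate at `query` from its k nearest points.

    Assumptions for illustration: brute-force neighbor search and a fixed
    `power`; AIDW would instead choose `power` from the local point density.
    """
    # Stage 1: kNN search (sort all data points by distance, keep the first k).
    nearest = sorted(
        ((math.dist(p, query), v) for p, v in zip(points, values)),
        key=lambda pair: pair[0],
    )[:k]
    # A data point that coincides with the query fixes the value exactly.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    # Stage 2: weighted interpolation with weights 1 / d**power.
    weights = [1.0 / d ** power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
```

Each interpolated point is processed independently of the others, which is what makes the method a natural fit for one-thread-per-point GPU execution.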
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
NASA Astrophysics Data System (ADS)
Kargaran, Hamed; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
2016-04-01
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map, along with the Xorshift PRNG, has been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs, such as MATLAB, FORTRAN and the Miller-Park algorithm, by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
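As a point of reference for the Xorshift component mentioned in this abstract, a minimal sketch of Marsaglia's 32-bit xorshift generator follows, with one state per independent sequence. It covers only the xorshift step: the middle-square and chaotic-map seeding that the authors combine with it is not reproduced, and the helper name `stream` is an assumption for illustration.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of Marsaglia's 32-bit xorshift (shifts 13, 17, 5).

    The state must be a nonzero 32-bit integer; zero is a fixed point.
    """
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def stream(seed, n):
    """Return n uniform floats in [0, 1) from one independent sequence."""
    out = []
    state = seed
    for _ in range(n):
        state = xorshift32(state)
        out.append(state / 2**32)
    return out
```

In the independent sequence method, each GPU thread would own its own state; in this sketch that corresponds to calling `stream` with a different nonzero seed per thread.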
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu
2011-07-01
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now offers great capability for solving scientific problems. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution of the Boltzmann equation involves the discrete ordinates (S_n) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort, and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 (relative to one Intel Xeon X5670 chip) to 8.14 (relative to one Intel Core Q6600 chip) without flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
Modern Methods of Bundle Adjustment on the Gpu
NASA Astrophysics Data System (ADS)
Hänsch, R.; Drude, I.; Hellwich, O.
2016-06-01
The task of computing 3D reconstructions from large amounts of data has become an active field of research in recent years. Based on an initial estimate provided by structure from motion, bundle adjustment seeks to find a solution that is optimal for all cameras and 3D points. The corresponding nonlinear optimization problem is usually solved by the Levenberg-Marquardt algorithm combined with conjugate gradient descent. While many adaptations and extensions to the classical bundle adjustment approach have been proposed, only a few works consider the acceleration potential of GPU systems. This paper elaborates on the possible time and space savings when fitting the implementation strategy to the terms and requirements of realizing a bundler on heterogeneous CPU-GPU systems. Instead of focusing on the standard approach of Levenberg-Marquardt optimization alone, nonlinear conjugate gradient descent and alternating resection-intersection are studied as two alternatives. The experiments show that alternating resection-intersection in particular reaches low error rates very fast, but converges to larger error rates than Levenberg-Marquardt. PBA, one of the current state-of-the-art bundlers, converges more slowly in 50% of the test cases and needs 1.5-2 times more memory than the Levenberg-Marquardt implementation.
GPU acceleration of particle-in-cell methods
NASA Astrophysics Data System (ADS)
Cowan, Benjamin; Cary, John; Meiser, Dominic
2015-11-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR).
Raytracing Dynamic Scenes on the GPU Using Grids.
Guntury, S; Narayanan, P J
2012-01-01
Raytracing dynamic scenes at interactive rates has received a lot of attention recently. We present a few strategies for high performance raytracing on a commodity GPU. The construction of grids needs sorting, which is fast on today's GPUs. The grid is thus the acceleration structure of choice for dynamic scenes, as per-frame rebuilding is required. We advocate the use of appropriate data structures for each stage of raytracing, resulting in multiple structure building per frame. A perspective grid built for the camera achieves perfect coherence for primary rays. A perspective grid built with respect to each light source provides the best performance for shadow rays. Spherical grids handle lights positioned inside the model space and handle spotlights. Uniform grids are best for reflection and refraction rays with little coherence. We propose an Enforced Coherence method to bring coherence to them by rearranging the ray to voxel mapping using sorting. This gives the best performance on GPUs with only user-managed caches. We also propose a simple Independent Voxel Walk method, which performs best by taking advantage of the L1 and L2 caches on recent GPUs. We achieve over 10 fps of total rendering on the Conference model with one light source and one reflection bounce, while rebuilding the data structure for each stage. Ideas presented here are likely to give high performance on future GPUs as well as other manycore architectures. PMID:21383409
Accelerated finite element elastodynamic simulations using the GPU
Huthwaite, Peter
2014-01-15
An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source.
Multi-GPU Accelerated Simulation of Dynamically Evolving Fluid Pathways
NASA Astrophysics Data System (ADS)
Räss, Ludovic; Omlin, Samuel; Moulas, Evangelos; Simon, Nina S. C.; Podladchikov, Yuri
2014-05-01
Fluid flow in porous rocks, both naturally occurring and caused by reservoir operations, mostly takes place along localized high permeability pathways. Pervasive flooding of the rock matrix is rarely observed, in particular for low permeability rocks. The pathways appear to form dynamically in response to the fluid flow itself; the number of pathways, their location and their hydraulic conductivity may change in time. We propose a physically and thermodynamically consistent model that describes the formation and evolution of fluid pathways. The model consists of a system of equations describing poro-elasto-viscous deformation and flow. We have implemented the strongly coupled equations in a numerical model. Nonlinearity of the solid rheology is also taken into account. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme. We have ported it to C-CUDA using MPI to run it on multi-GPU clusters. Numerical tuning of the application based on memory bandwidth throughput allows it to approach hardware peak performance. The high-resolution three-dimensional simulations we conducted predict the formation of dynamically evolving high porosity and permeability pathways as a natural outcome of porous flow coupled with rock deformation.
A GPU Reaction Diffusion Soil-Microbial Model
NASA Astrophysics Data System (ADS)
Falconer, Ruth; Houston, Alasdair; Schmidt, Sonja; Otten, Wilfred
2014-05-01
Parallelised algorithms are frequent in bioinformatics as a consequence of the close link to informatics; however, in the field of soil science and ecology they are less prevalent. A current challenge in soil ecology is to link habitat structure to microbial dynamics. Soil science is therefore entering the 'big data' paradigm as a consequence of integrating data pertinent to the physical soil environment, obtained via imaging, with theoretical models describing growth and development of microbial dynamics, permitting accurate analyses of spatio-temporal properties of different soil microenvironments. The microenvironment is often captured by 3D imaging (CT tomography), which yields large datasets, and when these are used in computational studies the physical sizes of the samples that are amenable to computation are less than 1 cm³. Today's commodity graphics cards are programmable and possess a data parallel architecture that in many cases is capable of out-performing the CPU in terms of computational rates. The programmable aspect is achieved via a low-level parallel programming language (CUDA, OpenCL and DirectX). We ported a Soil-Microbial Model onto the GPU using the DirectX Compute API. We noted a significant computational speed-up as well as an increase in the physical size that can be simulated. The drawbacks of such an approach include limited numerical precision and the steep learning curve associated with GPGPU technologies.
FARGO3D: A New GPU-oriented MHD Code
NASA Astrophysics Data System (ADS)
Benítez-Llambay, Pablo; Masset, Frédéric S.
2016-03-01
We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.
Toward Fast Computation of Dense Image Correspondence on the GPU
Duchaineau, M; Cohen, J; Vaidya, S
2007-08-13
Large-scale video processing systems are needed to support human analysis of massive collections of image streams. Video, from both current small-format and future large-format camera systems, constitutes the single largest data source of the near future, dwarfing the output of all other data sources combined. A critical component to further advances in the processing and analysis of such video streams is the ability to register successive video frames into a common coordinate system at the pixel level. This capability enables further downstream processing, such as background/mover segmentation, 3D model extraction, and compression. We present here our recent work on computing these correspondences. We employ a coarse-to-fine hierarchical approach, matching pixels from the domain of a source image to the domain of a target image at successively higher resolutions. Our diamond-style image hierarchy, with total pixel counts increasing by only a factor of two at each level, improves the prediction quality as we advance from level to level, and reduces potential grid artifacts in the results. We demonstrate the quality of our approach on real aerial city imagery. We find that registration accuracy is generally on the order of one quarter of a pixel. We also benchmark the fundamental processing kernels on the GPU to show the promise of the approach for real-time video processing applications.
GPU-accelerated minimum distance and clearance queries.
Krishnamurthy, Adarsh; McMains, Sara; Haller, Kirk
2011-06-01
We present practical algorithms for accelerating distance queries on models made of trimmed NURBS surfaces using programmable Graphics Processing Units (GPUs). We provide a generalized framework for using GPUs as coprocessors in accelerating CAD operations. By supplementing surface data with a surface bounding-box hierarchy on the GPU, we answer distance queries such as finding the closest point on a curved NURBS surface given any point in space and evaluating the clearance between two solid models constructed using multiple NURBS surfaces. We simultaneously output the parameter values corresponding to the solution of these queries along with the model space values. Though our algorithms make use of the programmable fragment processor, the accuracy is based on the model space precision, unlike earlier graphics algorithms that were based only on image space precision. In addition, we provide theoretical bounds for both the computed minimum distance values as well as the location of the closest point. Our algorithms are at least an order of magnitude faster and about two orders of magnitude more accurate than the commercial solid modeling kernel ACIS. PMID:21474862
Immersed boundary method implemented in lattice Boltzmann GPU code
NASA Astrophysics Data System (ADS)
Devincentis, Brian; Smith, Kevin; Thomas, John
2015-11-01
Lattice Boltzmann is well suited to efficiently utilize the rapidly increasing compute power of GPUs to simulate viscous incompressible flows. Fluid-structure interaction with solids of arbitrarily complex geometry can be modeled in this framework with the immersed boundary method (IBM). In IBM a solid is modeled by its surface which applies a force at the neighboring lattice points. The majority of published IBMs require solving a linear system in order to satisfy the no-slip condition. However, the method presented by Wang et al. (2014) is unique in that it produces equally accurate results without solving a linear system. Furthermore, the algorithm can be applied in a parallel manner over the immersed boundary and is, therefore, well suited for GPUs. Here, a 2D and 3D version of their algorithm is implemented in Sailfish CFD, a GPU-based open source lattice Boltzmann code. One issue unaddressed by most published work is how to correct force and torque calculated from IBM for translating and rotating solids. These corrections are necessary because the fluid inside the solid affects its inertia in a non-trivial manner. Therefore, this implementation uses the Lagrangian points approximation correction shown by Suzuki and Inamuro (2011) to be accurate.
GPU-enabled molecular dynamics simulations of ankyrin kinase complex
NASA Astrophysics Data System (ADS)
Gautam, Vertika; Chong, Wei Lim; Wisitponchai, Tanchanok; Nimmanpipug, Piyarat; Zain, Sharifuddin M.; Rahman, Noorsaadah Abd.; Tayapiwatana, Chatchai; Lee, Vannajan Sanghiran
2014-10-01
The ankyrin repeat (AR) protein can be used as a versatile scaffold for protein-protein interactions. It has been found that the heterotrimeric complex between integrin-linked kinase (ILK), PINCH, and parvin is an essential signaling platform, serving as a convergence point for integrin and growth-factor signaling and regulating cell adhesion, spreading, and migration. Using ILK-AR, which has high affinity for PINCH1, as our model system, we explored a structure-based computational protocol to probe and characterize binding affinity hot spots at protein-protein interfaces. In this study, long-time-scale GPU-accelerated molecular dynamics (MD) simulations in AMBER12 were performed to locate the hot spots of protein-protein interaction through Molecular Mechanics-Poisson-Boltzmann Surface Area/Generalized Born Solvent Area (MM-PBSA/GBSA) analysis of the MD trajectories. Our calculations suggest good binding affinity of the complex and identify the residues critical to the binding.
GPU surface extraction using the closest point embedding
NASA Astrophysics Data System (ADS)
Kim, Mark; Hansen, Charles
2015-01-01
Isosurface extraction is a fundamental technique used for both surface reconstruction and mesh generation. One method to extract well-formed isosurfaces is a particle system; unfortunately, particle systems can be slow. In this paper, we introduce an enhanced parallel particle system that uses the closest point embedding as the surface representation to speed up the particle system for isosurface extraction. The closest point embedding is used in the Closest Point Method (CPM), a technique that uses a standard three-dimensional numerical PDE solver on two-dimensional embedded surfaces. To take full advantage of the closest point embedding, it is coupled with a Barnes-Hut tree code on the GPU. This new technique produces well-formed, conformal unstructured triangular and tetrahedral meshes from labeled multi-material volume datasets. Further, this new parallel implementation of the particle system is faster than any known method for conformal multi-material mesh extraction. The resulting speed-ups gained in this implementation can reduce the time from labeled data to mesh from hours to minutes and benefit users, such as bioengineers, who employ triangular and tetrahedral meshes.
Deshmukh, Nishikant P.; Kang, Hyun Jae; Billings, Seth D.; Taylor, Russell H.; Hager, Gregory D.; Boctor, Emad M.
2014-01-01
A system for real-time ultrasound (US) elastography will advance interventions for the diagnosis and treatment of cancer by advancing methods such as thermal monitoring of tissue ablation. A multi-stream graphics processing unit (GPU) based accelerated normalized cross-correlation (NCC) elastography, with a maximum frame rate of 78 frames per second, is presented in this paper. A study of NCC window size is undertaken to determine the effect on frame rate and the quality of output elastography images. This paper also presents a novel system for Online Tracked Ultrasound Elastography (O-TRuE), which extends prior work on an offline method. By tracking the US probe with an electromagnetic (EM) tracker, the system selects in-plane radio frequency (RF) data frames for generating high quality elastograms. A novel method for evaluating the quality of an elastography output stream is presented, suggesting that O-TRuE generates more stable elastograms than generated by untracked, free-hand palpation. Since EM tracking cannot be used in all systems, an integration of real-time elastography and the da Vinci Surgical System is presented and evaluated for elastography stream quality based on our metric. The da Vinci surgical robot is outfitted with a laparoscopic US probe, and palpation motions are autonomously generated by customized software. It is found that a stable output stream can be achieved, which is affected by both the frequency and amplitude of palpation. The GPU framework is validated using data from in-vivo pig liver ablation; the generated elastography images identify the ablated region, outlined more clearly than in the corresponding B-mode US images. PMID:25541954
NASA Astrophysics Data System (ADS)
Berczik, Peter; Spurzem, Rainer; Wang, Long; Zhong, Shiyan; Huang, Siyi
2013-10-01
We present direct astrophysical N-body simulations with up to a few million bodies using our parallel MPI/CUDA code on large GPU clusters in China, Ukraine and Germany, with different kinds of GPU hardware. These clusters are directly linked under the Chinese Academy of Sciences special GPU cluster program in cooperation with ICCS (International Center for Computational Science). We reach about half the peak Kepler K20 GPU performance for our φ-GPU code [2] in a real application scenario with individual hierarchical block time-steps, high-order (4th, 6th and 8th) Hermite integration schemes and a real core-halo density structure of the modeled stellar systems. The code and hardware are mainly used to simulate star clusters [23, 24] and galactic nuclei with supermassive black holes [20], in which correlations between distant particles cannot be neglected.
A GPU-accelerated toolbox for the solutions of systems of linear equations
NASA Astrophysics Data System (ADS)
Humphrey, John R., Jr.; Paolini, Aaron L.; Price, Daniel K.; Kelmelis, Eric J.
2009-05-01
The modern graphics processing unit (GPU) found in many off-the-shelf personal computers is a very high performance computing engine that often goes unutilized. The tremendous computing power coupled with reasonable pricing has made the GPU a topic of interest in recent research. One application for such power is the solution of large systems of linear equations. Two popular solution domains are direct solution, via the LU decomposition, and iterative solution, via a solver such as the Generalized Minimal Residual method (GMRES). Our research focuses on the acceleration of such processes, utilizing the latest in GPU technologies. We show performance that exceeds that of a standard computer by an order of magnitude, thus significantly reducing the run time of the numerous applications that depend on the solution of a set of linear equations.
High Performance Molecular Dynamic Simulation on Single and Multi-GPU Systems
Villa, Oreste; Chen, Long; Krishnamoorthy, Sriram
2010-05-30
The programming techniques supported and employed on GPU and multi-GPU systems are not sufficient to address problems exhibiting irregular and unbalanced workloads, such as Molecular Dynamic (MD) simulations of systems with non-uniform densities. In this paper, we propose a task-based dynamic load-balancing solution for MD simulations on single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in existing APIs such as NVIDIA’s CUDA. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
Fast GPU-based calculations in few-body quantum scattering
NASA Astrophysics Data System (ADS)
Pomerantsev, V. N.; Kukulin, V. I.; Rubtsova, O. A.; Sakhiev, S. K.
2016-07-01
A fundamentally novel approach to solving few-particle (many-dimensional) quantum scattering problems is described. The approach is based on a complete discretization of the few-particle continuum and the use of massively parallel GPU computations of the integral kernels of the scattering equations. The discretization of the continuous spectrum of the few-particle Hamiltonian is realized by projecting all scattering operators and wave functions onto the stationary wave-packet basis. This projection procedure replaces singular multidimensional integral equations with linear matrix equations that have finite matrix elements. Different aspects of employing multithreaded GPU computing for fast calculation of the matrix kernel of the equation are studied in detail. As a result, the fully realistic three-body scattering problem above the break-up threshold is solved on an ordinary desktop PC with a GPU in a rather short computational time.
Key Techniques of Flat-Earth Phase Removal by Acceleration on the GPU
NASA Astrophysics Data System (ADS)
Gao, Zeng; Zeng, Qiming; Jiao, Jian; Cui, Xiai; Liang, Cunren
2013-01-01
Because InSAR processing is complex and time-consuming, parallel computing has been drawing more and more attention from researchers. GPUs (Graphics Processing Units) have become an increasingly important parallel platform for image processing in recent years. They are cheap and convenient compared with large-scale, expensive high performance computing clusters, which have a small marketplace presence. In this paper, a parallel implementation on the GPU is introduced. Taking flat-earth phase removal as an example, we introduce two different techniques that can accelerate applications dramatically on a GPU. The experimental results show that the output computed on the GPU is the same as on the CPU, and that both techniques improve performance.
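Flat-earth phase removal is a per-pixel operation, which is why it maps well onto a GPU. The sketch below shows the operation itself in plain Python, under the simplifying assumption that the flat-earth phase is a linear ramp along the range (column) direction; the function name `remove_flat_earth` and the parameter `fringe_rate` are hypothetical, and the abstract does not specify the two acceleration techniques the authors used.

```python
import cmath


def remove_flat_earth(interferogram, fringe_rate):
    """Multiply each complex pixel by the conjugate of an assumed linear phase ramp.

    interferogram: list of rows, each a list of complex pixels.
    fringe_rate: assumed flat-earth phase increment per column, in radians.
    """
    return [
        [pixel * cmath.exp(-1j * fringe_rate * col) for col, pixel in enumerate(row)]
        for row in interferogram
    ]


# Synthetic interferogram containing only the flat-earth ramp.
rate = 0.3
ramp_only = [[cmath.exp(1j * rate * col) for col in range(8)] for _ in range(4)]
flattened = remove_flat_earth(ramp_only, rate)
print(max(abs(cmath.phase(p)) for row in flattened for p in row))
```

Each output pixel depends only on its own input pixel and column index, so on a GPU one thread per pixel would be a natural mapping.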
de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.
2014-01-01
Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPUs). Among these algorithms, the Firefly Algorithm (FA) is a recently proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with a multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms from the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms, is a more suitable choice and a relevant contribution to the variable selection problem. Additionally, the results demonstrated that the FA-MLR run on a GPU can be five times faster than its sequential implementation. PMID:25493625
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Real-time generation of infrared ocean scene based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Zhaoyi; Wang, Xun; Lin, Yun; Jin, Jianqiu
2007-12-01
Infrared (IR) image synthesis for ocean scenes has become increasingly important, especially for remote sensing and military applications. Although a number of works present ready-to-use simulations, those techniques cover only a few of the possible ways in which water interacts with the environment, and the detailed calculation of ocean temperature has rarely been considered by previous investigators. With the advance of programmable features of graphics cards, many algorithms previously limited to offline processing have become feasible for real-time use. In this paper, we propose an efficient algorithm for real-time rendering of infrared ocean scenes using the newest features of programmable graphics processors (GPUs). It differs from previous works in three aspects: adaptive GPU-based ocean surface tessellation, a sophisticated thermal balance equation for the ocean surface, and GPU-based rendering of the infrared ocean scene. Finally, some resulting infrared images are shown, which are in good accordance with real images.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems
Teodoro, George; Kurc, Tahsin M.; Pan, Tony; Cooper, Lee A.D.; Kong, Jun; Widener, Patrick; Saltz, Joel H.
2014-01-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches. PMID:25419545
Gallarno, George; Rogers, James H; Maxwell, Don E
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
NASA Astrophysics Data System (ADS)
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness describes the random or irregular microtopography of a surface. The standard deviation of surface height and the surface correlation length describe the statistical variation of the random component of surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problems can be effectively solved using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to the data collected by a surface roughness tester based on the laser triangulation principle during a field experiment in April 2012. The total number of data points was 52,040. The computation took 47 seconds on a Fermi GTX 590 GPU, whereas its serial CPU version took 5422 seconds, a significant 115x speedup.
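The two roughness parameters named above can be sketched for a single 1-D height profile as follows. This is a minimal illustration, not the authors' 2D GPU method: it assumes the common convention that the correlation length is the lag at which the normalized autocorrelation first falls to 1/e, and the function names are hypothetical.

```python
import math


def rms_height(z):
    """Standard deviation of the surface heights about their mean."""
    mean = sum(z) / len(z)
    return math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))


def autocorrelation(z, lag):
    """Normalized autocorrelation of the height profile at an integer lag."""
    mean = sum(z) / len(z)
    num = sum((z[i] - mean) * (z[i + lag] - mean) for i in range(len(z) - lag))
    den = sum((v - mean) ** 2 for v in z)
    return num / den


def correlation_length(z, dx):
    """Smallest lag distance at which the autocorrelation drops to 1/e or below."""
    for lag in range(1, len(z)):
        if autocorrelation(z, lag) <= 1.0 / math.e:
            return lag * dx
    return None  # never decorrelates within the profile


profile = [1.0, -1.0, 1.0, -1.0]  # hypothetical heights, sample spacing 0.5
print(rms_height(profile), correlation_length(profile, 0.5))
```

The autocorrelation at each lag is independent of every other lag, which is the kind of structure a massively parallel implementation can exploit. The reported timings give 5422 / 47 ≈ 115, consistent with the stated speedup.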
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
NASA Astrophysics Data System (ADS)
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
General-purpose graphics processing units (GPGPUs) have been used to improve computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design and the load sharing between vertex and fragment shaders were optimized. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR against a MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the processing using MATLAB (i.e., PSNR<52.51 dB). Comparable CNR results were obtained from both processing methods (i.e., 11.31). With the mobile GPU implementation, a frame rate of 57.6 Hz was achieved. The total execution time was 17.4 ms, which was shorter than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
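The PSNR comparison against a reference output can be sketched as below. This uses the standard definition, PSNR = 10 log10(peak^2 / MSE); the 8-bit peak value of 255 and the function name `psnr` are assumptions, since the abstract does not state the data range used.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical pixel rows from a reference image and a GPU-processed image.
reference_row = [10.0, 20.0, 30.0, 40.0]
gpu_row = [10.0, 21.0, 30.0, 39.0]
print(psnr(reference_row, gpu_row))
```

A higher PSNR means the two outputs agree more closely, and identical outputs give an infinite value.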
GPU-accelerated 3D neutron diffusion code based on finite difference method
Xu, Q.; Yu, G.; Wang, K.
2012-07-01
The finite difference method, a traditional numerical solution to the neutron diffusion equation, is considered simpler and more precise than the coarse mesh nodal methods, but its wide application is hindered by the huge memory and prohibitive computation time it requires. In recent years, the concept of General-Purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Pre-conditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 was obtained using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and the Chebyshev extrapolation technique. (authors)
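To make the finite difference approach concrete, the sketch below solves a much simpler problem than the paper's: a one-group, 1-D, fixed-source diffusion equation, -D φ'' + Σa φ = S, with zero flux at both ends, using plain Jacobi iteration. All names and parameter values are hypothetical, and the abstract does not say which iteration scheme 3DFD-GPU uses; Jacobi is chosen here only because every mesh point updates independently from the previous iterate.

```python
def solve_diffusion_1d(n, h, d_coef, sigma_a, source, max_iter=20000, tol=1e-10):
    """Jacobi iteration for -D*phi'' + sigma_a*phi = source on n interior points."""
    phi = [0.0] * (n + 2)  # indices 0 and n+1 are boundary nodes fixed at zero
    off = d_coef / h ** 2  # coupling to each neighbour
    diag = 2.0 * off + sigma_a  # diagonal coefficient of the discrete operator
    for _ in range(max_iter):
        new = phi[:]
        for i in range(1, n + 1):
            # Each point depends only on the previous iterate, so these
            # updates could all be computed in parallel.
            new[i] = (source + off * (phi[i - 1] + phi[i + 1])) / diag
        change = max(abs(a - b) for a, b in zip(new, phi))
        phi = new
        if change < tol:
            break
    return phi


flux = solve_diffusion_1d(n=20, h=1.0, d_coef=1.0, sigma_a=0.1, source=1.0)
print(max(flux))
```

A 3-D multi-group code has the same structure per mesh point, with more neighbours and a coupling between energy groups.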
Real-time high definition H.264 video decode using the Xbox 360 GPU
NASA Astrophysics Data System (ADS)
Arevalo Baeza, Juan Carlos; Chen, William; Christoffersen, Eric; Dinu, Daniel; Friemel, Barry
2007-09-01
The Xbox 360 is powered by three dual pipeline 3.2 GHz IBM PowerPC processors and a 500 MHz ATI graphics processing unit. The Graphics Processing Unit (GPU) is a special-purpose device, intended to create advanced visual effects and to render realistic scenes for the latest Xbox 360 games. In this paper, we report work on using the GPU as a parallel processing unit to accelerate the decoding of H.264/AVC high-definition (1920x1080) video. We report our experiences in developing a real-time, software-only high-definition video decoder for the Xbox 360.
A 3D front tracking method on a CPU/GPU system
Bo, Wurigen; Grove, John
2011-01-21
We describe a method for porting a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are computed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed on test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to the study of the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the two-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to fully exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, in the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been combined with shared memory approaches that take advantage of multi-core platforms. The second implementation, the multi-GPU code version, is based on the Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement from using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it has been demonstrated that adding shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case, GPU computing is about twice as fast as the fine-tuned CPU version in both the one-node and two-node cases. However, for massive computations explicit vector instructions are not worthwhile, since memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version.
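The leapfrog update at the core of FDTD can be sketched in one dimension. This is a deliberately reduced illustration, not the paper's two-dimensional vibroacoustic scheme: it assumes a uniform acoustic medium in normalized units, rigid walls at both ends, and a hypothetical function name `fdtd_1d`.

```python
def fdtd_1d(n_cells, n_steps, courant=0.5):
    """Staggered-grid leapfrog update of pressure p and particle velocity v."""
    p = [0.0] * n_cells  # pressure at cell centres
    v = [0.0] * (n_cells + 1)  # velocity at cell faces; end faces stay 0 (rigid walls)
    p[n_cells // 2] = 1.0  # initial pressure pulse in the middle
    for _ in range(n_steps):
        # Velocity update from the pressure gradient (interior faces only).
        for i in range(1, n_cells):
            v[i] -= courant * (p[i] - p[i - 1])
        # Pressure update from the velocity divergence.
        for i in range(n_cells):
            p[i] -= courant * (v[i + 1] - v[i])
    return p, v


pressure, velocity = fdtd_1d(n_cells=50, n_steps=200)
print(sum(pressure))
```

Within each half-step every grid point reads only values from the other field, so the inner loops are the part that vector instructions, multiple cores, or GPU threads can process in parallel. With rigid walls the total pressure is conserved, which gives a simple sanity check.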
New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock.
Asadchev, Andrey; Gordon, Mark S
2012-11-13
In this article, a new multithreaded Hartree-Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to a 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code. PMID:26605582