On implementation of EM-type algorithms in the stochastic models for a matrix computing on GPU
Gorshenin, Andrey K.
2015-03-10
The paper discusses the main ideas of an implementation of EM-type algorithms for computing on the graphics processors and the application for the probabilistic models based on the Cox processes. An example of the GPU’s adapted MATLAB source code for the finite normal mixtures with the expectation-maximization matrix formulas is given. The testing of computational efficiency for GPU vs CPU is illustrated for the different sample sizes.
gpuPOM: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Zhang, Y.; Fu, H.; Oey, L.-Y.; Xu, F.; Yang, G.
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
NASA Astrophysics Data System (ADS)
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionnalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
Moss, Nicholas
2016-07-15
The Kokkos Clang compiler is a version of the Clang C++ compiler that has been modified to perform targeted code generation for Kokkos constructs in the goal of generating highly optimized code and to provide semantic (domain) awareness throughout the compilation toolchain of these constructs such as parallel for and parallel reduce. This approach is taken to explore the possibilities of exposing the developer’s intentions to the underlying compiler infrastructure (e.g. optimization and analysis passes within the middle stages of the compiler) instead of relying solely on the restricted capabilities of C++ template metaprogramming. To date our current activities have focused on correct GPU code generation and thus we have not yet focused on improving overall performance. The compiler is implemented by recognizing specific (syntactic) Kokkos constructs in order to bypass normal template expansion mechanisms and instead use the semantic knowledge of Kokkos to directly generate code in the compiler’s intermediate representation (IR); which is then translated into an NVIDIA-centric GPU program and supporting runtime calls. In addition, by capturing and maintaining the higher-level semantics of Kokkos directly within the lower levels of the compiler has the potential for significantly improving the ability of the compiler to communicate with the developer in the terms of their original programming model/semantics.
Astronomia para/com crianças carentes em Limeira
NASA Astrophysics Data System (ADS)
Bretones, P. S.; Oliveira, V. C.
2003-08-01
Em 2001, o Instituto Superior de Ciências Aplicadas (ISCA Faculdades de Limeira) iniciou um projeto pelo qual o Observatório do Morro Azul empreendeu uma parceria com o Centro de Promoção Social Municipal (CEPROSOM), instituição mantida pela Prefeitura Municipal de Limeira para atender crianças e adolescentes carentes. O CEPROSOM contava com dois projetos: Projeto Centro de Convivência Infantil (CCI) e Programa Criança e Adolescente (PCA), que atendiam crianças e adolescentes em Centros Comunitários de diversas áreas da cidade. Esses projetos têm como prioridades estabelecer atividades prazerosas para as crianças no sentido de retirá-las das ruas. Assim sendo, as crianças passaram a ter mais um tipo de atividade - as visitas ao observatório. Este painel descreve as várias fases do projeto, que envolveu: reuniões de planejamento, curso de Astronomia para as orientadoras dos CCIs e PCAs, atividades relacionadas a visitas das crianças ao Observatório, proposta de construção de gnômons e relógios de Sol nos diversos Centros Comunitários de Limeira e divulgação do projeto na imprensa. O painel inclui discussões sobre a aprendizagem de crianças carentes, relatos que mostram a postura das orientadoras sobre a pertinência do ensino de Astronomia, relatos do monitor que fez o atendimento no Observatório e o que o número de crianças atendidas representou para as atividades da instituição desde o início de suas atividades e, em particular, em 2001. Os resultados são baseados na análise de relatos das orientadoras e do monitor do Observatório, registros de visitas e matérias da imprensa local. Conclui com uma avaliação do que tal projeto representou para as Instituições participantes. Para o Observatório, em particular, foi feita uma análise com relação às outras modalidades de atendimentos que envolvem alunos de escolas e público em geral. Também é abordada a questão do compromisso social do Observatório na educação do
GPU-Powered Coherent Beamforming
NASA Astrophysics Data System (ADS)
Magro, A.; Adami, K. Zarb; Hickish, J.
2015-03-01
Graphics processing units (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ˜1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel as well as the transient detection pipeline with beamforming capabilities as well as results of test observation.
GPU applications for data processing
Vladymyrov, Mykhailo; Aleksandrov, Andrey; Tioukov, Valeri
2015-12-31
Modern experiments that use nuclear photoemulsion imply fast and efficient data acquisition from the emulsion can be performed. The new approaches in developing scanning systems require real-time processing of large amount of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how the GPU-accelerated emulsion processing helped us to rise the scanning speed by factor of nine.
OV-Wav: um novo pacote para análise multiescalar em astronomia
NASA Astrophysics Data System (ADS)
Pereira, D. N. E.; Rabaça, C. R.
2003-08-01
Wavelets e outras formas de análise multiescalar têm sido amplamente empregadas em diversas áreas do conhecimento, sendo reconhecidamente superiores a técnicas mais tradicionais, como as análises de Fourier e de Gabor, em certas aplicações. Embora a teoria dos wavelets tenha começado a ser elaborada há quase trinta anos, seu impacto no estudo de imagens astronômicas tem sido pequeno até bem recentemente. Apresentamos um conjunto de programas desenvolvidos ao longo dos últimos três anos no Observatório do Valongo/UFRJ que possibilitam aplicar essa poderosa ferramenta a problemas comuns em astronomia, como a remoção de ruído, a detecção hierárquica de fontes e a modelagem de objetos com perfis de brilho arbitrários em condições não ideais. Este pacote, desenvolvido para execução em plataforma IDL, teve sua primeira versão concluída recentemente e está sendo disponibilizado à comunidade científica de forma aberta. Mostramos também resultados de testes controlados ao quais submetemos os programas, com a sua aplicação a imagens artificiais, com resultados satisfatórios. Algumas aplicações astrofísicas foram estudadas com o uso do pacote, em caráter experimental, incluindo a análise da componente de luz difusa em grupos compactos de galáxias de Hickson e o estudo de subestruturas de nebulosas planetárias no espaço multiescalar.
Randomized selection on the GPU
Monroe, Laura Marie; Wendelberger, Joanne R; Michalak, Sarah E
2011-01-13
We implement here a fast and memory-sparing probabilistic top N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic-guess-and-chcck process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
Distributed GPU Computing in GIScience
NASA Astrophysics Data System (ADS)
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strived to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by increasing amount of datasets from different domains, such as social media, earth observation, environmental sensing (Li et al., 2013). Meanwhile CPU-based computing resources structured as cluster or supercomputer is costly. In the past several years with GPU-based technology matured in both the capability and performance, GPU-based computing has emerged as a new computing paradigm. Compare to traditional computing microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding high parallel processing capability with cost-effectiveness and efficiency(Owens et al., 2008), although it is initially designed for graphical rendering in visualization pipe. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within distributed environment. Within this framework, 1) for each single computer, computing resources of both GPU-based and CPU-based can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual super computer to support CPU-based and GPU-based computing in distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference : 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
Uma grade de perfis teóricos para estrelas massivas em transição
NASA Astrophysics Data System (ADS)
Nascimento, C. M. P.; Machado, M. A.
2003-08-01
Na XXVIII Reunião Anual da Sociedade Astronômica Brasileira (2002) apresentamos uma grade de perfis calculados de acordo com os pontos da trajetória evolutiva de metalicidade solar, Z = 0.02 e taxa de perda de massa () padrão, para estrelas com massa inicial de 25, 40, 60, 85 e 120 massas solares. Estes perfis foram calculados com o auxílio de um código numérico adequado para descrever os ventos de objetos massivos, supondo simetria esférica, estacionaridade e homogeneidade. No presente trabalho, apresentamos a complementação da grade com os perfis teóricos relativos às trajetórias de Z = 0.02 com taxa de perda de massa dobrada em relação a padrão (2´), e de metalicidade Z = 0.008. Para cada ponto das três trajetórias obtemos os perfis teóricos de Ha, Hb, Hg e Hd, e como esperado eles se apresentam em pura emissão, pura absorção ou em P-Cygni. Para valores de taxa de perda de massa muito baixos (~10-7) não há formação de linhas, o que é visto nos primeiros pontos em todas as trajetórias. Em geral, para um mesmo ponto a componente de emissão diminui e a absorção aumenta de Ha para Hd. É verificado que as trajetórias com Z = 0.02 e padrão possuem menos circuitos (loops) do que as com metalicidade Z = 0.02 e 2´ padrão, e seus perfis são, em geral, menos intensos. Em relação a trajetória de Z = 0.008, verifica-se menos circuitos e maior variação em luminosidade, e seus perfis mostram-se em, algumas trajetórias, mais intensos. Verificamos também que, pontos distintos em uma mesma trajetória, apresentam perfis diferentes para valores similares de luminosidade e temperatura efetiva. Sendo assim, uma grade de perfis teóricos parece ser útil para fornecer uma informação preliminar sobre o estágio evolutivo de uma estrela massiva.
Vínculos observacionais para o processo-S em estrelas gigantes de Bário
NASA Astrophysics Data System (ADS)
Smiljanic, R. H. S.; Porto de Mello, G. F.; da Silva, L.
2003-08-01
Estrelas de bário são gigantes vermelhas de tipo GK que apresentam excessos atmosféricos dos elementos do processo-s. Tais excessos são esperados em estrelas na fase de pulsos térmicos do AGB (TP-AGB). As estrelas de bário são, no entanto, menos massivas e menos luminosas que as estrelas do AGB, assim, não poderiam ter se auto-enriquecido. Seu enriquecimento teria origem em uma estrela companheira, inicialmente mais massiva, que evolui pelo TP-AGB, se auto-enriquece com os elementos do processo-s e transfere material contaminado para a atmosfera da atual estrela de bário. A companheira evolui então para anã branca deixando de ser observada diretamente. As estrelas de bário são, portanto, úteis como testes observacionais para teorias de nucleossíntese pelo processo-s, convecção e perda de massa. Análises detalhadas de abundância com dados de alta qualidade para estes objetos são ainda escassas na literatura. Neste trabalho construímos modelos de atmosferas e, procedendo a uma análise diferencial, determinamos parâmetros atmosféricos e evolutivos de uma amostra de dez gigantes de bário e quatro normais. Determinamos seus padrões de abundância para Na, Mg, Al, Si, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Sr, Y, Zr, Ba, La, Ce, Nd, Sm, Eu e Gd, concluindo que algumas estrelas classificadas na literatura como gigantes de bário são na verdade gigantes normais. Comparamos dois padrões médios de abundância, para estrelas com grandes excessos e estrelas com excessos moderados, com modelos teóricos de enriquecimento pelo processo-s. Os dois grupos de estrelas são ajustados pelos mesmos parâmetros de exposição de nêutrons. Tal resultado sugere que a ocorrência do fenômeno de bário com diferentes intensidades não se deve a diferentes exposições de nêutrons. Discutimos ainda efeitos nucleossintéticos, ligados ao processo-s, sugeridos na literatura para os elementos Cu, Mn, V e Sc.
BSSDATA - um programa otimizado para filtragem de dados em radioastronomia solar
NASA Astrophysics Data System (ADS)
Martinon, A. R. F.; Sawant, H. S.; Fernandes, F. C. R.; Stephany, S.; Preto, A. J.; Dobrowolski, K. M.
2003-08-01
A partir de 1998, entrou em operação regular no INPE, em São José dos Campos, o Brazilian Solar Spectroscope (BSS). O BSS é dedicado às observações de explosões solares decimétricas com alta resolução temporal e espectral, com a principal finalidade de investigar fenômenos associados com a liberação de energia dos "flares" solares. Entre os anos de 1999 e 2002, foram catalogadas, aproximadamente 340 explosões solares classificadas em 8 tipos distintos, de acordo com suas características morfológicas. Na análise detalhada de cada tipo, ou grupo, de explosões solares deve-se considerar a variação do fluxo do sol calmo ("background"), em função da freqüência e a variação temporal, além da complexidade das explosões e estruturas finas registradas superpostas ao fundo variável. Com o intuito de realizar tal análise foi desenvolvido o programa BSSData. Este programa, desenvolvido em linguagem C++, é constituído de várias ferramentas que auxiliam no tratamento e análise dos dados registrados pelo BSS. Neste trabalho iremos abordar as ferramentas referentes à filtragem do ruído de fundo. As rotinas do BSSData para filtragem de ruído foram testadas nos diversos grupos de explosões solares ("dots", "fibra", "lace", "patch", "spikes", "tipo III" e "zebra") alcançando um bom resultado na diminuição do ruído de fundo e obtendo, em conseqüência, dados onde o sinal torna-se mais homogêneo ressaltando as áreas onde existem explosões solares e tornando mais precisas as determinações dos parâmetros observacionais de cada explosão. Estes resultados serão apresentados e discutidos.
Parallelization of MODFLOW using a GPU library.
Ji, Xiaohui; Li, Dandan; Cheng, Tangpei; Wang, Xu-Sheng; Wang, Qun
2014-01-01
A new method based on a graphics processing unit (GPU) library is proposed in the paper to parallelize MODFLOW. Two programs, GetAb_CG and CG_GPU, have been developed to reorganize the equations in MODFLOW and solve them with the GPU library. Experimental tests using the NVIDIA Tesla C1060 show that a 1.6- to 10.6-fold speedup can be achieved for models with more than 10(5) cells. The efficiency can be further improved by using up-to-date GPU devices.
Image compression based on GPU encoding
NASA Astrophysics Data System (ADS)
Bai, Zhaofeng; Qiu, Yuehong
2015-07-01
With the rapid development of digital technology, the data increased greatly in both static image and dynamic video image. It is noticeable how to decrease the redundant data in order to save or transmit information more efficiently. So the research on image compression becomes more and more important. Using GPU to achieve higher compression ratio has superiority in interactive remote visualization. Contrast to CPU, GPU may be a good way to accelerate the image compression. Currently, GPU of NIVIDIA has evolved into the eighth generation, which increasingly dominates the high-powered general purpose computer field. This paper explains the way of GPU encoding image. Some experiment results are also presented.
NASA Astrophysics Data System (ADS)
Weigel, Martin
2011-09-01
Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
Cosmological calculations on the GPU
NASA Astrophysics Data System (ADS)
Bard, D.; Bellis, M.; Allen, M. T.; Yepremyan, H.; Kratochvil, J. M.
2013-02-01
Cosmological measurements require the calculation of nontrivial quantities over large datasets. The next generation of survey telescopes will yield measurements of billions of galaxies. The scale of these datasets, and the nature of the calculations involved, make cosmological calculations ideal models for implementation on graphics processing units (GPUs). We consider two cosmological calculations, the two-point angular correlation function and the aperture mass statistic, and aim to improve the calculation time by constructing code for calculating them on the GPU. Using CUDA, we implement the two algorithms on the GPU and compare the calculation speeds to comparable code run on the CPU. We obtain a code speed-up of between 10 and 180× faster, compared to performing the same calculation on the CPU. The code has been made publicly available. GPUs are a useful tool for cosmological calculations, even for datasets the size of current surveys, allowing calculations to be made one or two orders of magnitude faster.
GPU Accelerated Vector Median Filter
NASA Technical Reports Server (NTRS)
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is vector median filter in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n(sup 2) vectors has to be compared with other n(sup 2) - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work. NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering. which has to the best of our knowledge never been done before. The performance of GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes, Initial findings of the study showed 100x improvement of performance of vector median filter implementation on GPUs over CPU implementations and further speed-up is expected after more extensive optimizations of the GPU algorithm .
SIFT implementation based on GPU
NASA Astrophysics Data System (ADS)
Jiang, Chao; Geng, Ze-xun; Wei, Xiao-feng; Shen, Chen
2013-08-01
Abstract—Image matching is the core research topics of digital photogrammetry and computer vision. SIFT(Scale-Invariant Feature Transform) algorithm is a feature matching algorithm based on local invariant features which is proposed by Lowe at 1999, SIFT features are invariant to image rotation and scaling, even partially invariant to change in 3D camera viewpoint and illumination. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. So the algorithm has a widely used in image matching and 3D reconstruction based on stereo image. Traditional SIFT algorithm's implementation and optimization are generally for CPU. Due to the large numbers of extracted features(even if only several objects can also extract large numbers of SIFT feature), high-dimensional of the feature vector(usually a 128-dimensional SIFT feature vector), and the complexity for the SIFT algorithm, therefore the SIFT algorithm on the CPU processing speed is slow, hard to fulfil the real-time requirements. Programmable Graphic Process United(PGPU) is commonly used by the current computer graphics as a dedicated device for image processing. The development experience of recent years shows that a high-performance GPU, which can be achieved 10 times single-precision floating-point processing performanceone compared with the same time of a high-performance desktop CPU, simultaneity the GPU's memory bandwidth is up to five times compared with the same period desktop platform. Provide the same computing power, the GPU's cost and power consumption should be less than the CPU-based system. At the same time, due to the parallel nature of graphics rendering and image processing, so GPU-accelerated image processing become to an efficient solution for some algorithm which have requirements for real-time. In this paper, we realized the algorithm by OpenGL shader language and compare to the results which realized by CPU
The experience of GPU calculations at Lunarc
NASA Astrophysics Data System (ADS)
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever increasing demand for computational speed and use of ever larger datasets, multi GPU instal- lations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually lead to the procurement of the GPU-resource Timaeus, which is a four-node eight-GPU cluster with two Nvidia m2050 GPU-cards per node. The resource is housed within the larger cluster Platon and share disk-, network- and system resources with that cluster. The inaugu- ration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised of a two-day workshop on GPU-computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Lule in fields ranging from Astrophysics to Molecular Chemistry. We are investigating the use of GPUs with commercial software packages and user supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU-computing. It is clear that the combination of GPUs/CPUs is becoming an important part of high performance computing and here we will describe what has been done at Lunarc regarding GPU-computations and how we will continue to investigate the new and coming multi-GPU servers and how they can be utilized in our environment.
Enhancing professionalism at GPU Nuclear
Coe, R.P. )
1992-01-01
Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing professionalism at its Oyster Creek and Three Mile Island nuclear generating stations. The program was also to include its corporate headquarters in Parsippany, New Jersey. The overall program was to take several directions, including on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for senior reactor operators, and expanded teamwork and leadership training for control room crew. The largest portion of this initiative was the development and delivery of professionalism training to the nearly 2,000 people at both nuclear generating sites.
GPU Developments for General Circulation Models
NASA Astrophysics Data System (ADS)
Appleyard, Jeremy; Posey, Stan; Ponder, Carl; Eaton, Joe
2014-05-01
Current trends in high performance computing (HPC) are moving towards the use of graphics processing units (GPUs) to achieve speedups through the extraction of fine-grain parallelism of application software. GPUs have been developed exclusively for computational tasks as massively-parallel co-processors to the CPU, and during 2013 an extensive set of new HPC architectural features were developed in a 4th generation of NVIDIA GPUs that provide further opportunities for GPU acceleration of general circulation models used in climate science and numerical weather prediction. Today computational efficiency and simulation turnaround time continue to be important factors behind scientific decisions to develop models at higher resolutions and deploy increased use of ensembles. This presentation will examine the current state of GPU parallel developments for stencil based numerical operations typical of dynamical cores, and introduce new GPU-based implicit iterative schemes with GPU parallel preconditioning and linear solvers based on ILU, Krylov methods, and multigrid. Several GCMs show substantial gain in parallel efficiency from second-level fine-grain parallelism under first-level distributed memory parallel through a hybrid parallel implementation. Examples are provided relevant to science-scale HPC practice of CPU-GPU system configurations based on model resolution requirements of a particular simulation. Performance results compare use of the latest conventional CPUs with and without GPU acceleration. Finally a forward looking discussion is provided on the roadmap of GPU hardware, software, tools, and programmability for GCM development.
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
GPU computing for systems biology.
Dematté, Lorenzo; Prandi, Davide
2010-05-01
The development of detailed, coherent, models of complex biological systems is recognized as a key requirement for integrating the increasing amount of experimental data. In addition, in-silico simulation of bio-chemical models provides an easy way to test different experimental conditions, helping in the discovery of the dynamics that regulate biological systems. However, the computational power required by these simulations often exceeds that available on common desktop computers and thus expensive high performance computing solutions are required. An emerging alternative is represented by general-purpose scientific computing on graphics processing units (GPGPU), which offers the power of a small computer cluster at a cost of approximately $400. Computing with a GPU requires the development of specific algorithms, since the programming paradigm substantially differs from traditional CPU-based computing. In this paper, we review some recent efforts in exploiting the processing power of GPUs for the simulation of biological systems.
CULA: hybrid GPU accelerated linear algebra routines
NASA Astrophysics Data System (ADS)
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N3) operations and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via hybrid a processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
MIGS-GPU: Microarray Image Gridding and Segmentation on the GPU.
Katsigiannis, Stamos; Zacharia, Eleni; Maroulis, Dimitris
2016-03-03
cDNA microarray is a powerful tool for simultaneously studying the expression level of thousands of genes. Nevertheless, the analysis of microarray images remains an arduous and challenging task due to the poor quality of the images which often suffer from noise, artifacts, and uneven background. In this work, the MIGS-GPU (Microarray Image Gridding and Segmentation on GPU) software for gridding and segmenting microarray images is presented. MIGS-GPU's computations are performed on the graphics processing unit (GPU) by means of the CUDA architecture in order to achieve fast performance and increase the utilization of available system resources. Evaluation on both real and synthetic cDNA microarray images showed that MIGS-GPU provides better performance than state-of-the-art alternatives, while the proposed GPU implementation achieves significantly lower computational times compared to the respective CPU approaches. Consequently, MIGS-GPU can be an advantageous and useful tool for biomedical laboratories, offering a userfriendly interface that requires minimum input in order to run.
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-07
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using compute unified device architecture-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronically. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with ones by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameters differences less than 10(-6). The method will be useful for generating liver perfusion images in clinical settings.
GAMER: GPU-accelerated Adaptive MEsh Refinement code
NASA Astrophysics Data System (ADS)
Schive, Hsi-Yu; Tsai, Yu-Chih; Chiueh, Tzihong
2016-12-01
GAMER (GPU-accelerated Adaptive MEsh Refinement) serves as a general-purpose adaptive mesh refinement + GPU framework and solves hydrodynamics with self-gravity. The code supports adaptive mesh refinement (AMR), hydrodynamics with self-gravity, and a variety of GPU-accelerated hydrodynamic and Poisson solvers. It also supports hybrid OpenMP/MPI/GPU parallelization, concurrent CPU/GPU execution for performance optimization, and Hilbert space-filling curve for load balance. Although the code is designed for simulating galaxy formation, it can be easily modified to solve a variety of applications with different governing equations. All optimization strategies implemented in the code can be inherited straightforwardly.
Colloquium: Large scale simulations on GPU clusters
NASA Astrophysics Data System (ADS)
Bernaschi, Massimo; Bisson, Mauro; Fatica, Massimiliano
2015-06-01
Graphics processing units (GPU) are currently used as a cost-effective platform for computer simulations and big-data processing. Large scale applications require that multiple GPUs work together but the efficiency obtained with cluster of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve an excellent efficiency for applications in statistical mechanics, particle dynamics and networks analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may be applied also to other problems like the solution of Partial Differential Equations.
Using GPU shaders for visualization, part 3.
Bailey, Mike
2013-01-01
GPU shaders aren't just for glossy special effects. Parts 1 and 2 of this discussion looked at using them for point clouds, cutting planes, line integral convolution, and terrain bump-mapping. Part 3 covers compute shaders and shader storage buffer objects-two features announced as part of OpenGL 4.3.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU.
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis.
Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU
Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan
2013-01-01
This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using the synthetic data. The experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation does at the group level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screen tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
Accelerated GPU based SPECT Monte Carlo simulations
NASA Astrophysics Data System (ADS)
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99m Tc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-07
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: (99m) Tc, (111)In and (131)I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
GPU-based fast gamma index calculation
NASA Astrophysics Data System (ADS)
Gu, Xuejun; Jia, Xun; Jiang, Steve B.
2011-03-01
The γ-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of γ-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computational intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method (Ju et al 2008 Med. Phys. 35 879-87) with a radial pre-sorting technique (Wendling et al 2007 Med. Phys. 34 1647-54) and implement them on computer graphics processing units (GPUs). The developed GPU-based γ-index computational tool is evaluated on eight pairs of IMRT dose distributions. The γ-index calculations can be finished within a few seconds for all 3D testing cases on one single NVIDIA Tesla C1060 card, achieving 45-75× speedup compared to CPU computations conducted on an Intel Xeon 2.27 GHz processor. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speeds up the GPU calculation by about 2.7-5.5 times. For n-dimensional dose distributions, γ-index calculation time on CPU is proportional to the summation of γn over all voxels, while that on GPU is affected by γn distributions and is approximately proportional to the γn summation over all voxels. We found that increasing the resolution of dose distributions leads to a quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference and distance-to-agreement criteria also have an impact on γ-index calculation time.
Modelos Teoricos de Linhas de Recombinacao EM Radio Frequencias Para Regioes H II
NASA Astrophysics Data System (ADS)
Abraham, Z.; Cancoro, A. C. O.
1987-05-01
Foram feitos modelos de linhas de recombinção provenientes de regiões HII nas frequências de rádio para distintos números quãnticos. Estes modelos consideram regrões H II esfericamente simétricas com variações radiais na densidade e temperatura eletrônica, efeitos de colisoes inelásticas dos eletrons (alargarnento por pressão), e afastarnento do equiliíbrio termodinâmico local. 0 bojetivo é construir o perfil da linha para cada ponto da nuvern e obter o valor médio resultante da sua convoluçã com o feixe da antena de tarnanho comparável corn o tarnanho angular da nuvern para posterIor cornpara o corn
Gpu Implementation of Preconditioning Method for Low-Speed Flows
NASA Astrophysics Data System (ADS)
Zhang, Jiale; Chen, Hongquan
2016-06-01
An improved preconditioning method for low-Mach-number flows is implemented on a GPU platform. The improved preconditioning method employs the fluctuation of the fluid variables to weaken the influence of accuracy caused by the truncation error. The GPU parallel computing platform is implemented to accelerate the calculations. Both details concerning the improved preconditioning method and the GPU implementation technology are described in this paper. Then a set of typical low-speed flow cases are simulated for both validation and performance analysis of the resulting GPU solver. Numerical results show that dozens of times speedup relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that the GPU desktop can serve as a cost-effective parallel computing platform to accelerate CFD simulations for low-Speed flows substantially.
GPU accelerated chemical similarity calculation for compound library comparison.
Ma, Chao; Wang, Lirong; Xie, Xiang-Qun
2011-07-25
Chemical similarity calculation plays an important role in compound library design, virtual screening, and "lead" optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multicore GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times superior to the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than existing GPU-accelerated sparse vector algorithm, when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 min to complete the calculation of Tanimoto coefficients between 32 M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU.
GPU implementation of simultaneous iterative reconstruction techniques for computed tomograpy
NASA Astrophysics Data System (ADS)
Xin, Junjun; Bardel, Chuck; Udpa, Lalita; Udpa, Satish
2013-01-01
This paper presents implementation of simultaneous iteration reconstruction techniques on GPU with parallel computing languages using CUDA and its intrinsic libraries on four different Graphic Processing (GPU) cards. GPUs are highly parallel computing structures that enable acceleration of scientific and engineering computations. The GPU implementations offer significant performance improvement in reconstruction times. Initial results on the Shepp-Logan phantom of size ranging from 16×16 to 256×256 pixels are presented.
Fast box-counting algorithm on GPU.
Jiménez, J; Ruiz de Miras, J
2012-12-01
The box-counting algorithm is one of the most widely used methods for calculating the fractal dimension (FD). The FD has many image analysis applications in the biomedical field, where it has been used extensively to characterize a wide range of medical signals. However, computing the FD for large images, especially in 3D, is a time consuming process. In this paper we present a fast parallel version of the box-counting algorithm, which has been coded in CUDA for execution on the Graphic Processing Unit (GPU). The optimized GPU implementation achieved an average speedup of 28 times (28×) compared to a mono-threaded CPU implementation, and an average speedup of 7 times (7×) compared to a multi-threaded CPU implementation. The performance of our improved box-counting algorithm has been tested with 3D models with different complexity, features and sizes. The validity and accuracy of the algorithm has been confirmed using models with well-known FD values. As a case study, a 3D FD analysis of several brain tissues has been performed using our GPU box-counting algorithm.
GPU-based video motion magnification
NASA Astrophysics Data System (ADS)
DomŻał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people see otherwise not visible subtle changes in surrounding world. VMM is also capable of hiding them with a modified version of the algorithm. It is possible to magnify motion related to breathing of patients in hospital to observe it or extinguish it and extract other information from stabilized image sequence for example blood flow. In both cases we would like to perform calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In the article we suggest that VMM algorithm can be parallelized (each thread processes one pixel) and in order to prove that we implemented the algorithm on GPU using CUDA technology. CPU is used only to grab, write, display frame and schedule work for GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. In this work we presented approach that achieves a significant speedup over existing methods and allow to VMM process video in real-time. This solution can be used as preprocessing for other algorithms in more complex systems or can find application wherever real time motion magnification would be useful. It is worth to mention that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
GPU-based High-Performance Computing for Radiation Therapy
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B.
2014-01-01
Recent developments in radiotherapy therapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. Graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past a few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this article, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. PMID:24486639
GPU-based high-performance computing for radiation therapy.
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B
2014-02-21
Recent developments in radiotherapy therapy demand high computation powers to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of study has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented.
Evaluating the power of GPU acceleration for IDW interpolation algorithm.
Mei, Gang
2014-01-01
We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm, the tiled version that takes advantage of shared memory and the CDP version that is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for IDW interpolation algorithm by comparing the performance of CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental results show that the tilted version has the speedups of 120x and 670x over the CPU version when the power parameter p is set to 2 and 3.0, respectively. In addition, compared to the naive GPU implementation, the tiled version is about two times faster. However, the CDP version is 4.8x ∼ 6.0x slower than the naive GPU version, and therefore does not have any potential advantages in practical applications.
GPU accelerated implementation of NCI calculations using promolecular density.
Rubez, Gaëtan; Etancelin, Jean-Matthieu; Vigouroux, Xavier; Krajecki, Michael; Boisson, Jean-Charles; Hénon, Eric
2017-03-25
The NCI approach is a modern tool to reveal chemical noncovalent interactions. It is particularly attractive to describe ligand-protein binding. A custom implementation for NCI using promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The code performances of three versions are examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which reduces drastically the computational time. On a single compute node, the dual-GPU version leads to a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
GPU-Accelerated Adjoint Algorithmic Differentiation.
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-accelerated adjoint algorithmic differentiation
NASA Astrophysics Data System (ADS)
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
GPU-Accelerated Adjoint Algorithmic Differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2015-01-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the “tape”. Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography
Using GPU shaders for visualization, part 2.
Bailey, M
2011-01-01
GPU shaders aren't just for special effects. Previously, I looked at some uses for them in visualization. Here, the idea continues. Because visualization relies so much on high speed interaction, we use shaders for the same reason we use them in effects programming: appearance and performance. In the drive to understand large, complex data sets, no method should be overlooked. This article describes two additional visualization applications: line integral convolution (LIC) and terrain bump-mapping. I also comment on the recent (and rapid) changes to OpenGL and what these mean to educators.
GPU-completeness: theory and implications
NASA Astrophysics Data System (ADS)
Lin, I.-Jong
2011-01-01
This paper formalizes a major insight into a class of algorithms that relate parallelism and performance. The purpose of this paper is to define a class of algorithms that trades off parallelism for quality of result (e.g. visual quality, compression rate), and we propose a similar method for algorithmic classification based on NP-Completeness techniques, applied toward parallel acceleration. We will define this class of algorithm as "GPU-Complete" and will postulate the necessary properties of the algorithms for admission into this class. We will also formally relate his algorithmic space and imaging algorithms space. This concept is based upon our experience in the print production area where GPUs (Graphic Processing Units) have shown a substantial cost/performance advantage within the context of HPdelivered enterprise services and commercial printing infrastructure. While CPUs and GPUs are converging in their underlying hardware and functional blocks, their system behaviors are clearly distinct in many ways: memory system design, programming paradigms, and massively parallel SIMD architecture. There are applications that are clearly suited to each architecture: for CPU: language compilation, word processing, operating systems, and other applications that are highly sequential in nature; for GPU: video rendering, particle simulation, pixel color conversion, and other problems clearly amenable to massive parallelization. While GPUs establishing themselves as a second, distinct computing architecture from CPUs, their end-to-end system cost/performance advantage in certain parts of computation inform the structure of algorithms and their efficient parallel implementations. While GPUs are merely one type of architecture for parallelization, we show that their introduction into the design space of printing systems demonstrate the trade-offs against competing multi-core, FPGA, and ASIC architectures. While each architecture has its own optimal application, we believe
CFD Computations on Multi-GPU Configurations.
NASA Astrophysics Data System (ADS)
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations -- often delivering a speed-up factor between 3 to 5 times over conventional CPUs. In recent times, most PCs are supplied with the option of installing multiple GPUs on a single motherboard, thereby providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a set up which is configured to run two GPUs in parallel, and discuss its performance details.
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)(3)), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)(3) Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)(3) (aMCMCMC) for MrBayes (MC)(3) on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)(3) achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)(3) is dramatically faster than all the previous (MC)(3) algorithms and scales well to large GPU clusters.
GPU Optimizations for a Production Molecular Docking Code.
Landaverde, Raphael; Herbordt, Martin C
2014-09-01
Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4 core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relalative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required an entire rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
Architecting the Finite Element Method Pipeline for the GPU
Fu, Zhisong; Lewis, T. James; Kirby, Robert M.
2014-01-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers. PMID:25202164
Performance evaluation of image processing algorithms on the GPU.
Castaño-Díez, Daniel; Moser, Dominik; Schoenegger, Andreas; Pruggnaller, Sabine; Frangakis, Achilleas S
2008-10-01
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. In the meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are dependent on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code in the GPU achieves typical acceleration values in the order of 10-20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of the algorithm. The gained speed-up comes with no additional costs, since the software runs on the GPU of the graphics card of common workstations.
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
Efficient Implementation of MrBayes on Multi-GPU
Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-01-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)3), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)3 Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)3 (aMCMCMC) for MrBayes (MC)3 on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new “node-by-node” task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)3 achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)3 is dramatically faster than all the previous (MC)3 algorithms and scales well to large GPU clusters. PMID:23493260
GPU accelerated curve fitting with IDL
NASA Astrophysics Data System (ADS)
Galloy, M.
2012-12-01
Curve fitting is a common mathematical calculation done in all scientific areas. The Interactive Data Language (IDL) is also widely used in this community for data analysis and visualization. We are creating a general-purpose, GPU accelerated curve fitting library for use from within IDL. We have developed GPULib, a library of routines in IDL for accelerating common scientific operations including arithmetic, FFTs, interpolation, and others. These routines are accelerated using modern GPUs using NVIDIA's CUDA architecture. We will add curve fitting routines to the GPULib library suite, making curve fitting much faster. In addition, library routines required for efficient curve fitting will also be generally useful to other users of GPULib. In particular, a GPU accelerated LAPACK implementation such as MAGMA is required for the Levenberg-Marquardt curve fitting and is commonly used in many other scientific computations. Furthermore, the ability to evaluate custom expressions at runtime necessary for specifying a function model will be useful for users in all areas.
Parallel hyperspectral compressive sensing method on GPU
NASA Astrophysics Data System (ADS)
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight usually with low spatial resolution. It is known that the bandwidth connection between the satellite/airborne platform and the ground station is reduced, thus a compression onboard method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of an compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral dataset, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA: GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.
Parallelization and checkpointing of GPU applications through program transformation
Solano-Quinde, Lizandro Damian
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
GPU-Accelerated Denoising in 3D (GD3D)
2013-10-01
The raw computational power GPU Accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. This software addresses two facets of this promising application: what tuning is necessary to achieve optimal performance on a modern GPU? And what parameters yield the best denoising results in practice? To answer the first question, the software performs an autotuning step to empirically determine optimal memory blocking on the GPU. To answer the second, it performs a sweep of algorithm parameters to determine the combination that best reduces the mean squared error relative to a noiseless reference image.
GPU real-time processing in NA62 trigger system
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-01-01
A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition systems) of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using a FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstrunction obtained during the NA62 physics run will be presented.
A GPU algorithm for minimum vertex cover problems
NASA Astrophysics Data System (ADS)
Toume, Kouta; Kinjo, Daiki; Nakamura, Morikazu
2014-10-01
The minimum vertex cover problem is one of the fundamental problems in graph theory and is known to be NP-hard. For data mining in large-scale structured systems, we proposes a GPU algorithm for the minimum vertex cover problem. The algorithm is designed to derive sufficient parallelism of the problem for the GPU architecture and also to arrange data on the device memory for efficient coalesced accessing. Through the experimental evaluation, we demonstrate that our GPU algorithm is quite faster than CPU programs and the speedup becomes much evident when the graph size is enlarged.
GPU accelerated kinetic solvers for rarefied gas dynamics
NASA Astrophysics Data System (ADS)
Zabelok, Sergey A.; Kolobov, Vladimir I.; Arslanbekov, Robert R.
2012-11-01
GPU-acceleration is applied to the Boltzmann solver with adaptive Cartesian mesh in the Unified Flow Solver framework. NVIDIA CUDA technology is used with threads being grouped in thread blocks by points of Korobov sequences in each cell for computing the collision integral and by points in coordinate space for the free-molecular flow stage. GPU-accelerated Boltzmann solver with octree Cartesian mesh has been tested on several computer systems. Speedup of several times for GPU-based code compared to single-core CPU computations on the same machines has been observed.
NASA Astrophysics Data System (ADS)
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 12003 grid points using 216 GPUs.
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data the computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with the big biology data, it is hard to rely on single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multi-GPUs. The experimental results present that the proposed method can improve the performance of BLASTP on single GPU, and also it can achieve high availability and fault tolerance.
Acceleration of a QM/MM-QMC simulation using GPU.
Uejima, Yutaka; Terashima, Tomoharu; Maezono, Ryo
2011-07-30
We accelerated an ab initio molecular QMC calculation by using GPGPU. Only the bottle-neck part of the calculation is replaced by CUDA subroutine and performed on GPU. The performance on a (single core CPU + GPU) is compared with that on a (single core CPU with double precision), getting 23.6 (11.0) times faster calculations in single (double) precision treatments on GPU. The energy deviation caused by the single precision treatment was found to be within the accuracy required in the calculation, ∼10(-5) hartree. The accelerated computational nodes mounting GPU are combined to form a hybrid MPI cluster on which we confirmed the performance linearly scales to the number of nodes.
GPU accelerated numerical simulations of viscoelastic phase separation model.
Yang, Keda; Su, Jiaye; Guo, Hongxia
2012-07-05
We introduce a complete implementation of viscoelastic model for numerical simulations of the phase separation kinetics in dynamic asymmetry systems such as polymer blends and polymer solutions on a graphics processing unit (GPU) by CUDA language and discuss algorithms and optimizations in details. From studies of a polymer solution, we show that the GPU-based implementation can predict correctly the accepted results and provide about 190 times speedup over a single central processing unit (CPU). Further accuracy analysis demonstrates that both the single and the double precision calculations on the GPU are sufficient to produce high-quality results in numerical simulations of viscoelastic model. Therefore, the GPU-based viscoelastic model is very promising for studying many phase separation processes of experimental and theoretical interests that often take place on the large length and time scales and are not easily addressed by a conventional implementation running on a single CPU.
The Coming Role of GPU in Computational Geodynamics (Invited)
NASA Astrophysics Data System (ADS)
Yuen, D. A.; Knepley, M. G.; Erlebacher, G.; Wright, G. B.
2009-12-01
With the proliferation of GPU ( graphics accelerator board) the computing landscape has changed enormously in the last 3 years. The new additional capabilities of the GPU , such as larger shared memories and load-store operations , allow it to be considered as a viable stand-alone computational and visualization engine. Today the massive threading and computing capability of GPU can deliver at least an order of magnitude performance over the multi-core CPU architecture. The cost of a GPU system is also considerably cheaper than a CPU cluster by more than an order of magnitude.The introduction of CUDA and ancillary software aids, such as Jackets, have allowed the rapid translation of many venerable codes into software usable on GPU. We will discuss our experience acquired over the past year in attacking five different computational problems in the geosciences, using the GPU. They include (1.) 3-D seismic wave propagation with the spectral-element method (2.)2-D shallow water equation as applied to tsunami wave propagation, using finite-differences (3.) 3-D mantle convection with constant viscosity using a 4th order compact finite-difference operator (4.) non-linear heat-diffusion equation in 2-D using a collocation method based on radial basis functions over an elliptical area . Grid points are divided so as to lie on a centroidal Voronoi mesh . Derivatives are calculated at each grid point using a point-dependent stencil computed from the nearest neighbors .(5.) Stokes flow with variable viscosity by means of a pre-conditioner calculated on the GPU based on the vortex method using Green’s functions, along with the radial basis functions and the fast multi-pole method. The Krylov method is used on the CPU for the final iterative step .We will discuss the relative speed-ups of the GPU over the CPU in each of these cases. We will point out the need to go to more computationally intensive mode with multiple GPUs, which calls for key CPUs to control the message
Simulating Spin Models on Gpu: a Tour
NASA Astrophysics Data System (ADS)
Weigel, Martin
2012-08-01
The use of graphics processing units (GPUs) in scientific computing has gathered considerable momentum in the past five years. While GPUs in general promise high performance and excellent performance per Watt ratios, not every class of problems is equally well suitable for exploiting the massively parallel architecture they provide. Lattice spin models appear to be prototypic examples of problems suitable for this architecture, at least as long as local update algorithms are employed. In this review, I summarize our recent experience with the simulation of a wide range of spin models on GPU employing an equally wide range of update algorithms, ranging from Metropolis and heat bath updates, over cluster algorithms to generalized ensemble simulations.
Efficient GPU implementation for Particle in Cell algorithm
Joseph, Rejith George; Ravunnikutty, Girish; Ranka, Sanjay; Klasky, Scott A
2011-01-01
Particle in cell method is widely used method in the plasma physics to study the trajectories of charged particles under electromagnetic fields. The PIC algorithm is computationally intensive and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of the paper is to parallelize the PIC algorithm on Graphics Processing Unit (GPU). We present several performance tradeoffs related to the small shared memory and atomic operations on the GPU to achieve high performance.
Advantages of GPU technology in DFT calculations of intercalated graphene
NASA Astrophysics Data System (ADS)
Pešić, J.; Gajić, R.
2014-09-01
Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. A level of the acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FTDT) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications had emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
gEMfitter: a highly parallel FFT-based 3D density fitting tool with GPU texture memory acceleration.
Hoang, Thai V; Cavin, Xavier; Ritchie, David W
2013-11-01
Fitting high resolution protein structures into low resolution cryo-electron microscopy (cryo-EM) density maps is an important technique for modeling the atomic structures of very large macromolecular assemblies. This article presents "gEMfitter", a highly parallel fast Fourier transform (FFT) EM density fitting program which can exploit the special hardware properties of modern graphics processor units (GPUs) to accelerate both the translational and rotational parts of the correlation search. In particular, by using the GPU's special texture memory hardware to rotate 3D voxel grids, the cost of rotating large 3D density maps is almost completely eliminated. Compared to performing 3D correlations on one core of a contemporary central processor unit (CPU), running gEMfitter on a modern GPU gives up to 26-fold speed-up. Furthermore, using our parallel processing framework, this speed-up increases linearly with the number of CPUs or GPUs used. Thus, it is now possible to use routinely more robust but more expensive 3D correlation techniques. When tested on low resolution experimental cryo-EM data for the GroEL-GroES complex, we demonstrate the satisfactory fitting results that may be achieved by using a locally normalised cross-correlation with a Laplacian pre-filter, while still being up to three orders of magnitude faster than the well-known COLORES program.
A survey of CPU-GPU heterogeneous computing techniques
Mittal, Sparsh; Vetter, Jeffrey S.
2015-07-04
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and application level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
High Performance GPU-Based Fourier Volume Rendering
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its 𝒪(N2logN) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are 𝒪(N3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures. PMID:25866499
Finite Difference Elastic Wave Field Simulation On GPU
NASA Astrophysics Data System (ADS)
Hu, Y.; Zhang, W.
2011-12-01
Numerical modeling of seismic wave propagation is considered as a basic and important aspect in investigation of the Earth's structure, and earthquake phenomenon. Among various numerical methods, the finite-difference method is considered one of the most efficient tools for the wave field simulation. However, with the increment of computing scale, the power of computing has becoming a bottleneck. With the development of hardware, in recent years, GPU shows powerful computational ability and bright application prospects in scientific computing. Many works using GPU demonstrate that GPU is powerful . Recently, GPU has not be used widely in the simulation of wave field. In this work, we present forward finite difference simulation of acoustic and elastic seismic wave propagation in heterogeneous media on NVIDIA graphics cards with the CUDA programming language. We also implement perfectly matched layers on the graphics cards to efficiently absorb outgoing waves on the fictitious edges of the grid Simulations compared with the results on CPU platform shows reliable accuracy and remarkable efficiency. This work proves that GPU can be an effective platform for wave field simulation, and it can also be used as a practical tool for real-time strong ground motion simulation.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU.
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Xia, Yong; Wang, Kuanquan; Zhang, Henggui
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge to the traditional computation resources based on CPU environment, which already cannot meet the requirement of the whole computation demands or are not easily available due to expensive costs. GPU as a parallel computing environment therefore provides an alternative to solve the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations. PMID:26581957
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its (N (2)logN) time complexity, it provides a faster alternative to spatial domain volume rendering algorithms that are (N (3)) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar-basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
A survey of CPU-GPU heterogeneous computing techniques
Mittal, Sparsh; Vetter, Jeffrey S.
2015-07-04
As both CPU and GPU become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs) such as workload-partitioning which enable utilizing both CPU and GPU to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler and applicationmore » level. Further, we review both discrete and fused CPU-GPU systems; and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). Furthermore, we believe that this paper will provide insights into working and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.« less
GPU Lossless Hyperspectral Data Compression System
NASA Technical Reports Server (NTRS)
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO- 42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
IMPAIR: massively parallel deconvolution on the GPU
NASA Astrophysics Data System (ADS)
Sherry, Michael; Shearer, Andy
2013-02-01
The IMPAIR software is a high throughput image deconvolution tool for processing large out-of-core datasets of images, varying from large images with spatially varying PSFs to large numbers of images with spatially invariant PSFs. IMPAIR implements a parallel version of the tried and tested Richardson-Lucy deconvolution algorithm regularised via a custom wavelet thresholding library. It exploits the inherently parallel nature of the convolution operation to achieve quality results on consumer grade hardware: through the NVIDIA Tesla GPU implementation, the multi-core OpenMP implementation, and the cluster computing MPI implementation of the software. IMPAIR aims to address the problem of parallel processing in both top-down and bottom-up approaches: by managing the input data at the image level, and by managing the execution at the instruction level. These combined techniques will lead to a scalable solution with minimal resource consumption and maximal load balancing. IMPAIR is being developed as both a stand-alone tool for image processing, and as a library which can be embedded into non-parallel code to transparently provide parallel high throughput deconvolution.
GISAXS simulation and analysis on GPU clusters
NASA Astrophysics Data System (ADS)
Chourou, Slim; Sarje, Abhinav; Li, Xiaoye; Chan, Elaine; Hexemer, Alexander
2012-02-01
We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.
Linear Bregman algorithm implemented in parallel GPU
NASA Astrophysics Data System (ADS)
Li, Pengyan; Ke, Jue; Sui, Dong; Wei, Ping
2015-08-01
At present, most compressed sensing (CS) algorithms have poor converging speed, thus are difficult to run on PC. To deal with this issue, we use a parallel GPU, to implement a broadly used compressed sensing algorithm, the Linear Bregman algorithm. Linear iterative Bregman algorithm is a reconstruction algorithm proposed by Osher and Cai. Compared with other CS reconstruction algorithms, the linear Bregman algorithm only involves the vector and matrix multiplication and thresholding operation, and is simpler and more efficient for programming. We use C as a development language and adopt CUDA (Compute Unified Device Architecture) as parallel computing architectures. In this paper, we compared the parallel Bregman algorithm with traditional CPU realized Bregaman algorithm. In addition, we also compared the parallel Bregman algorithm with other CS reconstruction algorithms, such as OMP and TwIST algorithms. Compared with these two algorithms, the result of this paper shows that, the parallel Bregman algorithm needs shorter time, and thus is more convenient for real-time object reconstruction, which is important to people's fast growing demand to information technology.
Adaptive mesh fluid simulations on GPU
NASA Astrophysics Data System (ADS)
Wang, Peng; Abel, Tom; Kaehler, Ralf
2010-10-01
We describe an implementation of compressible inviscid fluid solvers with block-structured adaptive mesh refinement on Graphics Processing Units using NVIDIA's CUDA. We show that a class of high resolution shock capturing schemes can be mapped naturally on this architecture. Using the method of lines approach with the second order total variation diminishing Runge-Kutta time integration scheme, piecewise linear reconstruction, and a Harten-Lax-van Leer Riemann solver, we achieve an overall speedup of approximately 10 times faster execution on one graphics card as compared to a single core on the host computer. We attain this speedup in uniform grid runs as well as in problems with deep AMR hierarchies. Our framework can readily be applied to more general systems of conservation laws and extended to higher order shock capturing schemes. This is shown directly by an implementation of a magneto-hydrodynamic solver and comparing its performance to the pure hydrodynamic case. Finally, we also combined our CUDA parallel scheme with MPI to make the code run on GPU clusters. Close to ideal speedup is observed on up to four GPUs.
NASA Astrophysics Data System (ADS)
Tavares, E. T., Jr.; Klafke, J. C.
2003-08-01
O presente trabalho propõe-se a resgatar uma experiência que teve lugar no Planetário de São Paulo nos anos 60. Em 1962, o Sr. Acácio, então com 37 anos, deficiente visual desde os 27, passou a assistir às aulas ministradas pelo Prof. Aristóteles Orsini aos integrantes do corpo de servidores do Planetário. O Sr. Acácio era o único deficiente da turma e, embora possuísse conhecimentos básicos e relativamente avançados de matemática, enfrentava dificuldades na compreensão e acompanhamento da exposição, como também em estudos posteriores. Com o intuito de auxiliá-lo na superação desses problemas, o Prof. Orsini solicitou a construção de modelos mecânicos que, através do sentido do tato, permitissem o acompanhamento das aulas e a transposição do modelo para o "constructo" mental. Essa prática mostrou-se tão eficaz que facilitou sobejamente o aprendizado da matéria pelo sujeito. O Sr. Acácio passou a integrar o corpo de professores do Planetário/Escola Municipal de Astrofísica, tendo ficado responsável pelo curso de "Introdução à Astronomia" por vários anos. Além disso, a experiência foi tão bem sucedida que alguns dos modelos tiveram seus elementos constitutivos pintados diferencialmente para serem utilizados em cursos regulares do Planetário, tornando-se parte integrante do conjunto de recursos didáticos da instituição. É pensando nessa eficácia, tanto em seu objetivo original permitir o aprendizado de um deficiente visual quanto no subsidiário recurso didático sistemático da instituição que decidimos resgatar essa experiência. Estribados nela, acreditamos ser extremamente produtivo, em termos educacionais, o aperfeiçoamento dos modelos originais, agora resgatados e restaurados, e a criação de outros que pudessem ser utilizados no ensino dessa ciência a deficientes visuais.
Direct numerical simulation of turbulence using GPU accelerated supercomputers
NASA Astrophysics Data System (ADS)
Khajeh-Saeed, Ali; Blair Perot, J.
2013-02-01
Direct numerical simulations of turbulence are optimized for up to 192 graphics processors. The results from two large GPU clusters are compared to the performance of corresponding CPU clusters. A number of important algorithm changes are necessary to access the full computational power of graphics processors and these adaptations are discussed. It is shown that the handling of subdomain communication becomes even more critical when using GPU based supercomputers. The potential for overlap of MPI communication with GPU computation is analyzed and then optimized. Detailed timings reveal that the internal calculations are now so efficient that the operations related to MPI communication are the primary scaling bottleneck at all but the very largest problem sizes that can fit on the hardware. This work gives a glimpse of the CFD performance issues will dominate many hardware platform in the near future.
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
Multicore and GPU Algorithms for Nussinov RNA Folding.
Li, Junjie; Ranka, Sanjay; Sahni, Sartaj
2013-01-01
We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's equations. Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding.
GPU accelerated spectral finite elements on all-hex meshes
NASA Astrophysics Data System (ADS)
Remacle, J.-F.; Gandham, R.; Warburton, T.
2016-11-01
This paper presents a spectral element finite element scheme that efficiently solves elliptic problems on unstructured hexahedral meshes. The discrete equations are solved using a matrix-free preconditioned conjugate gradient algorithm. An additive Schwartz two-scale preconditioner is employed that allows h-independence convergence. An extensible multi-threading programming API is used as a common kernel language that allows runtime selection of different computing devices (GPU and CPU) and different threading interfaces (CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with over 50 million degrees of freedom can be solved in a few seconds on an off-the-shelf GPU.
Determinant Computation on the GPU using the Condensation Method
NASA Astrophysics Data System (ADS)
Anisul Haque, Sardar; Moreno Maza, Marc
2012-02-01
We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerical stable in the floating point number case. In addition, we compare our code with serial implementation of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has a large potential for improving those packages in terms of running time and numerical stability.
Accelerating Pseudo-Random Number Generator for MCNP on GPU
NASA Astrophysics Data System (ADS)
Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu
2010-09-01
Pseudo-random number generators (PRNG) are intensively used in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in Monte Carlo N-Particle Transport Code (MCNP) requires long period, high quality, flexible jump and fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 Graphics Processor Units (GPU) using CUDA programming model. Results shows that 3.80 to 8.10 times speedup are achieved compared with 4 to 6 cores CPUs and more than 679.18 million double precision random numbers can be generated per second on GPU.
Numerical cosmology on the GPU with Enzo and Ramses
NASA Astrophysics Data System (ADS)
Gheller, C.; Wang, P.; Vazza, F.; Teyssier, R.
2015-09-01
A number of scientific numerical codes can currently exploit GPUs with remarkable performance. In astrophysics, Enzo and Ramses are prime examples of such applications. The two codes have been ported to GPUs adopting different strategies and programming models, Enzo adopting CUDA and Ramses using OpenACC. We describe here the different solutions used for the GPU implementation of both cases. Performance benchmarks will be presented for Ramses. The results of the usage of the more mature GPU version of Enzo, adopted for a scientific project within the CHRONOS programme, will be summarised.
Full Stokes glacier model on GPU
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2015-04-01
Two different approaches are commonly used in glacier ice flow modeling: models based on asymptotic approximations of ice physics and full stokes models. Lower order models are computationally lighter but reach their limits in regions of complex flow, while full Stokes models are more exact but computationally expansive. To overcome this constrain, we investigate the potential of GPU acceleration in glacier modeling. The goal of this preliminary research is to develop a three-dimensional full Stokes numerical model and apply it to the glacier flow. We numerically solve the nonlinear Stokes momentum balance equations together with the incompressibility equation. Strong nonlinearities for the ice rheology are also taken into account. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme. We have ported it to C-CUDA to run it on GPUs. Our model is benchmarked against other full Stokes solutions for all diagnostic ISMIP-HOM experiments (Pattyn et al.,2008). The preliminary results show good agreement with the other models. The major advantages of our programming approach are simplicity and order 10-100 times speed-up in comparison to serial CPU version of the code. Future work will include some real world applications and we will implement the free surface evolution capabilities. References: [1] F. Pattyn, L. Perichon, A. Aschwanden, B. Breuer, D.B. Smedt, O. Gagliardini, G.H. Gudmundsson, R.C.A. Hindmarsh, A. Hubbard, J.V. Johnson, T. Kleiner, Y. Konovalov, C. Martin, A.J. Payne, D. Pollard, S. Price, M. Ruckamp, F. Saito, S. Sugiyama, S., and T. Zwinger, Benchmark experiments for higher-order and full-Stokes ice sheet models (ISMIP-HOM), The Cryosphere, 2 (2008), 95-108.
Accelerating DNA analysis applications on GPU clusters
Tumeo, Antonino; Villa, Oreste
2010-06-13
DNA analysis is an emerging application of high performance bioinformatic. Modern sequencing machinery are able to provide, in few hours, large input streams of data which needs to be matched against exponentially growing databases known fragments. The ability to recognize these patterns effectively and fastly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variabilities, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges on current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limit of the underlining hardware, is required to reach the desired high throughputs. Load balancing also plays a crucial role when considering the limited bandwidth among the nodes of these systems. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with GPUs. We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present a MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.
MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing.
Mason, Samuel A; Sayyid, Faiz; Kirk, Paul D W; Starr, Colin; Wild, David L
2016-03-01
The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct--but often complementary--information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm, that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/.
POM.gpu-v1.0: a GPU-based Princeton Ocean Model
NASA Astrophysics Data System (ADS)
Xu, S.; Huang, X.; Oey, L.-Y.; Xu, F.; Fu, H.; Zhang, Y.; Yang, G.
2015-09-01
Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I / O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.
GPU-accelerated denoising of 3D magnetic resonance images
Howison, Mark; Wes Bethel, E.
2014-05-29
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
A survey of GPU-based medical image computing techniques.
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming; Wang, Defeng
2012-09-01
Medical imaging currently plays a crucial role throughout the entire clinical applications from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to process in practical clinical applications. With the rapidly enhancing performances of graphics processors, improved programming support, and excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical image applications. The major purpose of this survey is to provide a comprehensive reference source for the starters or researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely, segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine.
Computing 2D constrained delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges.
Multi-level graph layout on the GPU.
Frishman, Yaniv; Tal, Ayellet
2007-01-01
This paper presents a new algorithm for force directed graph layout on the GPU. The algorithm, whose goal is to compute layouts accurately and quickly, has two contributions. The first contribution is proposing a general multi-level scheme, which is based on spectral partitioning. The second contribution is computing the layout on the GPU. Since the GPU requires a data parallel programming model, the challenge is devising a mapping of a naturally unstructured graph into a well-partitioned structured one. This is done by computing a balanced partitioning of a general graph. This algorithm provides a general multi-level scheme, which has the potential to be used not only for computation on the GPU, but also on emerging multi-core architectures. The algorithm manages to compute high quality layouts of large graphs in a fraction of the time required by existing algorithms of similar quality. An application for visualization of the topologies of ISP (Internet Service Provider) networks is presented.
Optimizing a mobile robot control system using GPU acceleration
NASA Astrophysics Data System (ADS)
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
Cloud GPU-based simulations for SQUAREMR
NASA Astrophysics Data System (ADS)
Kantasis, George; Xanthis, Christos G.; Haris, Kostas; Heiberg, Einar; Aletras, Anthony H.
2017-01-01
Quantitative Magnetic Resonance Imaging (MRI) is a research tool, used more and more in clinical practice, as it provides objective information with respect to the tissues being imaged. Pixel-wise T1 quantification (T1 mapping) of the myocardium is one such application with diagnostic significance. A number of mapping sequences have been developed for myocardial T1 mapping with a wide range in terms of measurement accuracy and precision. Furthermore, measurement results obtained with these pulse sequences are affected by errors introduced by the particular acquisition parameters used. SQUAREMR is a new method which has the potential of improving the accuracy of these mapping sequences through the use of massively parallel simulations on Graphical Processing Units (GPUs) by taking into account different acquisition parameter sets. This method has been shown to be effective in myocardial T1 mapping; however, execution times may exceed 30 min which is prohibitively long for clinical applications. The purpose of this study was to accelerate the construction of SQUAREMR's multi-parametric database to more clinically acceptable levels. The aim of this study was to develop a cloud-based cluster in order to distribute the computational load to several GPU-enabled nodes and accelerate SQUAREMR. This would accommodate high demands for computational resources without the need for major upfront equipment investment. Moreover, the parameter space explored by the simulations was optimized in order to reduce the computational load without compromising the T1 estimates compared to a non-optimized parameter space approach. A cloud-based cluster with 16 nodes resulted in a speedup of up to 13.5 times compared to a single-node execution. Finally, the optimized parameter set approach allowed for an execution time of 28 s using the 16-node cluster, without compromising the T1 estimates by more than 10 ms. The developed cloud-based cluster and optimization of the parameter set reduced
Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S
2016-12-01
Hepatitis B viral (HBV) infection is strongly associated with an increased risk of liver diseases like cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between aberrant viral proteins release and human immune system. Deletions finding on the HBV whole genome sequences is thus a very important issue though there exist underlying the challenges in mining such big and complex biological data. Although some next generation sequencing (NGS) tools are recently designed for identifying structural variations such as insertions or deletions, their validity is generally committed to human sequences study. This design may not be suitable for viruses due to different species. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction multiple data instructions, sequencing reads are referred to as multiple data and the deletion finding procedure is referred to as a single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on 5 synthetic and 1 real datasets. Our results suggest that DeF-GPU outperforms the existing commonly-used method Pindel and is able to exactly identify the deletions of our ground truth in few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/.
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research
Jia, X; Folkerts, M; Shi, F; Yan, H; Yan, Y; Jiang, S; Sivagnanam, S; Majumdar, A
2014-06-01
Purpose: Graphics Processing Unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to objectively benchmark their codes. Moreover, it is quite often to see repeated efforts spent on developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross comparing them, and facilitating code distributions in radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purpose. After the code running is complete, the system automatically summarizes and displays the computing results. We also released a SDK to allow the developers to build their own algorithm implementation and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: It is found that the developed system is fully functioning. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing the clinical researchers and developers to access the GPUs and GPU codes. This development will facilitate the utilization of GPU in radiation therapy field.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL
Miki, T; Wang, X; Aoki, T; Imai, Y; Ishikawa, T; Takase, K; Yamaguchi, T
2012-01-01
In this paper, we propose a novel patient-specific method of modelling pulmonary airflow using graphics processing unit (GPU) computation that can be applied in medical practice. To overcome the barriers imposed by computation speed, installation price and footprint to the application of computational fluid dynamics, we focused on GPU computation and the lattice Boltzmann method (LBM). The GPU computation and LBM are compatible due to the characteristics of the GPU. As the optimisation of data access is essential for the performance of the GPU computation, we developed an adaptive meshing method, in which an airway model is covered by isotropic subdomains consisting of a uniform Cartesian mesh. We found that 4(3) size subdomains gave the best performance. The code was also tested on a small GPU cluster to confirm its performance and applicability, as the price and footprint are reasonable for medical applications.
Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture
NASA Astrophysics Data System (ADS)
Govender, Nicolin; Wilke, Daniel N.; Kok, Schalk
Blaze-DEMGPU is a modular GPU based discrete element method (DEM) framework that supports polyhedral shaped particles. The high level performance is attributed to the light weight and Single Instruction Multiple Data (SIMD) that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.
Supporting Real-Time Computer Vision Workloads using OpenVX on Multicore+GPU Platforms
2015-05-01
first acquire one of several tokens associated with that GPU using a locking protocol. Once a token has been acquired, a task may acquire an engine... tokens per GPU is configurable. Also, lock wait queues may be configured so that tasks wait in FIFO order or priority order. Other configuration...task may be assigned a token for any GPU in such a cluster) and the cluster size is configurable. GPUSync can be used in conjunction with various
Fast computer simulation of reconstructed image from rainbow hologram based on GPU
NASA Astrophysics Data System (ADS)
Shuming, Jiao; Yoshikawa, Hiroshi
2015-10-01
A fast computer simulation solution for rainbow hologram reconstruction based on GPU is proposed. In the commonly used segment Fourier transform method for rainbow hologram reconstruction, the computation of 2D Fourier transform on each hologram segment is very time consuming. GPU-based parallel computing can be applied to improve the computing speed. Compared with CPU computing, simulation results indicate that our proposed GPU computing can effectively reduce the computation time by as much as eight times.
Combating the Reliability Challenge of GPU Register File at Low Supply Voltage
Tan, Jingweijia; Song, Shuaiwen; Yan, Kaige; Fu, Xin; Marquez, Andres; Kerbyson, Darren J.
2016-09-11
Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. We propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from unreliable register file at low voltages.
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
NASA Astrophysics Data System (ADS)
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate the dendritic growth with motion of solids in a dilute binary alloy melt. The GPU accelerated PFLB method has been implemented using CUDA C. The equiaxed dendritic growth in a shear flow and settling condition have been simulated by the developed GPU code. It has been confirmed that the PFLB simulations were efficiently accelerated by introducing the GPU computation. The characteristic dendrite morphologies which depend on the melt flow and the motion of the dendrite could also be confirmed by the simulations.
Nagaoka, Tomoaki; Watanabe, Soichi
2012-01-01
Electromagnetic simulation with anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. The seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation on multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU clusters is faster than that on a multi-GPU (a single workstation), and we also found that the GPU cluster system calculate faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because were able to use GPU memory of over 100 GB.
GPU accelerated FDTD solver and its application in MRI.
Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S
2010-01-01
The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B(1) profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
GPU and APU computations of Finite Time Lyapunov Exponent fields
NASA Astrophysics Data System (ADS)
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
Implementation and Optimization of Image Processing Algorithms on Embedded GPU
NASA Astrophysics Data System (ADS)
Singhal, Nitin; Yoo, Jin Woo; Choi, Ho Yeol; Park, In Kyu
In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on embedded GPU using OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantage when compared to embedded CPU. Additionally, we propose techniques to achieve increased performance with optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of the execution time and speed-up achieved in comparison with the implementation on embedded CPU.
Rapid Parallel Calculation of shell Element Based On GPU
NASA Astrophysics Data System (ADS)
Wanga, Jian Hua; Lia, Guang Yao; Lib, Sheng; Li, Guang Yao
2010-06-01
Long computing time bottlenecked the application of finite element. In this paper, an effective method to speed up the FEM calculation by using the existing modern graphic processing unit and programmable colored rendering tool was put forward, which devised the representation of unit information in accordance with the features of GPU, converted all the unit calculation into film rendering process, solved the simulation work of all the unit calculation of the internal force, and overcame the shortcomings of lowly parallel level appeared ever before when it run in a single computer. Studies shown that this method could improve efficiency and shorten calculating hours greatly. The results of emulation calculation about the elasticity problem of large number cells in the sheet metal proved that using the GPU parallel simulation calculation was faster than using the CPU's. It is useful and efficient to solve the project problems in this way.
Performance potential for simulating spin models on GPU
NASA Astrophysics Data System (ADS)
Weigel, Martin
2012-04-01
Graphics processing units (GPUs) are recently being used to an increasing degree for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration allows to harvest the computational power of GPU systems. A number of examples, ranging from Metropolis simulations of ferromagnetic Ising models, over continuous Heisenberg and disordered spin-glass systems to parallel-tempering simulations are discussed. Significant speed-ups by factors of up to 1000 compared to serial CPU code as well as previous GPU implementations are observed.
GPU Based Software Correlators - Perspectives for VLBI2010
NASA Technical Reports Server (NTRS)
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved to massive parallel processing systems which entered the area of non-graphic related applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms the classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
GPU-Accelerated Molecular Modeling Coming Of Age
Stone, John E.; Hardy, David J.; Ufimtsev, Ivan S.
2010-01-01
Graphics processing units (GPUs) have traditionally been used in molecular modeling solely for visualization of molecular structures and animation of trajectories resulting from molecular dynamics simulations. Modern GPUs have evolved into fully programmable, massively parallel co-processors that can now be exploited to accelerate many scientific computations, typically providing about one order of magnitude speedup over CPU code and in special cases providing speedups of two orders of magnitude. This paper surveys the development of molecular modeling algorithms that leverage GPU computing, the advances already made and remaining issues to be resolved, and the continuing evolution of GPU technology that promises to become even more useful to molecular modeling. Hardware acceleration with commodity GPUs is expected to benefit the overall computational biology community by bringing teraflops performance to desktop workstations and in some cases potentially changing what were formerly batch-mode computational jobs into interactive tasks. PMID:20675161
Interactive brain shift compensation using GPU based programming
NASA Astrophysics Data System (ADS)
van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
2009-02-01
Processing large images files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphic process units (GPUs) has increased significantly. With the pixel shader model 4.0 the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively raytraced 3D brain image. GPU based programming enables real-time processing of high definition image datasets and various applications can be developed in medicine, optics and image sciences.
GPU accelerating technique for rendering implicitly represented vasculatures.
Hong, Qingqi; Wang, Beizhan; Li, Qingde; Li, Yan; Wu, Qingqiang
2014-01-01
With the flooding datasets of medical Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), implicit modeling techniques are increasingly applied to reconstruct the human organs, especially the vasculature. However, displaying implicitly represented geometric objects arises heavy computational burden. In this study, a Graphics Processing Unit (GPU) accelerating technique was developed for high performance rendering of implicitly represented objects, especially the vasculatures. The experimental results suggested that the rendering performance was greatly enhanced via exploiting the advantages of modern GPUs.
GPU Lossless Hyperspectral Data Compression System for Space Applications
NASA Technical Reports Server (NTRS)
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
A GPU-computing Approach to Solar Stokes Profile Inversion
NASA Astrophysics Data System (ADS)
Harker, Brian J.; Mighell, Kenneth J.
2012-09-01
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS, employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disk maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel GA with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disk vector magnetograms derived by this method are shown using SOLIS/VSM data observed on 2008 March 28 at 15:45 UT.
Implementation of GPU-accelerated back projection for EPR imaging.
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2015-01-01
Electron paramagnetic resonance (EPR) Imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration, however it is not flexible (requires equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a GPU (Graphics Processing Units) having 1152 cores to accelerate the process. The GPU implementation results in acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time sensitive applications.
GPU accelerated processing of astronomical high frame-rate videosequences
NASA Astrophysics Data System (ADS)
Vítek, Stanislav; Švihlík, Jan; Krasula, Lukáš; Fliegel, Karel; Páta, Petr
2015-09-01
Astronomical instruments located around the world are producing an incredibly large amount of possibly interesting scientific data. Astronomical research is expanding into large and highly sensitive telescopes. Total volume of data rates per night of operations also increases with the quality and resolution of state-of-the-art CCD/CMOS detectors. Since many of the ground-based astronomical experiments are placed in remote locations with limited access to the Internet, it is necessary to solve the problem of the data storage. It mostly means that current data acquistion, processing and analyses algorithm require review. Decision about importance of the data has to be taken in very short time. This work deals with GPU accelerated processing of high frame-rate astronomical video-sequences, mostly originating from experiment MAIA (Meteor Automatic Imager and Analyser), an instrument primarily focused to observing of faint meteoric events with a high time resolution. The instrument with price bellow 2000 euro consists of image intensifier and gigabite ethernet camera running at 61 fps. With resolution better than VGA the system produces up to 2TB of scientifically valuable video data per night. Main goal of the paper is not to optimize any GPU algorithm, but to propose and evaluate parallel GPU algorithms able to process huge amount of video-sequences in order to delete all uninteresting data.
Developing a portable GPU library for hyperspectral image processing
NASA Astrophysics Data System (ADS)
Pérez-Irizarry, Gabriel J.; De La Cruz-Sanchez, Francisco; Landrón-Rivera, Brian A.; Santiago, Nayda G.; Velez-Reyes, Miguel
2012-06-01
The increasing volume of data produced by hyperspectral image sensors have forced researches and developers to seek out new and more ecient ways of analyzing the data as quick as possible. Medical, scientic, and military applications present performance requirements for tools that perform operations on hyperspectral sensor data. By providing a hyperspectral image analysis library, we aim to accelerate hyperspectral image application development. Development of a cross-platform library, Libdect, with GPU support for hyperspectral image analysis is presented. Coupling library development with ecient hyperspectral algorithms escalates into a signicant time invest- ment in many projects or prototypes. Provided a solution to these issues, developers can implement hyperspectral image analysis applications in less time. Developers will not be focused on implementing target detection code and potential issues related to platform or GPU architecture dierences. Libdect's development team counts with previously implemented detection algorithms. By utilizing proven tools, such as CMake and CTest, to develop Libdect's infrastructure, we were able to develop and test a prototype library that provides target detection code with GPU support on Linux platforms. As a whole, Libdect is an early prototype of an open and documented example of Software Engineering practices and tools. They are put together in an eort to increase developer productivity and encourage new developers into the eld of hyperspectral image application development.
GPU acceleration of Dock6's Amber scoring computation.
Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu
2010-01-01
Dressing the problem of virtual screening is a long-term goal in the drug discovery field, which if properly solved, can significantly shorten new drugs' R&D cycle. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually takes several weeks or even months to be finished. This time-consuming procedure is unacceptable, especially when highly fatal and infectious virus arises such as SARS and H1N1, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of GPU to accelerate Dock6's (http://dock.compbio.ucsf.edu/DOCK_6/) Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring with NVIDIA CUDA (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008) (Compute Unified Device Architecture) platform. We also discuss many factors that will greatly influence the performance after porting the Amber scoring to GPU, including thread management, data transfer, and divergence hidden. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.
NASA Astrophysics Data System (ADS)
Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.
2016-10-01
We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphic processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Y b 3 + : Y AG ceramics show perfect agreement.
GPU-based Integration with Application in Sensitivity Analysis
NASA Astrophysics Data System (ADS)
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies) which aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of Environmental modelling and Environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned to, qualitatively or quantitatively, different sources of variation in the input of a model. One of the important classes of methods for Sensitivity Analysis are Monte Carlo based, first proposed by Sobol, and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 109 equations at every time-step, and the number of time-steps can be more than a million) which motivates its grid implementation. MCSAES grid implementation scheme includes two main tasks: (i) Grid implementation of the DEM, (ii) Grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but for parallel implementations. Good scrambling is
GASPRNG: GPU accelerated scalable parallel random number generator library
NASA Astrophysics Data System (ADS)
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs to calculate rapidly diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08ms. The transfer time from CPU to GPU memory currently is a limiting factor with GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach still can lead to processing that is ~3400 times faster than other GPU-based approaches.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU with up to a 22x speedup, depending on computational tasks. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s) To our best knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
GPU-centric resolved-particle disperse two-phase flow simulation using the Physalis method
NASA Astrophysics Data System (ADS)
Sierakowski, Adam J.
2016-10-01
We present work on a new implementation of the Physalis method for resolved-particle disperse two-phase flow simulations. We discuss specifically our GPU-centric programming model that avoids all device-host data communication during the simulation. Summarizing the details underlying the implementation of the Physalis method, we illustrate the application of two GPU-centric parallelization paradigms and record insights on how to best leverage the GPU's prioritization of bandwidth over latency. We perform a comparison of the computational efficiency between the current GPU-centric implementation and a legacy serial-CPU-optimized code and conclude that the GPU hardware accounts for run time improvements up to a factor of 60 by carefully normalizing the run times of both codes.
NASA Astrophysics Data System (ADS)
Mu, Dawei; Chen, Po; Wang, Liqiang
2013-02-01
We have successfully ported an arbitrary high-order discontinuous Galerkin (ADER-DG) method for solving the three-dimensional elastic seismic wave equation on unstructured tetrahedral meshes to an Nvidia Tesla C2075 GPU using the Nvidia CUDA programming model. On average our implementation obtained a speedup factor of about 24.3 for the single-precision version of our GPU code and a speedup factor of about 12.8 for the double-precision version of our GPU code when compared with the double precision serial CPU code running on one Intel Xeon W5880 core. When compared with the parallel CPU code running on two, four and eight cores, the speedup factor of our single-precision GPU code is around 12.9, 6.8 and 3.6, respectively. In this article, we give a brief summary of the ADER-DG method, a short introduction to the CUDA programming model and a description of our CUDA implementation and optimization of the ADER-DG method on the GPU. To our knowledge, this is the first study that explores the potential of accelerating the ADER-DG method for seismic wave-propagation simulations using a GPU.
MRISIMUL: a GPU-based parallel approach to MRI simulations.
Xanthis, Christos G; Venetis, Ioannis E; Chalkias, A V; Aletras, Anthony H
2014-03-01
A new step-by-step comprehensive MR physics simulator (MRISIMUL) of the Bloch equations is presented. The aim was to develop a magnetic resonance imaging (MRI) simulator that makes no assumptions with respect to the underlying pulse sequence and also allows for complex large-scale analysis on a single computer without requiring simplifications of the MRI model. We hypothesized that such a simulation platform could be developed with parallel acceleration of the executable core within the graphic processing unit (GPU) environment. MRISIMUL integrates realistic aspects of the MRI experiment from signal generation to image formation and solves the entire complex problem for densely spaced isochromats and for a densely spaced time axis. The simulation platform was developed in MATLAB whereas the computationally demanding core services were developed in CUDA-C. The MRISIMUL simulator imaged three different computer models: a user-defined phantom, a human brain model and a human heart model. The high computational power of GPU-based simulations was compared against other computer configurations. A speedup of about 228 times was achieved when compared to serially executed C-code on the CPU whereas a speedup between 31 to 115 times was achieved when compared to the OpenMP parallel executed C-code on the CPU, depending on the number of threads used in multithreading (2-8 threads). The high performance of MRISIMUL allows its application in large-scale analysis and can bring the computational power of a supercomputer or a large computer cluster to a single GPU personal computer.
GPU Accelerated Numerical Simulation of Viscous Flow Down a Slope
NASA Astrophysics Data System (ADS)
Gygax, Remo; Räss, Ludovic; Omlin, Samuel; Podladchikov, Yuri; Jaboyedoff, Michel
2014-05-01
Numerical simulations are an effective tool in natural risk analysis. They are useful to determine the propagation and the runout distance of gravity driven movements such as debris flows or landslides. To evaluate these processes an approach on analogue laboratory experiments and a GPU accelerated numerical simulation of the flow of a viscous liquid down an inclined slope is considered. The physical processes underlying large gravity driven flows share certain aspects with the propagation of debris mass in a rockslide and the spreading of water waves. Several studies have shown that the numerical implementation of the physical processes of viscous flow produce a good fit with the observation of experiments in laboratory in both a quantitative and a qualitative way. When considering a process that is this far explored we can concentrate on its numerical transcription and the application of the code in a GPU accelerated environment to obtain a 3D simulation. The objective of providing a numerical solution in high resolution by NVIDIA-CUDA GPU parallel processing is to increase the speed of the simulation and the accuracy on the prediction. The main goal is to write an easily adaptable and as short as possible code on the widely used platform MATLAB, which will be translated to C-CUDA to achieve higher resolution and processing speed while running on a NVIDIA graphics card cluster. The numerical model, based on the finite difference scheme, is compared to analogue laboratory experiments. This way our numerical model parameters are adjusted to reproduce the effective movements observed by high-speed camera acquisitions during the laboratory experiments.
An evaluation of GPU acceleration for sparse reconstruction
NASA Astrophysics Data System (ADS)
Braun, Thomas R.
2010-04-01
Image processing applications typically parallelize well. This gives a developer interested in data throughput several different implementation options, including multiprocessor machines, general purpose computation on the graphics processor, and custom gate-array designs. Herein, we will investigate these first two options for dictionary learning and sparse reconstruction, specifically focusing on the K-SVD algorithm for dictionary learning and the Batch Orthogonal Matching Pursuit for sparse reconstruction. These methods have been shown to provide state of the art results for image denoising, classification, and object recognition. We'll explore the GPU implementation and show that GPUs are not significantly better or worse than CPUs for this application.
A GPU code for analytic continuation through a sampling method
NASA Astrophysics Data System (ADS)
Nordström, Johan; Schött, Johan; Locht, Inka L. M.; Di Marco, Igor
We here present a code for performing analytic continuation of fermionic Green's functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is based on the sampling method introduced by Mishchenko et al. (2000), and is written for the widely used CUDA platform from NVidia. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.
Engineering a fully GPU-accelerated H.264 encoder
NASA Astrophysics Data System (ADS)
Li, Bowei; Deng, Yangdong S.
2013-07-01
H.264/AVC is the most popular video coding standard and playing an essential role in today's Internet based content-delivery businesses. H.264's encoding process is highly computationally expensive due to the integration of complex video coding techniques. As a result, transcoding has become a bottleneck of content-hosting services. Recently, general purpose computing on graphics processing units (GPUs) is rapidly rising as a popular computing model to expedite time-consuming applications. In this paper, we propose a fully GPU-accelerated H.264 encoder. Experimental results show that a 100% speed-up ratio can be achieved.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
GPU accelerated simulations of bluff body flows using vortex particle methods
NASA Astrophysics Data System (ADS)
Rossinelli, Diego; Bergdorf, Michael; Cottet, Georges-Henri; Koumoutsakos, Petros
2010-05-01
We present a GPU accelerated solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL so as to perform efficient particle-grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations and their relative advantages/drawbacks over CPU based computations are reported in simulations of flows past an impulsively started circular cylinder from Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speed up of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Re number of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations but this agreement deteriorates (albeit remaining to within 5% in drag calculations) for higher Re numbers as the single precision of the GPU adversely affects the accuracy of the simulations.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
GPU-accelerated Monte Carlo simulation of particle coagulation based on the inverse method
NASA Astrophysics Data System (ADS)
Wei, J.; Kruis, F. E.
2013-09-01
Simulating particle coagulation using Monte Carlo methods is in general a challenging computational task due to its numerical complexity and the computing cost. Currently, the lowest computing costs are obtained when applying a graphic processing unit (GPU) originally developed for speeding up graphic processing in the consumer market. In this article we present an implementation of accelerating a Monte Carlo method based on the Inverse scheme for simulating particle coagulation on the GPU. The abundant data parallelism embedded within the Monte Carlo method is explained as it will allow an efficient parallelization of the MC code on the GPU. Furthermore, the computation accuracy of the MC on GPU was validated with a benchmark, a CPU-based discrete-sectional method. To evaluate the performance gains by using the GPU, the computing time on the GPU against its sequential counterpart on the CPU were compared. The measured speedups show that the GPU can accelerate the execution of the MC code by a factor 10-100, depending on the chosen particle number of simulation particles. The algorithm shows a linear dependence of computing time with the number of simulation particles, which is a remarkable result in view of the n2 dependence of the coagulation.
Ultrafast convolution/superposition using tabulated and exponential kernels on GPU
Chen Quan; Chen Mingli; Lu Weiguo
2011-03-15
Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphic processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write-conflicts and has linear computation complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in literature. The authors have demonstrated that the use of exponential kernels can reduce the computation complexity by order of a dimension and achieve excellent accuracy. Special attentions are paid to the unique architecture of GPU, especially the memory accessing pattern, which increases performance by more than tenfold. Results: As a result, the tabulated kernel implementation in GPU is two to three times faster than other GPU implementations reported in literature. The implementation of CCCS showed significant speedup on GPU over single core CPU. On tabulated CCK, speedups as high as 70 are observed; on exponential CCK, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCK is 1000-3000 times faster over a highly optimized single-threaded CPU implementation using tabulated CCK, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-01-01
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate. PMID:27070606
GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid
NASA Astrophysics Data System (ADS)
Luo, Xisheng; Wang, Luying; Ran, Wei; Qin, Fenghua
2016-10-01
A GPU accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, the cell-based adaptive mesh refinement (AMR) is fully implemented on GPU for the unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between GPU and CPU. Specifically, the AMR is processed with atomic operations to parallelize list operations, and null memory recycling is realized to improve the efficiency of memory utilization. It is found that results obtained by GPUs agree very well with the exact or experimental results in literature. An acceleration ratio of 4 is obtained between the parallel code running on the old GPU GT9800 and the serial code running on E3-1230 V2. With the optimization of configuring a larger L1 cache and adopting Shared Memory based atomic operations on the newer GPU C2050, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR processes have achieved 2x speedup on GT9800 and 18x on Tesla C2050, which demonstrates that parallel running of the cell-based AMR method on GPU is feasible and efficient. Our results also indicate that the new development of GPU architecture benefits the fluid dynamics computing significantly.
NASA Astrophysics Data System (ADS)
Ji, Zhe; Xu, Fei; Takahashi, Akiyuki; Sun, Yu
2016-12-01
In this paper, a Weakly Compressible Smoothed Particle Hydrodynamics (WCSPH) framework is presented utilizing the parallel architecture of single- and multi-GPU (Graphic Processing Unit) platforms. The program is developed for water entry simulations where an efficient potential based contact force is introduced to tackle the interaction between fluid and solid particles. The single-GPU SPH scheme is implemented with a series of optimization to achieve high performance. To go beyond the memory limitation of single GPU, the scheme is further extended to multi-GPU platform basing on an improved 3D domain decomposition and inter-node data communication strategy. A typical benchmark test of wedge entry is investigated in varied dimensions and scales to validate the accuracy and efficiency of the program. The results of 2D and 3D benchmark tests manifest great consistency with the experiment and better accuracy than other numerical models. The performance of the single-GPU code is assessed by comparing with serial and parallel CPU codes. The improvement of the domain decomposition strategy is verified, and a study on the scalability and efficiency of the multi-GPU code is carried out as well by simulating tests with varied scales in different amount of GPUs. Lastly, the single- and multi-GPU codes are further compared with existing state-of-the-art SPH parallel frameworks for a comprehensive assessment.
Large Data Visualization on Distributed Memory Mulit-GPU Clusters
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualization on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
GPU acceleration of simplex volume algorithm for hyperspectral endmember extraction
NASA Astrophysics Data System (ADS)
Qu, Haicheng; Zhang, Junping; Lin, Zhouhan; Chen, Hao; Huang, Bormin
2012-10-01
The simplex volume algorithm (SVA)1 is an endmember extraction algorithm based on the geometrical properties of a simplex in the feature space of hyperspectral image. By utilizing the relation between a simplex volume and its corresponding parallelohedron volume in the high-dimensional space, the algorithm extracts endmembers from the initial hyperspectral image directly without the need of dimension reduction. It thus avoids the drawback of the N-FINDER algorithm, which requires the dimension of the data to be reduced to one less than the number of the endmembers. In this paper, we take advantage of the large-scale parallelism of CUDA (Compute Unified Device Architecture) to accelerate the computation of SVA on the NVidia GeForce 560 GPU. The time for computing a simplex volume increases with the number of endmembers. Experimental results show that the proposed GPU-based SVA achieves a significant 112.56x speedup for extracting 16 endmembers, as compared to its CPU-based single-threaded counterpart.
True 4D Image Denoising on the GPU.
Eklund, Anders; Andersson, Mats; Knutsson, Hans
2011-01-01
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose (noisy) computed tomography (CT) data. While 3D image denoising previously has been applied to several volumes independently, there has not been much work done on true 4D image denoising, where the algorithm considers several volumes at the same time. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset of the resolution 512 × 512 × 445 × 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT-based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT-based filtering. The short processing time increases the clinical value of true 4D image denoising significantly.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the most de novo techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. This has become a state strategic technique and now plays a key role in many areas of exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations could require enormous amounts of computer memory and time and therefore their execution has been a big problem. Despite the huge computational cost, molecular dynamics have been implemented using traditional computers with a central memory unit (CPU). A graphics processing unit (GPU) computing technology was first designed with the goal to improve video games, by rapidly creating and displaying images in a frame buffer such as screens. The hybrid GPU-CPU implementation, combined with parallel computing is a novel technology to perform a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations including MD simulations. Herein, we describe the new methodologies developed initially as video games and how they are now applied in MD simulations. PMID:27525251
Radial basis function networks GPU-based implementation.
Brandstetter, Andreas; Artusi, Alessandro
2008-12-01
Neural networks (NNs) have been used in several areas, showing their potential but also their limitations. One of the main limitations is the long time required for the training process; this is not useful in the case of a fast training process being required to respond to changes in the application domain. A possible way to accelerate the learning process of an NN is to implement it in hardware, but due to the high cost and the reduced flexibility of the original central processing unit (CPU) implementation, this solution is often not chosen. Recently, the power of the graphic processing unit (GPU), on the market, has increased and it has started to be used in many applications. In particular, a kind of NN named radial basis function network (RBFN) has been used extensively, proving its power. However, their limiting time performances reduce their application in many areas. In this brief paper, we describe a GPU implementation of the entire learning process of an RBFN showing the ability to reduce the computational cost by about two orders of magnitude with respect to its CPU implementation.
Meso-Scale Radioactive Dispersion Modelling using GPU
NASA Astrophysics Data System (ADS)
Sunarko; Suud, Zaki
2017-01-01
Lagrangian Particle Dispersion Method (LPDM) is applied to model atmospheric dispersion of radioactive material in a meso-scale of a few tens of kilometers for site study purpose. Empirical relationships are used to determine the dispersion coefficient for various atmospheric stabilities. Diagnostic 3-D wind field is created based on data from a meteorological station using mass-conservation principle. Particles imitating radioactive pollutant are dispersed in the wind-field as a point source. Time-integrated air concentration is calculated using kernel density estimator (KDE) in the lowest layer of the atmosphere. Parallel code is developed for GTX-660Ti GPU with a total of 1344 scalar processors using CUDA programming. Significant speedup of about 20 times is achieved compared to the serial version of the code while accuracy is kept at reasonable level. Only small differences in particle positions and grid doses are observed when using the same sets of random number and meteorological data in both CPU and GPU versions of the code.
GPU Accelerated Spectral Element Methods: 3D Euler equations
NASA Astrophysics Data System (ADS)
Abdi, D. S.; Wilcox, L.; Giraldo, F.; Warburton, T.
2015-12-01
A GPU accelerated nodal discontinuous Galerkin method for the solution of three dimensional Euler equations is presented. The Euler equations are nonlinear hyperbolic equations that are widely used in Numerical Weather Prediction (NWP). Therefore, acceleration of the method plays an important practical role in not only getting daily forecasts faster but also in obtaining more accurate (high resolution) results. The equation sets used in our atomospheric model NUMA (non-hydrostatic unified model of the atmosphere) take into consideration non-hydrostatic effects that become more important with high resolution. We use algorithms suitable for the single instruction multiple thread (SIMT) architecture of GPUs to accelerate solution by an order of magnitude (20x) relative to CPU implementation. For portability to heterogeneous computing environment, we use a new programming language OCCA, which can be cross-compiled to either OpenCL, CUDA or OpenMP at runtime. Finally, the accuracy and performance of our GPU implementations are veried using several benchmark problems representative of different scales of atmospheric dynamics.
GPU implementation of the simplex identification via split augmented Lagrangian
NASA Astrophysics Data System (ADS)
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of a unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at low level, using shared memory and coalesced accesses to memory. The results herein presented indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the methods accuracy.
Electromagnetic metamaterial simulations using a GPU-accelerated FDTD method
NASA Astrophysics Data System (ADS)
Seok, Myung-Su; Lee, Min-Gon; Yoo, SeokJae; Park, Q.-Han
2015-12-01
Metamaterials composed of artificial subwavelength structures exhibit extraordinary properties that cannot be found in nature. Designing artificial structures having exceptional properties plays a pivotal role in current metamaterial research. We present a new numerical simulation scheme for metamaterial research. The scheme is based on a graphic processing unit (GPU)-accelerated finite-difference time-domain (FDTD) method. The FDTD computation can be significantly accelerated when GPUs are used instead of only central processing units (CPUs). We explain how the fast FDTD simulation of large-scale metamaterials can be achieved through communication optimization in a heterogeneous CPU/GPU-based computer cluster. Our method also includes various advanced FDTD techniques: the non-uniform grid technique, the total-field/scattered-field (TFSF) technique, the auxiliary field technique for dispersive materials, the running discrete Fourier transform, and the complex structure setting. We demonstrate the power of our new FDTD simulation scheme by simulating the negative refraction of light in a coaxial waveguide metamaterial.
Molecular dynamics simulations through GPU video games technologies.
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the most de novo techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. This has become a state strategic technique and now plays a key role in many areas of exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations could require enormous amounts of computer memory and time and therefore their execution has been a big problem. Despite the huge computational cost, molecular dynamics have been implemented using traditional computers with a central memory unit (CPU). A graphics processing unit (GPU) computing technology was first designed with the goal to improve video games, by rapidly creating and displaying images in a frame buffer such as screens. The hybrid GPU-CPU implementation, combined with parallel computing is a novel technology to perform a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations including MD simulations. Herein, we describe the new methodologies developed initially as video games and how they are now applied in MD simulations.
GPU Implementation of Stokes Equation with Strongly Variable Coefficients
NASA Astrophysics Data System (ADS)
Zheng, L.; Gerya, T.; Yuen, D. A.; Knepley, M. G.; Zhang, H.; Shi, Y.
2010-12-01
Solving Stokes flow problem is commonplace for numerical modeling of geodynamical processes as the lithosphere and mantle can be always regarded as incompressible for long geological time scales. For Stokes flow the Reynold Number is effectively zero so that we can ignore advective transport of momentum thus resulting in the slowly creeping flow called Stokes’ equation. Because of the ill-conditioned matrix due to the saddle points in the matrix system coupling mass and momentum partial differential equations it is still extremely to efficiently solve this elliptic PDE system with strongly variable coefficients due to rheology which is coupled to the conservation of mass equation. Since NVIDIA issued the CUDA programming framework in 2007 scientists can use commodity CPU-GPU system to do such geodynamical simulation efficiently with the advantage of CPU and GPU respectively. We implemented a finite difference solver for Stokes Equations with variable viscosity based on CUDA using geometry multi-grid methods in staggered grids. In 2D version we use a mixture of Jacobi and Gauss-Seidel iteration with SOR as the smoother. In 3D version we use the Red-Black updating method to avoid the problem of disordered threads. We are also considering the deployment of the multigrid solver as a preconditioner for Krylov subspace scheme within the PETSc.
Illustrative volume visualization using GPU-based particle systems.
van Pelt, Roy; Vilanova, Anna; van de Wetering, Huub
2010-01-01
Illustrative techniques are generally applied to produce stylized renderings. Various illustrative styles have been applied to volumetric data sets, producing clearer images and effectively conveying visual information. We adopt particle systems to produce user-configurable stylized renderings from the volume data, imitating traditional pen-and-ink drawings. In the following, we present an interactive GPU-based illustrative volume rendering framework, called VolFliesGPU. In this framework, isosurfaces are sampled by evenly distributed particle sets, delineating surface shape by illustrative styles. The appearance of these styles is based on locally-measured surface properties. For instance, hatches convey surface shape by orientation and shape characteristics are enhanced by color, mapped using a curvature-based transfer function. Hidden-surfaces are generally removed to avoid visual clutter, after that a combination of styles is applied per isosurface. Multiple surfaces and styles can be explored interactively, exploiting parallelism in both graphics hardware and particle systems. We achieve real-time interaction and prompt parametrization of the illustrative styles, using an intuitive GPGPU paradigm that delivers the computational power to drive our particle system and visualization algorithms.
Application of GPU processing for Brownian particle simulation
NASA Astrophysics Data System (ADS)
Cheng, Way Lee; Sheharyar, Ali; Sadr, Reza; Bouhali, Othmane
2015-01-01
Reports on the anomalous thermal-fluid properties of nanofluids (dilute suspension of nano-particles in a base fluid) have been the subject of attention for 15 years. The underlying physics that govern nanofluid behavior, however, is not fully understood and is a subject of much dispute. The interactions between the suspended particles and the base fluid have been cited as a major contributor to the improvement in heat transfer reported in the literature. Numerical simulations are instrumental in studying the behavior of nanofluids. However, such simulations can be computationally intensive due to the small dimensions and complexity of these problems. In this study, a simplified computational approach for isothermal nanofluid simulations was applied, and simulations were conducted using both conventional CPU and parallel GPU implementations. The GPU implementations significantly improved the computational performance, in terms of the simulation time, by a factor of 1000-2500. The results of this investigation show that, as the computational load increases, the simulation efficiency approaches a constant. At a very high computational load, the amount of improvement may even decrease due to limited system memory.
GPU-Based Tracking Algorithms for the ATLAS High-Level Trigger
NASA Astrophysics Data System (ADS)
Emeliyanov, D.; Howard, J.
2012-12-01
Results on the performance and viability of data-parallel algorithms on Graphics Processing Units (GPUs) in the ATLAS Level 2 trigger system are presented. We describe the existing trigger data preparation and track reconstruction algorithms, motivation for their optimization, GPU-parallelized versions of these algorithms, and a “client-server” solution for hybrid CPU/GPU event processing used for integration of the GPU-oriented algorithms into existing ATLAS trigger software. The resulting speed-up of event processing times obtained with high-luminosity simulated data is presented and discussed.
gEMpicker: a highly parallel GPU-accelerated particle picking tool for cryo-electron microscopy
2013-01-01
Background Picking images of particles in cryo-electron micrographs is an important step in solving the 3D structures of large macromolecular assemblies. However, in order to achieve sub-nanometre resolution it is often necessary to capture and process many thousands or even several millions of 2D particle images. Thus, a computational bottleneck in reaching high resolution is the accurate and automatic picking of particles from raw cryo-electron micrographs. Results We have developed “gEMpicker”, a highly parallel correlation-based particle picking tool. To our knowledge, gEMpicker is the first particle picking program to use multiple graphics processor units (GPUs) to accelerate the calculation. When tested on the publicly available keyhole limpet hemocyanin dataset, we find that gEMpicker gives similar results to the FindEM program. However, compared to calculating correlations on one core of a contemporary central processor unit (CPU), running gEMpicker on a modern GPU gives a speed-up of about 27 ×. To achieve even higher processing speeds, the basic correlation calculations are accelerated considerably by using a hierarchy of parallel programming techniques to distribute the calculation over multiple GPUs and CPU cores attached to multiple nodes of a computer cluster. By using a theoretically optimal reduction algorithm to collect and combine the cluster calculation results, the speed of the overall calculation scales almost linearly with the number of cluster nodes available. Conclusions The very high picking throughput that is now possible using GPU-powered workstations or computer clusters will help experimentalists to achieve higher resolution 3D reconstructions more rapidly than before. PMID:24144335
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
GPU-based parallel group ICA for functional magnetic resonance data.
Jing, Yanshan; Zeng, Weiming; Wang, Nizhuan; Ren, Tianlong; Shi, Yingchao; Yin, Jun; Xu, Qi
2015-04-01
The goal of our study is to develop a fast parallel implementation of group independent component analysis (ICA) for functional magnetic resonance imaging (fMRI) data using graphics processing units (GPU). Though ICA has become a standard method to identify brain functional connectivity of the fMRI data, it is computationally intensive, especially has a huge cost for the group data analysis. GPU with higher parallel computation power and lower cost are used for general purpose computing, which could contribute to fMRI data analysis significantly. In this study, a parallel group ICA (PGICA) on GPU, mainly consisting of GPU-based PCA using SVD and Infomax-ICA, is presented. In comparison to the serial group ICA, the proposed method demonstrated both significant speedup with 6-11 times and comparable accuracy of functional networks in our experiments. This proposed method is expected to perform the real-time post-processing for fMRI data analysis.
GPU technology as a platform for accelerating local complexity analysis of protein sequences.
Papadopoulos, Agathoklis; Kirmitzoglou, Ioannis; Promponas, Vasilis J; Theocharides, Theocharis
2013-01-01
The use of GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in Bioinformatics showed promising results [1]. As such a similar approach can be used to speedup other algorithms such as CAST, a popular tool used for masking low-complexity regions (LCRs) in protein sequences [2] with increased sensitivity. We developed and implemented a CUDA-enabled version (GPU_CAST) of the multi-threaded version of CAST software first presented in [3] and optimized in [4]. The proposed software implementation uses the nVIDIA CUDA libraries and the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the CAST algorithm to execute the calculations on the GPU card of the host computer system. The GPU-based implementation presented in this work, is compared against the multi-threaded, multi-core optimized version of CAST [4] and yielded speedups of 5x-10x for large protein sequence datasets.
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.
Nagaoka, Tomoaki; Watanabe, Soichi
2011-01-01
Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 as GPGPU boards. The performance of multi-GPU is evaluated in comparison with that of a single GPU and vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. Calculation speed of the three-dimensional FDTD method using GPUs can significantly improve with an expanding number of GPUs.
Papadopoulos, Agathoklis; Kostoglou, Kyriaki; Mitsis, Georgios D; Theocharides, Theocharis
2015-01-01
The use of a GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in biomedical engineering and biology-related applications have shown promising results. GPU acceleration can be used to speedup computation-intensive models, such as the mathematical modeling of biological systems, which often requires the use of nonlinear modeling approaches with a large number of free parameters. In this context, we developed a CUDA-enabled version of a model which implements a nonlinear identification approach that combines basis expansions and polynomial-type networks, termed Laguerre-Volterra networks and can be used in diverse biological applications. The proposed software implementation uses the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the aforementioned modeling approach to execute the calculations on the GPU card of the host computer system. The initial results of the GPU-based model presented in this work, show performance improvements over the original MATLAB model.
Singular value decomposition for collaborative filtering on a GPU
NASA Astrophysics Data System (ADS)
Kato, Kimikazu; Hosino, Tikara
2010-06-01
A collaborative filtering predicts customers' unknown preferences from known preferences. In a computation of the collaborative filtering, a singular value decomposition (SVD) is needed to reduce the size of a large scale matrix so that the burden for the next phase computation will be decreased. In this application, SVD means a roughly approximated factorization of a given matrix into smaller sized matrices. Webb (a.k.a. Simon Funk) showed an effective algorithm to compute SVD toward a solution of an open competition called "Netflix Prize". The algorithm utilizes an iterative method so that the error of approximation improves in each step of the iteration. We give a GPU version of Webb's algorithm. Our algorithm is implemented in the CUDA and it is shown to be efficient by an experiment.
A GPU Accelerated Simulation Program for Electron Cooling Process
NASA Astrophysics Data System (ADS)
Zhang, He; Huang, He; Li, Rui; Chen, Jie; Luo, Li-Shi
2015-04-01
Electron cooling is essential to achieve high luminosity in the medium energy electron ion collider (MIEC) project at Jefferson Lab. Bunched electron beam with energy above 50 MeV is used to cool coasting and/or bunched ion beams. Although the conventional electron cooling technique has been widely used, such an implementation in MEIC is still challenging. We are developing a simulation program for the electron cooling process to fulfill the need of the electron cooling system design for MEIC. The program simulates the evolution of the ion beam under the intrabeam scattering (IBS) effect and the electron cooling effect using Monte Carlo method. To accelerate the calculation, the program is developed on a GPU platform. We will present some preliminary simulation results. Work supported by the Department of Energy, Laboratory Directed Research and Development Funding, under Contract No. DE-AC05-06OR23177.
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; ...
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies thatmore » important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.« less
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike W.
2015-09-15
In this study, we demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. In addition, this orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
Numerically Tracking Contact Discontinuities with an Introduction for GPU Programming
Davis, Sean L
2012-08-17
We review some of the classic numerical techniques used to analyze contact discontinuities and compare their effectiveness. Several finite difference methods (the Lax-Wendroff method, a Multidimensional Positive Definite Advection Transport Algorithm (MPDATA) method and a Monotone Upstream Scheme for Conservation Laws (MUSCL) scheme with an Artificial Compression Method (ACM)) as well as the finite element Streamlined Upwind Petrov-Galerkin (SUPG) method were considered. These methods were applied to solve the 2D advection equation. Based on our results we concluded that the MUSCL scheme produces the sharpest interfaces but can inappropriately steepen the solution. The SUPG method seems to represent a good balance between stability and interface sharpness without any inappropriate steepening. However, for solutions with discontinuities, the MUSCL scheme is superior. In addition, a preliminary implementation in a GPU program is discussed.
Explicit integration with GPU acceleration for large kinetic networks
Brock, Benjamin; Belt, Andrew; Billings, Jay Jay; Guidry, Mike
2015-12-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in computation time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
BAKER, ZACHARY K.; GOKHALE, MAYA B.; TRIPP, JUSTIN L.
2007-01-08
The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problems of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
GPU Accelerated Discontinuous Galerkin Methods for Shallow Water Equations
NASA Astrophysics Data System (ADS)
Gandham, Rajesh; Medina, David; Warburton, Timothy
2015-07-01
We discuss the development, verification, and performance of a GPU accelerated discontinuous Galerkin method for the solutions of two dimensional nonlinear shallow water equations. The shallow water equations are hyperbolic partial differential equations and are widely used in the simulation of tsunami wave propagations. Our algorithms are tailored to take advantage of the single instruction multiple data (SIMD) architecture of graphic processing units. The time integration is accelerated by local time stepping based on a multi-rate Adams-Bashforth scheme. A total variational bounded limiter is adopted for nonlinear stability of the numerical scheme. This limiter is coupled with a mass and momentum conserving positivity preserving limiter for the special treatment of a dry or partially wet element in the triangulation. Accuracy, robustness and performance are demonstrated with the aid of test cases. We compare the performance of the kernels expressed in a portable threading language OCCA, when cross compiled with OpenCL, CUDA, and OpenMP at runtime.
Inversion of Heavy Current Electroheat Problems on a Graphics Processing Unit (GPU)
2013-11-10
Genetic and Gradient-based Optimization Algorithms for Solving Electromagnetics Problem,” IEEE Trans. Magnetics, Vol. 31 (3), pp. 1932-1935, 1995...and the optimization, we use the GPU to perform the electroheat optimization by the genetic algorithm to achieve computational efficiencies better...part problem and the optimization, we use the GPU to perform the electroheat optimization by the genetic algorithm to achieve computational
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing
NASA Astrophysics Data System (ADS)
Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.
2014-12-01
After the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information is becoming important to understand the earthquake phenomena. On the other hand, to date, the quantity of seismic data becomes enormous as a progress of high accuracy observation network; we need to treat many parameters (e.g., positional information, origin time, magnitude, etc.) to efficiently display the seismic information. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, GPU (Graphic Processing Unit) is used as an acceleration tool for data processing and calculation in various study fields. This movement is called GPGPU (General Purpose computing on GPUs). In the last few years the performance of GPU keeps on improving rapidly. GPU computing gives us the high-performance computing environment at a lower cost than before. Moreover, use of GPU has an advantage of visualization of processed data, because GPU is originally architecture for graphics processing. In the GPU computing, the processed data is always stored in the video memory. Therefore, we can directly write drawing information to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer: It enables the high-speed processing of seismic data. The present study examines the GPU computing-based high-speed visualization and the feasibility for high-speed visualization system of hypocenter data.
Efficient simulation of diffusion-based choice RT models on CPU and GPU.
Verdonck, Stijn; Meers, Kristof; Tuerlinckx, Francis
2016-03-01
In this paper, we present software for the efficient simulation of a broad class of linear and nonlinear diffusion models for choice RT, using either CPU or graphical processing unit (GPU) technology. The software is readily accessible from the popular scripting languages MATLAB and R (both 64-bit). The speed obtained on a single high-end GPU is comparable to that of a small CPU cluster, bringing standard statistical inference of complex diffusion models to the desktop platform.
2010-01-01
Background Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a starting point for modelers to
Three Dimensional TEM Forward Modeling Using FDTD Accelerated by GPU
NASA Astrophysics Data System (ADS)
Li, Z.; Huang, Q.
2015-12-01
Three dimensional inversion of transient electromagnetic (TEM) data is still challenging. The inversion speed mostly depends on the forward modeling. Finite-difference time-domain (FDTD) method is one of the popular forward modeling scheme. In an explicit type, which is based on the Du Fort-Frankel scheme, the time step is under the constraint of quasi-static approximation. Often an upward-continuation boundary condition (UCBC) is applied on the earth-air surface to avoid time stepping in the model air. However, UCBC is not suitable for models with topography and has a low parallel efficiency. Modeling without UCBC may cause a much smaller time step because of the resistive attribute of the air and the quasi-static constraint, which may also low the efficiency greatly. Our recent research shows that the time step in the model air is not needed to be constrained by the quasi-static approximation, which can let the time step without UCBC much closer to that with UCBC. The parallel performance of FDTD is then largely released. On a computer with a 4-core CPU, this newly developed method is obviously faster than the method using UCBC. Besides, without UCBC, this method can be easily accelerated by Graphics Processing Unit (GPU). On a computer with a CPU of 4790k@4.4GHz and a GPU of GTX 970, the speed accelerated by CUDA is almost 10 times of that using CPU only. For a model with a grid size of 140×140×130, if the conductivity of the model earth is 0.02S/m, and the minimal space interval is 15m, it takes only 80 seconds to evolve the field from excitation to 0.032s.
GPU-based parallel clustered differential pulse code modulation
NASA Astrophysics Data System (ADS)
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, atmospheric and environmental remote sensing. Owing to the rapid development of hyperspectral remote sensing technology, resolution of hyperspectral image has got a huge boost. Thus data size of hyperspectral image is becoming larger. In order to reduce their saving and transmission cost, lossless compression for hyperspectral image has become an important research topic. In recent years, large numbers of algorithms have been proposed to reduce the redundancy between different spectra. Among of them, the most classical and expansible algorithm is the Clustered Differential Pulse Code Modulation (CDPCM) algorithm. This algorithm contains three parts: first clusters all spectral lines, then trains linear predictors for each band. Secondly, use these predictors to predict pixels, and get the residual image by subtraction between original image and predicted image. Finally, encode the residual image. However, the process of calculating predictors is timecosting. In order to improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) with GPU. Recently, general-purpose computing based on GPUs has been greatly developed. The capacity of GPU improves rapidly by increasing the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA. It gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to achieve the calculation of predictors in parallel. By respectively adopting global memory, shared memory and register memory, we finally get a decent speedup.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses.
High energy electromagnetic particle transportation on the GPU
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-01-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
High energy electromagnetic particle transportation on the GPU
NASA Astrophysics Data System (ADS)
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-06-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.
Suchard, Marc A.; Wang, Quanli; Chan, Cliburn; Frelinger, Jacob; Cron, Andrew; West, Mike
2010-01-01
This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies generating many, very large datasets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. PMID:20877443
GPU-based prompt gamma ray imaging from boron neutron capture therapy
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum
2015-01-15
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
Fast distributed large-pixel-count hologram computation using a GPU cluster.
Pan, Yuechao; Xu, Xuewu; Liang, Xinan
2013-09-10
Large-pixel-count holograms are one essential part for big size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.
Real-time GPU surface curvature estimation on deforming meshes and volumetric data sets.
Griffin, Wesley; Wang, Yu; Berrios, David; Olano, Marc
2012-10-01
Surface curvature is used in a number of areas in computer graphics, including texture synthesis and shape representation, mesh simplification, surface modeling, and nonphotorealistic line drawing. Most real-time applications must estimate curvature on a triangular mesh. This estimation has been limited to CPU algorithms, forcing object geometry to reside in main memory. However, as more computational work is done directly on the GPU, it is increasingly common for object geometry to exist only in GPU memory. Examples include vertex skinned animations and isosurfaces from GPU-based surface reconstruction algorithms. For static models, curvature can be precomputed and CPU algorithms are a reasonable choice. For deforming models where the geometry only resides on the GPU, transferring the deformed mesh back to the CPU limits performance. We introduce a GPU algorithm for estimating curvature in real time on arbitrary triangular meshes. We demonstrate our algorithm with curvature-based NPR feature lines and a curvature-based approximation for an ambient occlusion. We show curvature computation on volumetric data sets with a GPU isosurface extraction algorithm and vertex-skinned animations. We present a graphics pipeline and CUDA implementation. Our curvature estimation is up to ~18x faster than a multithreaded CPU benchmark.
GMC: a GPU implementation of a Monte Carlo dose calculation based on Geant4.
Jahnke, Lennart; Fleckenstein, Jens; Wenz, Frederik; Hesser, Jürgen
2012-03-07
We present a GPU implementation called GMC (GPU Monte Carlo) of the low energy (<100 GeV) electromagnetic part of the Geant4 Monte Carlo code using the NVIDIA® CUDA programming interface. The classes for electron and photon interactions as well as a new parallel particle transport engine were implemented. The way a particle is processed is not in a history by history manner but rather by an interaction by interaction method. Every history is divided into steps that are then calculated in parallel by different kernels. The geometry package is currently limited to voxelized geometries. A modified parallel Mersenne twister was used to generate random numbers and a random number repetition method on the GPU was introduced. All phantom results showed a very good agreement between GPU and CPU simulation with gamma indices of >97.5% for a 2%/2 mm gamma criteria. The mean acceleration on one GTX 580 for all cases compared to Geant4 on one CPU core was 4860. The mean number of histories per millisecond on the GPU for all cases was 658 leading to a total simulation time for one intensity-modulated radiation therapy dose distribution of 349 s. In conclusion, Geant4-based Monte Carlo dose calculations were significantly accelerated on the GPU.
A new embedded solution of hyperspectral data processing platform: the embedded GPU computer
NASA Astrophysics Data System (ADS)
Zhang, Lei; Gao, Jiao Bo; Hu, Yu; Sun, Ke Feng; Wang, Ying Hui; Cheng, Juan; Sun, Dan Dan; Li, Yu
2016-10-01
During the research of hyper-spectral imaging spectrometer, how to process the huge amount of image data is a difficult problem for all researchers. The amount of image data is about the order of magnitude of several hundred megabytes per second. Traditional solution of the embedded hyper-spectral data processing platform such as DSP and FPGA has its own drawback. With the development of GPU, parallel computing on GPU is increasingly applied in large-scale data processing. In this paper, we propose a new embedded solution of hyper-spectral data processing platform which is based on the embedded GPU computer. We also give a detailed discussion of how to acquire and process hyper-spectral data in embedded GPU computer. We use C++ AMP technology to control GPU and schedule the parallel computing. Experimental results show that the speed of hyper-spectral data processing on embedded GPU computer is apparently faster than ordinary computer. Our research has significant meaning for the engineering application of hyper-spectral imaging spectrometer.
GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows
NASA Astrophysics Data System (ADS)
Sunarso, Alfeus; Tsuji, Tomohiro; Chono, Shigeomi
2010-08-01
We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under the application of an electric field is presented. The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists was proposed. Some important issues in the implementation of computations that involve a large number of arithmetic operations and data on the GPU that has limited high-speed memory resources are addressed extensively. Despite the relatively low GPU occupancy in the calculation of intermolecular force and torque, the computation on a recent GPU is about 50 times faster than that on a single core of a recent CPU, thus simulations involving a large number of molecules using a personal computer are possible. The GPU-based simulation should allow an extensive investigation of the molecular-level mechanisms underlying various macroscopic flow phenomena in fluids with anisotropic molecules.
Understanding GPU Power. A Survey of Profiling, Modeling, and Simulation Methods
Bridges, Robert A.; Imam, Neena; Mintz, Tiffany M.
2016-09-01
Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high throughput applications.Though GPUs consume large amounts of power, their use for high throughput applications facilitate state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. Our work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. Moreover, as direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to powermore » use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling is discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Lastly, possible directions for future research are discussed.« less
Understanding GPU Power. A Survey of Profiling, Modeling, and Simulation Methods
Bridges, Robert A.; Imam, Neena; Mintz, Tiffany M.
2016-09-01
Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high throughput applications.Though GPUs consume large amounts of power, their use for high throughput applications facilitate state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. Our work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. Moreover, as direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling is discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Lastly, possible directions for future research are discussed.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
NASA Astrophysics Data System (ADS)
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image hardly with any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and GPU execution model of single instruction and multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing of GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen E-mail: Xun.Jia@UTSouthwestern.edu Folkerts, Michael; Tan, Jun; Jia, Xun E-mail: Xun.Jia@UTSouthwestern.edu Jiang, Steve B. E-mail: Xun.Jia@UTSouthwestern.edu; Peng, Fei
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
NASA Astrophysics Data System (ADS)
Su, Lin; Du, Xining; Liu, Tianyu; Xu, X. George
2014-06-01
An electron-photon coupled Monte Carlo code ARCHER -
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems
Xiao, K; Chen, D. Z; Hu, X. S; Zhou, B
2014-06-01
Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing-environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Beside speed our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculations of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni
2011-05-04
High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
Fast GPU-based computation of spatial multigrid multiframe LMEM for PET.
Nassiri, Moulay Ali; Carrier, Jean-François; Després, Philippe
2015-09-01
Significant efforts were invested during the last decade to accelerate PET list-mode reconstructions, notably with GPU devices. However, the computation time per event is still relatively long, and the list-mode efficiency on the GPU is well below the histogram-mode efficiency. Since list-mode data are not arranged in any regular pattern, costly accesses to the GPU global memory can hardly be optimized and geometrical symmetries cannot be used. To overcome obstacles that limit the acceleration of reconstruction from list-mode on the GPU, a multigrid and multiframe approach of an expectation-maximization algorithm was developed. The reconstruction process is started during data acquisition, and calculations are executed concurrently on the GPU and the CPU, while the system matrix is computed on-the-fly. A new convergence criterion also was introduced, which is computationally more efficient on the GPU. The implementation was tested on a Tesla C2050 GPU device for a Gemini GXL PET system geometry. The results show that the proposed algorithm (multigrid and multiframe list-mode expectation-maximization, MGMF-LMEM) converges to the same solution as the LMEM algorithm more than three times faster. The execution time of the MGMF-LMEM algorithm was 1.1 s per million of events on the Tesla C2050 hardware used, for a reconstructed space of 188 x 188 x 57 voxels of 2 x 2 x 3.15 mm3. For 17- and 22-mm simulated hot lesions, the MGMF-LMEM algorithm led on the first iteration to contrast recovery coefficients (CRC) of more than 75 % of the maximum CRC while achieving a minimum in the relative mean square error. Therefore, the MGMF-LMEM algorithm can be used as a one-pass method to perform real-time reconstructions for low-count acquisitions, as in list-mode gated studies. The computation time for one iteration and 60 millions of events was approximately 66 s.
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming However, a real time GPU-based beamformer requires high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of its original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables in parallel compression and decompression of data. By this means, the data transfer requirement of hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigate the transmission bandwidth requirement to transfer data from hardware front end to software system but also reduce the transmission time for CPU to GPU data transfer.
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
NASA Astrophysics Data System (ADS)
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more notorious at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16-cores. Multi-GPU performance is highly convenient under a weak scaling setting, reaching up to 99 % efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32 , 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
Arafat, Humayun; Dinan, James; Krishnamoorthy, Sriram; Balaji, Pavan; Sadayappan, P.
2016-01-06
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
GPU implementation issues for fast unmixing of hyperspectral images
NASA Astrophysics Data System (ADS)
Legendre, Maxime; Capriotti, Luca; Schmidt, Frédéric; Moussaoui, Saïd; Schmidt, Albrecht
2013-04-01
Space missions usually use hyperspectral imaging techniques to analyse the composition of planetary surfaces. Missions such as ESA's Mars Express and Venus Express generate extensive datasets whose processing demands so far have exceeded the resources available to many researchers. To overcome this limitation, the challenge is to develop numerical methods allowing to exploit the potential of modern calculation tools. The processing of a hyperspectral image consists of the identification of the observed surface components and eventually the assessment of their fractional abundances inside each pixel area. In this latter case, the problem is referred to as spectral unmixing. This work focuses on a supervised unmixing approach where the relevant component spectra are supposed to be part of an available spectral library. Therefore, the question addressed here is reduced to the estimation of the fractional abundances, or abundance maps. It requires the solution of a large-scale optimization problem subject to linear constraints; positivity of the abundances and their partial/full additivity (sum less/equal to one). Conventional approaches to such a problem usually suffer from a high computational overhead. Recently, an interior-point optimization using a primal-dual approach has been proven an efficient method to solve this spectral unmixing problem at reduced computational cost. This is achieved with a parallel implementation based on Graphics Processing Units (GPUs). Several issues are discussed such as the data organization in memory and the strategy used to compute efficiently one global quantity from a large dataset in a parallel fashion. Every step of the algorithm is optimized to be GPU-efficient. Finally, the main steps of the global system for the processing of a large number of hyperspectral images are discussed. The advantage of using a GPU is demonstrated by unmixing a large dataset consisting of 1300 hyperspectral images from Mars Express' OMEGA instrument
Compilação de dados atômicos e moleculares do UV ao IV próximo para uso em síntese espectral
NASA Astrophysics Data System (ADS)
Coelho, P.; Barbuy, B.; Melendez, J.; Allen, D. M.; Castilho, B.
2003-08-01
Espectros sintéticos são utéis em uma grande variedade de aplicações, desde análise de abundâncias em espectros estelares de alta resolução ao estudo de populações estelares em espectros integrados. A confiabilidade de um espectro sintético depende do modelo de atmosfera adotado, do código de formação de linhas e da qualidade dos dados atômicos e moleculares que são determinantes no cálculo das opacidades da fotosfera. O nosso grupo no departamento de Astronomia no IAG tem utilizado espectros sintéticos há mais de 15 anos, em aplicações voltadas principalmente para a análise de abundâncias de estrelas G, K e M e populações estelares velhas. Ao longo desse tempo, as listas de linhas vieram sendo construídas e atualizadas continuamente, e alguns acréscimos recentes podem ser citados: Castilho (1999, átomos e moléculas no UV), Schiavon (1998, bandas moleculares de TiO) e Melendez (2001, átomos e moléculas no IV próximo). Com o intuito de calcular uma grade de espectros do UV ao IV próximo para uso no estudo de populações estelares velhas, se fazia necessário compilar e homogeneizar as diversas listas em apenas uma lista atômica e uma molecular. Nesse processo, a nova lista compilada foi correlacionada com outras bases de dados (NIST, Kurucz Database, O' Brian et al. 1991) para atualização dos parâmetros que caracterizam a transição atômica (comprimento de onda, log gf e potencial de excitação). Adicionalmente as constantes de interação C6 foram calculadas segundo a teoria de Anstee & O'Mara (1995) e artigos posteriores. As bandas moleculares de CH e CN foram recalculadas com o programa LIFBASE (Luque & Crosley 1999). Nesse poster estão detalhados os procedimentos citados acima, as comparações entre espectros calculados com as novas listas e espectros observados em alta resolução do Sol e de Arcturus, e uma análise do impacto decorrente da utilização de diferentes modelos de atmosfera no espectro sintético. Ao
Fast 3D elastic micro-seismic source location using new GPU features
NASA Astrophysics Data System (ADS)
Xue, Qingfeng; Wang, Yibo; Chang, Xu
2016-12-01
In this paper, we describe new GPU features and their applications in passive seismic - micro-seismic location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. Different from the traditional ray-based methods, the wave equation method, such as the method we use in our paper, has a remarkable advantage in adapting to low signal-to-noise ratio conditions and does not need a person to select the data. However, because it has a conspicuous deficiency due to its computation cost, these methods are not widely used in industrial fields. To make the method useful, we implement imaging-like wave equation micro-seismic location in a 3D elastic media and use GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a more than 30% performance improvement in GPU implementation just by using these new features.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu
2011-07-01
Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great faculty in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S{sub n}) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations are reported with the simulations of vacuum boundary condition. The discussion of the relative advantages and disadvantages of the GPU implementation, the simulation on multi GPUs, the programming effort and code portability are also reported. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection
Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms. PMID:26437335
A GPU-Accelerated Approach for Feature Tracking in Time-Varying Imagery Datasets.
Peng, Chao; Sahani, Saindip; Rushing, John
2016-12-09
We propose a novel parallel Connected Component Labeling (CCL) algorithm along with efficient out-of-core data management to detect and track feature regions of large time-varying imagery datasets. Our approach contributes to big data field with parallel algorithms tailored for GPU architectures. We remove the data dependency between frames and achieve pixel-level parallelism. Due to the large size, the entire dataset cannot fit into cached memory. Frames have to be streamed through the memory hierarchy (disk to CPU main memory and then to GPU memory), partitioned and processed as batches, where each batch is small enough to fit into the GPU. To reconnect separated feature regions caused by data partitioning, we present a novel batch merging algorithm that extracts the connection information of feature regions in multiple batches in a parallel fashion. The information is organized in a memory-efficient structure and supports fast indexing on the GPU. For the experiment we use a real-world weather dataset. Using a commodity workstation equipped with a single GPU, our approach can process terabytes of time-varying imagery data. The advantages of our approach are demonstrated by comparing to the performance of a CPU cluster implementation that is being used by weather scientists.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
Geant4-based Monte Carlo simulations on GPU for medical applications.
Bert, Julien; Perez-Ponce, Hector; El Bitar, Ziad; Jan, Sébastien; Boursier, Yannick; Vintache, Damien; Bonissent, Alain; Morel, Christian; Brasse, David; Visvikis, Dimitris
2013-08-21
Monte Carlo simulation (MCS) plays a key role in medical applications, especially for emission tomography and radiotherapy. However MCS is also associated with long calculation times that prevent its use in routine clinical practice. Recently, graphics processing units (GPU) became in many domains a low cost alternative for the acquisition of high computational power. The objective of this work was to develop an efficient framework for the implementation of MCS on GPU architectures. Geant4 was chosen as the MCS engine given the large variety of physics processes available for targeting different medical imaging and radiotherapy applications. In addition, Geant4 is the MCS engine behind GATE which is actually the most popular medical applications' simulation platform. We propose the definition of a global strategy and associated structures for such a GPU based simulation implementation. Different photon and electron physics effects are resolved on the fly directly on GPU without any approximations with respect to Geant4. Validations have shown equivalence in the underlying photon and electron physics processes between the Geant4 and the GPU codes with a speedup factor of 80-90. More clinically realistic simulations in emission and transmission imaging led to acceleration factors of 400-800 respectively compared to corresponding GATE simulations.
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
NASA Astrophysics Data System (ADS)
Kargaran, Hamed; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
2016-04-01
The implementation of Monte Carlo simulation on the CUDA Fortran requires a fast random number generation with good statistical properties on GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) have been proposed to use in high performance computing systems. According to the type of GPU memory usage, GPU scheme is divided into two work modes including GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, the combination of middle-square method and chaotic map along with the Xorshift PRNG have been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and Miller-Park algorithm through employing the specific standard tests. The results of this comparison showed that the developed GPPRNG in this study can be used as a fast and accurate tool for computational science applications.
Comparison of CPU and GPU based coding on low-complexity algorithms for display signals
NASA Astrophysics Data System (ADS)
Richter, Thomas; Simon, Sven
2013-09-01
Graphics Processing Units (GPUs) are freely programmable massively parallel general purpose processing units and thus offer the opportunity to off-load heavy computations from the CPU to the GPU. One application for GPU programming is image compression, where the massively parallel nature of GPUs promises high speed benefits. This article analyzes the predicaments of data-parallel image coding on the example of two high-throughput coding algorithms. The codecs discussed here were designed to answer a call from the Video Electronics Standards Association (VESA), and require only minimal buffering at encoder and decoder side while avoiding any pixel-based feedback loops limiting the operating frequency of hardware implementations. Comparing CPU and GPU implementations of the codes show that GPU based codes are usually not considerably faster, or perform only with less than ideal rate-distortion performance. Analyzing the details of this result provides theoretical evidence that for any coding engine either parts of the entropy coding and bit-stream build-up must remain serial, or rate-distortion penalties must be paid when offloading all computations on the GPU.
NASA Astrophysics Data System (ADS)
Lokavarapu, H. V.; Matsui, H.
2015-12-01
Convection and magnetic field of the Earth's outer core are expected to have vast length scales. To resolve these flows, high performance computing is required for geodynamo simulations using spherical harmonics transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters capable of computing at the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To further optimize, we investigate three different algorithms of the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU simultaneously. In the third approach , we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. There after, the partitioned works are simultaneously computed in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
GPU-based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.
Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd
2016-11-07
In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92% (CPU) to 96% (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.
GPU-accelerated Tersoff potentials for massively parallel Molecular Dynamics simulations
NASA Astrophysics Data System (ADS)
Nguyen, Trung Dac
2017-03-01
The Tersoff potential is one of the empirical many-body potentials that has been widely used in simulation studies at atomic scales. Unlike pair-wise potentials, the Tersoff potential involves three-body terms, which require much more arithmetic operations and data dependency. In this contribution, we have implemented the GPU-accelerated version of several variants of the Tersoff potential for LAMMPS, an open-source massively parallel Molecular Dynamics code. Compared to the existing MPI implementation in LAMMPS, the GPU implementation exhibits a better scalability and offers a speedup of 2.2X when run on 1000 compute nodes on the Titan supercomputer. On a single node, the speedup ranges from 2.0 to 8.0 times, depending on the number of atoms per GPU and hardware configurations. The most notable features of our GPU-accelerated version include its design for MPI/accelerator heterogeneous parallelism, its compatibility with other functionalities in LAMMPS, its ability to give deterministic results and to support both NVIDIA CUDA- and OpenCL-enabled accelerators. Our implementation is now part of the GPU package in LAMMPS and accessible for public use.
Modern Methods of Bundle Adjustment on the Gpu
NASA Astrophysics Data System (ADS)
Hänsch, R.; Drude, I.; Hellwich, O.
2016-06-01
The task to compute 3D reconstructions from large amounts of data has become an active field of research within the last years. Based on an initial estimate provided by structure from motion, bundle adjustment seeks to find a solution that is optimal for all cameras and 3D points. The corresponding nonlinear optimization problem is usually solved by the Levenberg-Marquardt algorithm combined with conjugate gradient descent. While many adaptations and extensions to the classical bundle adjustment approach have been proposed, only few works consider the acceleration potentials of GPU systems. This paper elaborates the possibilities of time and space savings when fitting the implementation strategy to the terms and requirements of realizing a bundler on heterogeneous CPUGPU systems. Instead of focusing on the standard approach of Levenberg-Marquardt optimization alone, nonlinear conjugate gradient descent and alternating resection-intersection are studied as two alternatives. The experiments show that in particular alternating resection-intersection reaches low error rates very fast, but converges to larger error rates than Levenberg-Marquardt. PBA, as one of the current state-of-the-art bundlers, converges slower in 50 % of the test cases and needs 1.5-2 times more memory than the Levenberg- Marquardt implementation.
GPU Acceleration of Particle-In-Cell Methods
NASA Astrophysics Data System (ADS)
Cowan, Benjamin; Cary, John; Sides, Scott
2016-10-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA Contract No. W31P4Q-16-C-0009.
GPU acceleration of particle-in-cell methods
NASA Astrophysics Data System (ADS)
Cowan, Benjamin; Cary, John; Meiser, Dominic
2015-11-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR).
GPU-enabled projectile guidance for impact area constraints
NASA Astrophysics Data System (ADS)
Rogers, Jonathan
2013-05-01
Guided projectile engagement scenarios often involve impact area constraints, in which it may be less desirable to incur miss distance on one side of a target or within a specified boundary near the target area. Current projectile guidance schemes such as impact point predictors cannot handle these constraints within the guidance loop, and may produce dispersion patterns that are insensitive to these constraints. In this paper, a new projectile guidance law is proposed that leverages real-time Monte Carlo impact point prediction to continually evaluate the probability of violating impact area constraints. The desired aim point is then adjusted accordingly. Real-time Monte Carlo simulation is enabled within the feedback loop through use of graphics processing units (GPU's), which provide parallel pipelines through which a dispersion pattern can routinely be predicted. The result is a guidance law that can achieve minimum miss distance while avoiding impact area constraints. The new guidance law is described and formulated as a nonlinear optimization problem which is solved in real-time through massively-parallel Monte Carlo simulation. An example simulation is shown in which impact area constraints are enforced and the methodology of stochastic guidance is demonstrated. Finally, Monte Carlo simulations are shown which demonstrate the ability of the stochastic guidance scheme to avoid an arbitrary set of impact area constraints, generating an impact probability density function that optimally trades miss distance within the restricted impact area. The proposed guidance scheme has applications beyond smart weapons to include missiles, UAV's, and other autonomous systems.
GPU-enabled molecular dynamics simulations of ankyrin kinase complex
NASA Astrophysics Data System (ADS)
Gautam, Vertika; Chong, Wei Lim; Wisitponchai, Tanchanok; Nimmanpipug, Piyarat; Zain, Sharifuddin M.; Rahman, Noorsaadah Abd.; Tayapiwatana, Chatchai; Lee, Vannajan Sanghiran
2014-10-01
The ankyrin repeat (AR) protein can be used as a versatile scaffold for protein-protein interactions. It has been found that the heterotrimeric complex between integrin-linked kinase (ILK), PINCH, and parvin is an essential signaling platform, serving as a convergence point for integrin and growth-factor signaling and regulating cell adhesion, spreading, and migration. Using ILK-AR with high affinity for the PINCH1 as our model system, we explored a structure-based computational protocol to probe and characterize binding affinity hot spots at protein-protein interfaces. In this study, the long time scale dynamics simulations with GPU accelerated molecular dynamics (MD) simulations in AMBER12 have been performed to locate the hot spots of protein-protein interaction by the analysis of the Molecular Mechanics-Poisson-Boltzmann Surface Area/Generalized Born Solvent Area (MM-PBSA/GBSA) of the MD trajectories. Our calculations suggest good binding affinity of the complex and also the residues critical in the binding.
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.; Vishwanath, Venkatram; Kumaran, Kalyan
2015-01-01
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
Toward Fast Computation of Dense Image Correspondence on the GPU
Duchaineau, M; Cohen, J; Vaidya, S
2007-08-13
Large-scale video processing systems are needed to support human analysis of massive collections of image streams. Video, from both current small-format and future large-format camera systems, constitutes the single largest data source of the near future, dwarfing the output of all other data sources combined. A critical component to further advances in the processing and analysis of such video streams is the ability to register successive video frames into a common coordinate system at the pixel level. This capability enables further downstream processing, such as background/mover segmentation, 3D model extraction, and compression. We present here our recent work on computing these correspondences. We employ coarse-to-fine hierarchical approach, matching pixels from the domain of a source image to the domain of a target image at successively higher resolutions. Our diamond-style image hierarchy, with total pixel counts increasing by only a factor of two at each level, improves the prediction quality as we advance from level to level, and reduces potential grid artifacts in the results. We demonstrate the quality our approach on real aerial city imagery. We find that registration accuracy is generally on the order of one quarter of a pixel. We also benchmark the fundamental processing kernels on the GPU to show the promise of the approach for real-time video processing applications.
Prefiltering Model for Homology Detection Algorithms on GPU.
Retamosa, Germán; de Pedro, Luis; González, Ivan; Tamames, Javier
2016-01-01
Homology detection has evolved over the time from heavy algorithms based on dynamic programming approaches to lightweight alternatives based on different heuristic models. However, the main problem with these algorithms is that they use complex statistical models, which makes it difficult to achieve a relevant speedup and find exact matches with the original results. Thus, their acceleration is essential. The aim of this article was to prefilter a sequence database. To make this work, we have implemented a groundbreaking heuristic model based on NVIDIA's graphics processing units (GPUs) and multicore processors. Depending on the sensitivity settings, this makes it possible to quickly reduce the sequence database by factors between 50% and 95%, while rejecting no significant sequences. Furthermore, this prefiltering application can be used together with multiple homology detection algorithms as a part of a next-generation sequencing system. Extensive performance and accuracy tests have been carried out in the Spanish National Centre for Biotechnology (NCB). The results show that GPU hardware can accelerate the execution times of former homology detection applications, such as National Centre for Biotechnology Information (NCBI), Basic Local Alignment Search Tool for Proteins (BLASTP), up to a factor of 4.
GPU surface extraction using the closest point embedding
NASA Astrophysics Data System (ADS)
Kim, Mark; Hansen, Charles
2015-01-01
Isosurface extraction is a fundamental technique used for both surface reconstruction and mesh generation. One method to extract well-formed isosurfaces is a particle system; unfortunately, particle systems can be slow. In this paper, we introduce an enhanced parallel particle system that uses the closest point embedding as the surface representation to speedup the particle system for isosurface extraction. The closest point embedding is used in the Closest Point Method (CPM), a technique that uses a standard three dimensional numerical PDE solver on two dimensional embedded surfaces. To fully take advantage of the closest point embedding, it is coupled with a Barnes-Hut tree code on the GPU. This new technique produces well-formed, conformal unstructured triangular and tetrahedral meshes from labeled multi-material volume datasets. Further, this new parallel implementation of the particle system is faster than any known methods for conformal multi-material mesh extraction. The resulting speed-ups gained in this implementation can reduce the time from labeled data to mesh from hours to minutes and benefits users, such as bioengineers, who employ triangular and tetrahedral meshes
Probability Density Function Analysis of Turbulent Condensation Using GPU Hardware
NASA Astrophysics Data System (ADS)
Keedy, Ryan; Riley, James; Aliseda, Alberto
2014-11-01
Growth of liquid droplets by condensation is an important phenomenon in many environmental and industrial applications. In a homogenous, supersaturated environment, condensation will tend to narrow the diameter distribution of a poly-disperse collection of droplets. However, free shear turbulence can broaden the diameter distribution due to intermittency in the mixing and by subjecting droplets to non-Gaussian supersaturation statistics. In order to understand the condensation behavior of water droplets in a turbulent flow, it is necessary to understand the dispersion of the droplets and transported scalars. We describe a hybrid approach for predicting droplet growth and dispersion in a turbulent mixing layer and compare our computational predictions to experimental data. The approach utilizes a finite-volume code to calculate the fluid velocity field and a particle-mesh Monte Carlo method to track the locations and thermodynamics of the large number of stochastic particles throughout the domain required to resolve the Probability Density Function of the water vapor and droplets. The particle tracking algorithm is designed to take advantage of the computational power of a large number of GPU cores, with significant speed-up when compared against a baseline CPU configuration.
GPU-enabled Computational Model of Electrochemical Energy Storage Systems
NASA Astrophysics Data System (ADS)
Andersen, Charles; Qiu, Gang; Kandasamy, Nagarajan; Sun, Ying
2013-11-01
We present a computational model of a Redox Flow Battery (RFB), which uses real pore-scale fiber geometry obtained through X-ray computed tomography (XCT). Our pore-scale approach is in contrast to the more common volume-averaged model, which considers the domain as a homogenous medium of uniform porosity. We apply a finite volume method to solve the coupled species and charge transport equations. The flow field in our system is evaluated using the Lattice Boltzmann method (LBM). To resolve the governing equations at the pore-scale of carbon fibers, which are on the order of tens of microns, is a highly computationally expensive task. To overcome this challenge, in lieu of traditional implementation with Message Passing Interface (MPI), we employ the use of Graphics Processing Units (GPUs) as a means of parallelization. The Butler-Volmer equation provides a coupling between the species and charge equations on the fiber surface. Scalability of the GPU implementation is examined along with the effects of fiber geometry, porosity, and flow rate on battery performance.
Immersed boundary method implemented in lattice Boltzmann GPU code
NASA Astrophysics Data System (ADS)
Devincentis, Brian; Smith, Kevin; Thomas, John
2015-11-01
Lattice Boltzmann is well suited to efficiently utilize the rapidly increasing compute power of GPUs to simulate viscous incompressible flows. Fluid-structure interaction with solids of arbitrarily complex geometry can be modeled in this framework with the immersed boundary method (IBM). In IBM a solid is modeled by its surface which applies a force at the neighboring lattice points. The majority of published IBMs require solving a linear system in order to satisfy the no-slip condition. However, the method presented by Wang et al. (2014) is unique in that it produces equally accurate results without solving a linear system. Furthermore, the algorithm can be applied in a parallel manner over the immersed boundary and is, therefore, well suited for GPUs. Here, a 2D and 3D version of their algorithm is implemented in Sailfish CFD, a GPU-based open source lattice Boltzmann code. One issue unaddressed by most published work is how to correct force and torque calculated from IBM for translating and rotating solids. These corrections are necessary because the fluid inside the solid affects its inertia in a non-trivial manner. Therefore, this implementation uses the Lagrangian points approximation correction shown by Suzuki and Inamuro (2011) to be accurate.
Accelerated finite element elastodynamic simulations using the GPU
Huthwaite, Peter
2014-01-15
An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source.
GPU-based flow simulation with detailed chemical kinetics
NASA Astrophysics Data System (ADS)
Le, Hai P.; Cambier, Jean-Luc; Cole, Lord K.
2013-03-01
The current paper reports on the implementation of a numerical solver on the Graphic Processing Units (GPUs) to model reactive gas mixtures with detailed chemical kinetics. The solver incorporates high-order finite volume methods for solving the fluid dynamical equations coupled with stiff source terms. The chemical kinetics are solved implicitly via an operator-splitting method. We explored different approaches in implementing a fast kinetics solver on the GPU. The detail of the implementation is discussed in the paper. The solver is tested with two high-order shock capturing schemes: MP5 (Suresh and Huynh, 1997) [9] and ADERWENO (Titarev and Toro, 2005) [10]. Considering only the fluid dynamics calculation, the speed-up factors obtained are 30 for the MP5 scheme and 55 for ADERWENO scheme. For the fully-coupled solver, the performance gain depended on the size of the reaction mechanism. Two different examples of chemistry were explored. The first mechanism consisted of 9 species and 38 reactions, resulting in a speed-up factor up to 35. The second, larger mechanism, consisted of 36 species and 308 reactions, resulting in a speed-up factor of up to 40.
FARGO3D: A NEW GPU-ORIENTED MHD CODE
Benitez-Llambay, Pablo; Masset, Frédéric S. E-mail: masset@icf.unam.mx
2016-03-15
We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet–disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.
GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data.
Pang, Shuai; Stones, Rebecca J; Ren, Ming-Ming; Liu, Xiao-Guang; Wang, Gang; Xia, Hong-ju; Wu, Hao-Yang; Liu, Yang; Xie, Qiang
2015-09-01
We present a modified GPU (graphics processing unit) version of MrBayes, called ta(MC)(3) (GPU MrBayes V3.1), for Bayesian phylogenetic inference on protein data sets. Our main contributions are 1) utilizing 64-bit variables, thereby enabling ta(MC)(3) to process larger data sets than MrBayes; and 2) to use Kahan summation to improve accuracy, convergence rates, and consequently runtime. Versus the current fastest software, we achieve a speedup of up to around 2.5 (and up to around 90 vs. serial MrBayes), and more on multi-GPU hardware. GPU MrBayes V3.1 is available from http://sourceforge.net/projects/mrbayes-gpu/.
4D MR phase and magnitude segmentations with GPU parallel computing.
Bergen, Robert V; Lin, Hung-Yu; Alexander, Murray E; Bidinosti, Christopher P
2015-01-01
The increasing size and number of data sets of large four dimensional (three spatial, one temporal) magnetic resonance (MR) cardiac images necessitates efficient segmentation algorithms. Analysis of phase-contrast MR images yields cardiac flow information which can be manipulated to produce accurate segmentations of the aorta. Phase contrast segmentation algorithms are proposed that use simple mean-based calculations and least mean squared curve fitting techniques. The initial segmentations are generated on a multi-threaded central processing unit (CPU) in 10 seconds or less, though the computational simplicity of the algorithms results in a loss of accuracy. A more complex graphics processing unit (GPU)-based algorithm fits flow data to Gaussian waveforms, and produces an initial segmentation in 0.5 seconds. Level sets are then applied to a magnitude image, where the initial conditions are given by the previous CPU and GPU algorithms. A comparison of results shows that the GPU algorithm appears to produce the most accurate segmentation.
de Paula, Lauro C. M.; Soares, Anderson S.; de Lima, Telma W.; Delbem, Alexandre C. B.; Coelho, Clarimar J.; Filho, Arlindo R. G.
2014-01-01
Several variable selection algorithms in multivariate calibration can be accelerated using Graphics Processing Units (GPU). Among these algorithms, the Firefly Algorithm (FA) is a recent proposed metaheuristic that may be used for variable selection. This paper presents a GPU-based FA (FA-MLR) with multiobjective formulation for variable selection in multivariate calibration problems and compares it with some traditional sequential algorithms in the literature. The advantage of the proposed implementation is demonstrated in an example involving a relatively large number of variables. The results showed that the FA-MLR, in comparison with the traditional algorithms is a more suitable choice and a relevant contribution for the variable selection problem. Additionally, the results also demonstrated that the FA-MLR performed in a GPU can be five times faster than its sequential implementation. PMID:25493625
A survey of techniques for architecting and managing GPU register file
Mittal, Sparsh
2016-04-07
To support their massively-multithreaded architecture, GPUs use very large register file (RF) which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs use tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of RF in determining GPU performance, novel and intelligent techniques are required for managing GPU RF. In this paper, we survey the techniques for designing and managing GPU RF. We discuss techniques related to performance, energy and reliability aspects of RF. To emphasize the similarities and differences between the techniques, we classify themmore » along several parameters. Lastly, the aim of this paper is to synthesize the state-of-art developments in RF management and also stimulate further research in this area.« less
A survey of techniques for architecting and managing GPU register file
Mittal, Sparsh
2016-04-07
To support their massively-multithreaded architecture, GPUs use very large register file (RF) which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs use tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of RF in determining GPU performance, novel and intelligent techniques are required for managing GPU RF. In this paper, we survey the techniques for designing and managing GPU RF. We discuss techniques related to performance, energy and reliability aspects of RF. To emphasize the similarities and differences between the techniques, we classify them along several parameters. Lastly, the aim of this paper is to synthesize the state-of-art developments in RF management and also stimulate further research in this area.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
NASA Astrophysics Data System (ADS)
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness is description of the surface micro topography of randomness or irregular. The standard deviation of surface height and the surface correlation length describe the statistical variation for the random component of a surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problem can be effectively solved using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to the data collected by the surface roughness tester based on the laser triangulation principle during the field experiment in April 2012. The total number of data points was 52,040. It took 47 seconds on a Fermi GTX 590 GPU whereas its serial CPU version took 5422 seconds, leading to a significant 115x speedup.
Gallarno, George; Rogers, James H; Maxwell, Don E
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world s second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simu- lations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercom- puter as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
GPU-based single-cluster algorithm for the simulation of the Ising model
NASA Astrophysics Data System (ADS)
Komura, Yukihiro; Okabe, Yutaka
2012-02-01
We present the GPU calculation with the common unified device architecture (CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an algorithm for a quasi-block synchronization, we realize the Wolff single-cluster Monte Carlo simulation with CUDA. We perform parallel computations for the newly added spins in the growing cluster. As a result, the GPU calculation speed for the two-dimensional Ising model at the critical temperature with the linear size L = 4096 is 5.60 times as fast as the calculation speed on a current CPU core. For the three-dimensional Ising model with the linear size L = 256, the GPU calculation speed is 7.90 times as fast as the CPU calculation speed. The idea of quasi-block synchronization can be used not only in the cluster algorithm but also in many fields where the synchronization of all threads is required.
FastGCN: a GPU accelerated tool for fast gene co-expression networks.
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Performance evaluation of H.264/AVC decoding and visualization using the GPU
NASA Astrophysics Data System (ADS)
Pieters, Bart; Van Rijsselbergen, Dieter; De Neve, Wesley; Van de Walle, Rik
2007-09-01
The coding efficiency of the H.264/AVC standard makes the decoding process computationally demanding. This has limited the availability of cost-effective, high-performance solutions. Modern computers are typically equipped with powerful yet cost-effective Graphics Processing Units (GPUs) to accelerate graphics operations. These GPUs can be addressed by means of a 3-D graphics API such as Microsoft Direct3D or OpenGL, using programmable shaders as generic processing units for vector data. The new CUDA (Compute Unified Device Architecture) platform of NVIDIA provides a straightforward way to address the GPU directly, without the need for a 3-D graphics API in the middle. In CUDA, a compiler generates executable code from C code with specific modifiers that determine the execution model. This paper first presents an own-developed H.264/AVC renderer, which is capable of executing motion compensation (MC), reconstruction, and Color Space Conversion (CSC) entirely on the GPU. To steer the GPU, Direct3D combined with programmable pixel and vertex shaders is used. Next, we also present a GPU-enabled decoder utilizing the new CUDA architecture from NVIDIA. This decoder performs MC, reconstruction, and CSC on the GPU as well. Our results compare both GPU-enabled decoders, as well as a CPU-only decoder in terms of speed, complexity, and CPU requirements. Our measurements show that a significant speedup is possible, relative to a CPU-only solution. As an example, real-time playback of high-definition video (1080p) was achieved with our Direct3D and CUDA-based H.264/AVC renderers.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
NASA Astrophysics Data System (ADS)
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, a mobile GPU becomes powerful to deal with 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes the method to implement ultrasound signal processing on a mobile GPU available in the high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the optimization of shader design and load sharing between vertex and fragment shader was performed. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation method of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR with MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method has no significant difference with the processing using MATLAB (i.e., PSNR<52.51 dB). The comparable results of CNR were obtained from both processing methods (i.e., 11.31). From the mobile GPU implementation, the frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms that was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
Performance of new GPU-based scan-conversion algorithm implemented using OpenGL.
Steelman, William A; Richard, William D
2011-04-01
A new GPU-based scan-conversion algorithm implemented using OpenGL is described. The compute performance of this new algorithm running on a modem GPU is compared to the performance of three common scan-conversion algorithms (nearest-neighbor, linear interpolation and bilinear interpolation) implemented in software using a modem CPU. The quality of the images produced by the algorithm, as measured by signal-to-noise power, is also compared to the quality of the images produced using these three common scan-conversion algorithms.
Real-time high definition H.264 video decode using the Xbox 360 GPU
NASA Astrophysics Data System (ADS)
Arevalo Baeza, Juan Carlos; Chen, William; Christoffersen, Eric; Dinu, Daniel; Friemel, Barry
2007-09-01
The Xbox 360 is powered by three dual pipeline 3.2 GHz IBM PowerPC processors and a 500 MHz ATI graphics processing unit. The Graphics Processing Unit (GPU) is a special-purpose device, intended to create advanced visual effects and to render realistic scenes for the latest Xbox 360 games. In this paper, we report work on using the GPU as a parallel processing unit to accelerate the decoding of H.264/AVC high-definition (1920x1080) video. We report our experiences in developing a real-time, software-only high-definition video decoder for the Xbox 360.
A 3D front tracking method on a CPU/GPU system
Bo, Wurigen; Grove, John
2011-01-21
We describe the method to port a sequential 3D interface tracking code to a GPU with CUDA. The interface is represented as a triangular mesh. Interface geometry properties and point propagation are performed on a GPU. Interface mesh adaptation is performed on a CPU. The convergence of the method is assessed from the test problems with given velocity fields. Performance results show overall speedups from 11 to 14 for the test problems under mesh refinement. We also briefly describe our ongoing work to couple the interface tracking method with a hydro solver.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.
Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim
2016-08-01
In this paper, we propose an GPU based Cloud system for high-performance arrhythmia detection. Pan-Tompkins algorithm is used for QRS detection and we optimized beat classification algorithm with K-Nearest Neighbor (K-NN). To support high performance beat classification on the system, we parallelized beat classification algorithm with CUDA to execute the algorithm on virtualized GPU devices on the Cloud system. MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved about 93.5% of detection rate which is comparable to previous researches while our algorithm shows 2.5 times faster execution time compared to CPU only detection algorithm.
NASA Astrophysics Data System (ADS)
Quillen, Alice C.; Moore, A.
2008-09-01
Planetesimal and dust dynamical simulations require collision and nearest neighbor detection. A brute force implementation for sorting interparticle distances requires O(N2) computations for N particles, limiting the numbers of particles that have been simulated. Parallel algorithms recently developed for the GPU (graphics processing unit), such as the radix sort, can run as fast as O(N) and sort distances between a million particles in a few hundred milliseconds. We introduce improvements in collision and nearest neighbor detection algorithms and how we have incorporated them into our efficient parallel 2nd order democratic heliocentric method symplectic integrator written in NVIDIA's CUDA for the GPU.
GPU accelerated support vector machines for mining high-throughput screening data.
Liao, Quan; Wang, Jibo; Webster, Yue; Watson, Ian A
2009-12-01
Support Vector Machine (SVM), one of the most promising tools in chemical informatics, is time-consuming for mining large high-throughput screening (HTS) data sets. Here, we describe a parallelization of SVM-light algorithm on a graphic processor unit (GPU), using molecular fingerprints as descriptors and the Tanimoto index as kernel function. Comparison experiments based on six PubChem Bioassay data sets show that the GPU version is 43-104x faster than SVM-light for building classification models and 112-212x over SVM-light for building regression models.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version
Accelerated GPU simulation of compressible flow by the discontinuous evolution Galerkin method
NASA Astrophysics Data System (ADS)
Block, B. J.; Lukáčová-Medvid'ová, M.; Virnau, P.; Yelash, L.
2012-08-01
The aim of the present paper is to report on our recent results for GPU accelerated simulations of compressible flows. For numerical simulation the adaptive discontinuous Galerkin method with the multidimensional bicharacteristic based evolution Galerkin operator has been used. For time discretization we have applied the explicit third order Runge-Kutta method. Evaluation of the genuinely multidimensional evolution operator has been accelerated using the GPU implementation. We have obtained a speedup up to 30 (in comparison to a single CPU core) for the calculation of the evolution Galerkin operator on a typical discretization mesh consisting of 16384 mesh cells.
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
NASA Astrophysics Data System (ADS)
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
A comparative analysis of GPU implementations of spectral unmixing algorithms
NASA Astrophysics Data System (ADS)
Sanchez, Sergio; Plaza, Antonio
2011-11-01
Spectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It involves the separation of a mixed pixel spectrum into its pure component spectra (called endmembers) and the estimation of the proportion (abundance) of each endmember in the pixel. Over the last years, several algorithms have been proposed for: i) automatic extraction of endmembers, and ii) estimation of the abundance of endmembers in each pixel of the hyperspectral image. The latter step usually imposes two constraints in abundance estimation: the non-negativity constraint (meaning that the estimated abundances cannot be negative) and the sum-toone constraint (meaning that the sum of endmember fractional abundances for a given pixel must be unity). These two steps comprise a hyperspectral unmixing chain, which can be very time-consuming (particularly for high-dimensional hyperspectral images). Parallel computing architectures have offered an attractive solution for fast unmixing of hyperspectral data sets, but these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time. In this paper, we perform an inter-comparison of parallel algorithms for automatic extraction of pure spectral signatures or endmembers and for estimation of the abundance of endmembers in each pixel of the scene. The compared techniques are implemented in graphics processing units (GPUs). These hardware accelerators can bridge the gap towards on-board processing of this kind of data. The considered algorithms comprise the orthogonal subspace projection (OSP), iterative error analysis (IEA) and N-FINDR algorithms for endmember extraction, as well as unconstrained, partially constrained and fully constrained abundance estimation. The considered implementations are inter-compared using different GPU architectures and hyperspectral
Prefiltering Model for Homology Detection Algorithms on GPU
Retamosa, Germán; de Pedro, Luis; González, Ivan; Tamames, Javier
2016-01-01
Homology detection has evolved over the time from heavy algorithms based on dynamic programming approaches to lightweight alternatives based on different heuristic models. However, the main problem with these algorithms is that they use complex statistical models, which makes it difficult to achieve a relevant speedup and find exact matches with the original results. Thus, their acceleration is essential. The aim of this article was to prefilter a sequence database. To make this work, we have implemented a groundbreaking heuristic model based on NVIDIA’s graphics processing units (GPUs) and multicore processors. Depending on the sensitivity settings, this makes it possible to quickly reduce the sequence database by factors between 50% and 95%, while rejecting no significant sequences. Furthermore, this prefiltering application can be used together with multiple homology detection algorithms as a part of a next-generation sequencing system. Extensive performance and accuracy tests have been carried out in the Spanish National Centre for Biotechnology (NCB). The results show that GPU hardware can accelerate the execution times of former homology detection applications, such as National Centre for Biotechnology Information (NCBI), Basic Local Alignment Search Tool for Proteins (BLASTP), up to a factor of 4. KEY POINTS:Owing to the increasing size of the current sequence datasets, filtering approach and high-performance computing (HPC) techniques are the best solution to process all these information in acceptable processing times.Graphics processing unit cards and their corresponding programming models are good options to carry out these processing methods.Combination of filtration models with HPC techniques is able to offer new levels of performance and accuracy in homology detection algorithms such as National Centre for Biotechnology Information Basic Local Alignment Search Tool. PMID:28008220
Design and Testing of GPU based RTC for TMT NFIRAOS
NASA Astrophysics Data System (ADS)
Wang, Lianqi
2013-12-01
Graphical processing units (GPUs) are now gaining popularity in general computing applications due to their high computing power and high memory bandwidth (~10x of CPUs). For the same reason, GPUs are also suitable processors for the real time controllers (RTCs) of next generation adaptive optics (AO) systems. In this talk, we present a CPU+GPU based RTC design for the Thirty Meter Telescope (TMT) Narrow Field Infrared AO System (NFIRAOS), as part of an ongoing trade study of control algorithms and processor hardware options. We demonstrate that the system will meet the stringent latency requirement of first computing gradients for ~15500 laser guide star wavefront sensor sub-apertures, and then commands for ~7000 deformable mirror actuator at 800 Hz, using 12 Nvidia GTX 580 GPUs (2 GPUs per WFS). A classical matrix vector multiply reconstruction algorithm is used for its simplicity and parallelizability. Obtaining the conventional control matrix by inverting the forward influence matrix is impractical due to the large system size and sub-optimal performance due to lacking proper regularization. Instead, the control matrix implements a minimum variance wavefront reconstruction algorithm and is computed column-by-column using an iterative solver. We demonstrate that we can initialize the control matrix in about 1 minute and update it in 10 seconds as operating conditions vary to maintain optimal performance. Additionally, the weights used to compute the subaperture gradients are updated at a similar rate to track changes in the profile of the mesospheric sodium layer. These soft real time and background processes will largely be handled by CPUs. Finally,we will show a first version of the complete block diagram of data flow and mapping to hardware.
GPU-based relative fuzzy connectedness image segmentation
Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.
2013-01-15
Purpose:Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an Script-Small-L {sub {infinity}}-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8 Multiplication-Sign , 22.9 Multiplication-Sign , 20.9 Multiplication-Sign , and 17.5 Multiplication-Sign , correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
NASA Astrophysics Data System (ADS)
Kohno, R.; Hotta, K.; Nishioka, S.; Matsubara, K.; Tansho, R.; Suzuki, T.
2011-11-01
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-11-21
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the computer-unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged 9-67 s. The results demonstrate the successful application of the GPU-based SMC to a clinical proton treatment planning.
Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-05-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel e ciency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
Hallock, Michael J.; Stone, John E.; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-01-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel e ciency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. PMID:24882911
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster
Allada, Veerendra, Benjegerdes, Troy; Bode, Brett
2009-08-31
Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device have to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the BAsic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, they study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library are studied. The results have been compared with that of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is a Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
GPU.proton.DOCK: Genuine Protein Ultrafast proton equilibria consistent DOCKing
Kantardjiev, Alexander A.
2011-01-01
GPU.proton.DOCK (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state of the art service for in silico prediction of protein–protein interactions via rigorous and ultrafast docking code. It is unique in providing stringent account of electrostatic interactions self-consistency and proton equilibria mutual effects of docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms—a step toward more reliable and high accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging physics of the interactions with analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to the interface-intuitive and user-friendly. The input is comprised of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for addition of non-polypeptide charges, extra ionogenic groups with intrinsic pKa values or fixed ions. The output is comprised of docked complexes in PDB format as well as interactive visualization in a molecular viewer. GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/. PMID:21666258
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.
Zhang, Jing; Wang, Hao; Feng, Wu-Chun
2015-10-12
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs
Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.
2014-09-01
We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.
GPU Implementation of Bayesian Neural Network Construction for Data-Intensive Applications
NASA Astrophysics Data System (ADS)
Perry, Michelle; Prosper, Harrison B.; Meyer-Baese, Anke
2014-06-01
We describe a graphical processing unit (GPU) implementation of the Hybrid Markov Chain Monte Carlo (HMC) method for training Bayesian Neural Networks (BNN). Our implementation uses NVIDIA's parallel computing architecture, CUDA. We briefly review BNNs and the HMC method and we describe our implementations and give preliminary results.
Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
2012-09-11
In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
Fast 2-D ultrasound strain imaging: the benefits of using a GPU.
Idzenga, Tim; Gaburov, Evghenii; Vermin, Willem; Menssen, Jan; de Korte, Chris
2014-01-01
Deformation of tissue can be accurately estimated from radio-frequency ultrasound data using a 2-dimensional normalized cross correlation (NCC)-based algorithm. This procedure, however, is very computationally time-consuming. A major time reduction can be achieved by parallelizing the numerous computations of NCC. In this paper, two approaches for parallelization have been investigated: the OpenMP interface on a multi-CPU system and Compute Unified Device Architecture (CUDA) on a graphics processing unit (GPU). The performance of the OpenMP and GPU approaches were compared with a conventional Matlab implementation of NCC. The OpenMP approach with 8 threads achieved a maximum speed-up factor of 132 on the computing of NCC, whereas the GPU approach on an Nvidia Tesla K20 achieved a maximum speed-up factor of 376. Neither parallelization approach resulted in a significant loss in image quality of the elastograms. Parallelization of the NCC computations using the GPU, therefore, significantly reduces the computation time and increases the frame rate for motion estimation.
Accelerating Image Reconstruction in Dual-Head PET System by GPU and Symmetry Properties
Chou, Cheng-Ying; Kao, Yu-Jiun; Wang, Weichung; Kao, Chien-Min; Chen, Chin-Tu
2012-01-01
Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consisted of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus request dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguished features of the symmetry properties of the imaging system and GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system. PMID:23300527
NASA Astrophysics Data System (ADS)
Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.
NASA Astrophysics Data System (ADS)
Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.
2014-11-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using Graphics Processing Unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its CPU-based implementation. This paper presents our efficient GPU-based design on WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its Central Processing Unit (CPU) counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to one CPU core is only 3.5×. We can even boost the speedup to 360× with respect to one CPU core as two K40 GPUs are applied.
Accelerating DynEarthSol3D on tightly coupled CPU-GPU heterogeneous processors
NASA Astrophysics Data System (ADS)
Ta, Tuan; Choo, Kyoshin; Tan, Eh; Jang, Byunghyun; Choi, Eunseo
2015-06-01
DynEarthSol3D (Dynamic Earth Solver in Three Dimensions) is a flexible, open-source finite element solver that models the momentum balance and the heat transfer of elasto-visco-plastic material in the Lagrangian form using unstructured meshes. It provides a platform for the study of the long-term deformation of earth's lithosphere and various problems in civil and geotechnical engineering. However, the continuous computation and update of a very large mesh poses an intolerably high computational burden to developers and users in practice. For example, simulating a small input mesh containing around 3000 elements in 20 million time steps would take more than 10 days on a high-end desktop CPU. In this paper, we explore tightly coupled CPU-GPU heterogeneous processors to address the computing concern by leveraging their new features and developing hardware-architecture-aware optimizations. Our proposed key optimization techniques are three-fold: memory access pattern improvement, data transfer elimination and kernel launch overhead minimization. Experimental results show that our proposed implementation on a tightly coupled heterogeneous processor outperforms all other alternatives including traditional discrete GPU, quad-core CPU using OpenMP, and serial implementations by 67%, 50%, and 154% respectively even though the embedded GPU in the heterogeneous processor has significantly less number of cores than high-end discrete GPU.
A GPU-based Real-time Software Correlation System for the Murchison Widefield Array Prototype
NASA Astrophysics Data System (ADS)
Wayth, Randall B.; Greenhill, Lincoln J.; Briggs, Frank H.
2009-08-01
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory, and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32-antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximizes the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking, and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom-built hardware, and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32-antenna prototype system where it ran in real time for extended periods. We briefly describe the data capture, streaming, and correlation system for the prototype array.
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
Kurzak, Jakub; Luszczek, Pitior; Faverge, Mathieu; Dongarra, Jack
2012-03-01
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
NASA Technical Reports Server (NTRS)
Putnam, Williama
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: The large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to I-kilometer (km) global resolutions Intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs Long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
Accelerating Satellite Image Based Large-Scale Settlement Detection with GPU
Patlolla, Dilip Reddy; Cheriyadat, Anil M; Weaver, Jeanette E; Bright, Eddie A
2012-01-01
Computer vision algorithms for image analysis are often computationally demanding. Application of such algorithms on large image databases\\---- such as the high-resolution satellite imagery covering the entire land surface, can easily saturate the computational capabilities of conventional CPUs. There is a great demand for vision algorithms running on high performance computing (HPC) architecture capable of processing petascale image data. We exploit the parallel processing capability of GPUs to present a GPU-friendly algorithm for robust and efficient detection of settlements from large-scale high-resolution satellite imagery. Feature descriptor generation is an expensive, but a key step in automated scene analysis. To address this challenge, we present GPU implementations for three different feature descriptors\\-- multiscale Historgram of Oriented Gradients (HOG), Gray Level Co-Occurrence Matrix (GLCM) Contrast and local pixel intensity statistics. We perform extensive experimental evaluations of our implementation using diverse and large image datasets. Our GPU implementation of the feature descriptor algorithms results in speedups of 220 times compared to the CPU version. We present an highly efficient settlement detection system running on a multiGPU architecture capable of extracting human settlement regions from a city-scale sub-meter spatial resolution aerial imagery spanning roughly 1200 sq. kilometers in just 56 seconds with detection accuracy close to 90\\%. This remarkable speedup gained by our vision algorithm maintaining high detection accuracy clearly demonstrates that such computational advancements clearly hold the solution for petascale image analysis challenges.
High performance MRI simulations of motion on multi-GPU systems
2014-01-01
Background MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Methods Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Results Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. Conclusions MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
NASA Astrophysics Data System (ADS)
Rattermann, Dale Nicholas
Fast Poisson solvers using the Fast Fourier Transform on uniform grids are especially suited for parallel implementation, making them appropriate for portability on graphical processing unit (GPU) devices. The goal of the following work was to implement, test, and evaluate a fast Poisson solver for periodic boundary conditions for use on a variety of GPU configurations. The solver used in this research was FLASH, an immersed-boundary-based method, which is well suited for complex, time-dependent geometries, has robust adaptive mesh refinement/de-refinement capabilities to capture evolving flow structures, and has been successfully implemented on conventional, parallel supercomputers. However, these solvers are still computationally costly to employ, and the total solver time is dominated by the solution of the pressure Poisson equation using state-of-the-art multigrid methods. FLASH improves the performance of its multigrid solvers by integrating a parallel FFT solver on a uniform grid during a coarse level. This hybrid solver could then be theoretically improved by replacing the highly-parallelizable FFT solver with one that utilizes GPUs, and, thus, was the motivation for my research. In the present work, the CPU-utilizing parallel FFT solver (PFFT) used in the base version of FLASH for solving the Poisson equation on uniform grids has been modified to enable parallel execution on CUDA-enabled GPU devices. New algorithms have been implemented to replace the Poisson solver that decompose the computational domain and send each new block to a GPU for parallel computation. One-dimensional (1-D) decomposition of the computational domain minimizes the amount of network traffic involved in this bandwidth-intensive computation by limiting the amount of all-to-all communication required between processes. Advanced techniques have been incorporated and implemented in a GPU-centric code design, while allowing end users the flexibility of parameter control at runtime in
2013-01-01
CUDA * Optimal employment of GPU memory...the GPU using the stream construct within CUDA . Using this technique, a small amount of...input tile data is sent to the GPU initially. Then, while the CUDA kernels process
Multi-GPU Jacobian Accelerated Computing for Soft Field Tomography
Borsic, A.; Attardo, E. A.; Halter, R. J.
2012-01-01
20 times on a single NVIDIA S1070 GPU, and of 50 times on 4 GPUs, bringing the Jacobian computing time for a fine 3D mesh from 12 minutes to 14 seconds. We regard this as an important step towards gaining interactive reconstruction times in 3D imaging, particularly when coupled in the future with acceleration of the forward problem. While we demonstrate results for Electrical Impedance Tomography, these results apply to any soft-field imaging modality where the Jacobian matrix is computed with the Adjoint Method. PMID:23010857
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.
Chee, Adrian J Y; Yiu, Billy Y S; Yu, Alfred C H
2017-01-01
Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, its practical adoption in clinical use is not straightforward due to the high computational cost for solving eigendecompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphical processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel singular value decomposition), since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such algorithm can be rationally implemented on the GPU are presented. Real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths was studied. Using our eigen-processing framework, real-time video-range throughput (24 frames/s) can be attained for CFI frames with full view in azimuth direction (128 scanlines), up to a scan depth of 5 cm ( λ pixel axial spacing) for slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to the ones derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that the GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
A GPU-Parallelized Eigen-Based Clutter Filter Framework for Ultrasound Color Flow Imaging.
Chee, Adrian; Yiu, Billy; Yu, Alfred
2016-09-07
Eigen-filters with attenuation response adapted to clutter statistics in color flow imaging (CFI) have shown improved flow detection sensitivity in the presence of tissue motion. Nevertheless, its practical adoption in clinical use is not straightforward due to the high computational cost for solving eigen-decompositions. Here, we provide a pedagogical description of how a real-time computing framework for eigen-based clutter filtering can be developed through a single-instruction, multiple data (SIMD) computing approach that can be implemented on a graphical processing unit (GPU). Emphasis is placed on the single-ensemble-based eigen-filtering approach (Hankel-SVD) since it is algorithmically compatible with GPU-based SIMD computing. The key algebraic principles and the corresponding SIMD algorithm are explained, and annotations on how such algorithm can be rationally implemented on the GPU are presented. Real-time efficacy of our framework was experimentally investigated on a single GPU device (GTX Titan X), and the computing throughput for varying scan depths and slow-time ensemble lengths were studied. Using our eigenprocessing framework, real-time video-range throughput (24 fps) can be attained for CFI frames with full-view in azimuth direction (128 scanlines), up to a scan depth of 5 cm (λ pixel axial spacing) for slow-time ensemble length of 16 samples. The corresponding CFI image frames, with respect to the ones derived from non-adaptive polynomial regression clutter filtering, yielded enhanced flow detection sensitivity in vivo, as demonstrated in a carotid imaging case example. These findings indicate that GPU-enabled eigen-based clutter filtering can improve CFI flow detection performance in real time.
Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA.
Mrozek, Dariusz; Brożek, Miłosz; Małysiak-Mrozek, Bożena
2014-02-01
Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT ("GPU-CASSERT") parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm.
SU-E-T-558: Monte Carlo Photon Transport Simulations On GPU with Quadric Geometry
Chi, Y; Tian, Z; Jiang, S; Jia, X
2015-06-15
Purpose: Monte Carlo simulation on GPU has experienced rapid advancements over the past a few years and tremendous accelerations have been achieved. Yet existing packages were developed only in voxelized geometry. In some applications, e.g. radioactive seed modeling, simulations in more complicated geometry are needed. This abstract reports our initial efforts towards developing a quadric geometry module aiming at expanding the application scope of GPU-based MC simulations. Methods: We defined the simulation geometry consisting of a number of homogeneous bodies, each specified by its material composition and limiting surfaces characterized by quadric functions. A tree data structure was utilized to define geometric relationship between different bodies. We modified our GPU-based photon MC transport package to incorporate this geometry. Specifically, geometry parameters were loaded into GPU’s shared memory for fast access. Geometry functions were rewritten to enable the identification of the body that contains the current particle location via a fast searching algorithm based on the tree data structure. Results: We tested our package in an example problem of HDR-brachytherapy dose calculation for shielded cylinder. The dose under the quadric geometry and that under the voxelized geometry agreed in 94.2% of total voxels within 20% isodose line based on a statistical t-test (95% confidence level), where the reference dose was defined to be the one at 0.5cm away from the cylinder surface. It took 243sec to transport 100million source photons under this quadric geometry on an NVidia Titan GPU card. Compared with simulation time of 99.6sec in the voxelized geometry, including quadric geometry reduced efficiency due to the complicated geometry-related computations. Conclusion: Our GPU-based MC package has been extended to support photon transport simulation in quadric geometry. Satisfactory accuracy was observed with a reduced efficiency. Developments for charged
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-07
Monte Carlo (MC) particle transport simulation on a graphics-processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry. This limited application scope of these packages. The purpose of this paper is to develop a module to model parametric geometry and integrate it in GPU-based MC simulations. In our module, each continuous region was defined by its bounding surfaces that were parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated to two previously developed GPU-based MC packages and was tested in two example problems: (1) low energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry. The averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data. When the data was stored in GPU's shared memory, the highest computational speed was achieved. Incorporation of parameterized geometry yielded a computation time that was ~3 times of that in the corresponding voxelized geometry. We also developed a strategy to use an auxiliary index array to reduce frequency of geometry calculations and hence improve efficiency. With this strategy, the computational time ranged in 1.75-2.03 times of the voxelized geometry for coupled photon/electron transport depending on the voxel dimension of the auxiliary index array, and in 0
Federal Register 2010, 2011, 2012, 2013, 2014
2013-04-15
... From the Federal Register Online via the Government Publishing Office NUCLEAR REGULATORY COMMISSION GPU Nuclear Inc., Three Mile Island Nuclear Power Station, Unit 2, Exemption From Certain Security Requirements AGENCY: Nuclear Regulatory Commission. ACTION: Exemption. FOR FURTHER INFORMATION CONTACT: John...
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU
Moriya, S; Sato, M; Tachibana, H
2015-06-15
Purpose: The calculation time is a trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated to accelerate the convolution calculation using an ideal kernel on the Graphic Processing Units (GPU). Methods: The calculation was performed on the AMD graphics hardware of Dual FirePro D700 and our algorithm was implemented using the Aparapi that convert Java bytecode to OpenCL. The process of dose calculation was separated with the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU) of Intel Xeon E5, the calculation loops were performed for all calculation points. On the GPU computation, all of the calculation processes for the points were sent to the GPU and the multi-thread computation was done. In this study, the dose calculation was performed in a water equivalent homogeneous phantom with 150{sup 3} voxels (2 mm calculation grid) and the calculation speed on the GPU to that on the CPU and the accuracy of PDD were compared. Results: The calculation time for the GPU and the CPU were 3.3 sec and 4.4 hour, respectively. The calculation speed for the GPU was 4800 times faster than that for the CPU. The PDD curve for the GPU was perfectly matched to that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable for time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical that the kernel uses a coarse spacing technique if the calculation is faster while keeping the similar accuracy to a current treatment planning system.
A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit
NASA Astrophysics Data System (ADS)
Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.
2013-12-01
Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a
Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong
2010-10-01
Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel CPU of Core (TM) 2 Quad Q6600 and a GPU of Geforce 8800GT, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies.
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Xu, Daguang; Huang, Yong; Kang, Jin U
2014-06-16
We implemented the graphics processing unit (GPU) accelerated compressive sensing (CS) non-uniform in k-space spectral domain optical coherence tomography (SD OCT). Kaiser-Bessel (KB) function and Gaussian function are used independently as the convolution kernel in the gridding-based non-uniform fast Fourier transform (NUFFT) algorithm with different oversampling ratios and kernel widths. Our implementation is compared with the GPU-accelerated modified non-uniform discrete Fourier transform (MNUDFT) matrix-based CS SD OCT and the GPU-accelerated fast Fourier transform (FFT)-based CS SD OCT. It was found that our implementation has comparable performance to the GPU-accelerated MNUDFT-based CS SD OCT in terms of image quality while providing more than 5 times speed enhancement. When compared to the GPU-accelerated FFT based-CS SD OCT, it shows smaller background noise and less side lobes while eliminating the need for the cumbersome k-space grid filling and the k-linear calibration procedure. Finally, we demonstrated that by using a conventional desktop computer architecture having three GPUs, real-time B-mode imaging can be obtained in excess of 30 fps for the GPU-accelerated NUFFT based CS SD OCT with frame size 2048(axial) × 1,000(lateral).
Ha, S; Matej, S; Ispiryan, M; Mueller, K
2013-02-01
We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with.
Ha, S.; Matej, S.; Ispiryan, M.; Mueller, K.
2013-01-01
We describe a GPU-accelerated framework that efficiently models spatially (shift) variant system response kernels and performs forward- and back-projection operations with these kernels for the DIRECT (Direct Image Reconstruction for TOF) iterative reconstruction approach. Inherent challenges arise from the poor memory cache performance at non-axis aligned TOF directions. Focusing on the GPU memory access patterns, we utilize different kinds of GPU memory according to these patterns in order to maximize the memory cache performance. We also exploit the GPU instruction-level parallelism to efficiently hide long latencies from the memory operations. Our experiments indicate that our GPU implementation of the projection operators has slightly faster or approximately comparable time performance than FFT-based approaches using state-of-the-art FFTW routines. However, most importantly, our GPU framework can also efficiently handle any generic system response kernels, such as spatially symmetric and shift-variant as well as spatially asymmetric and shift-variant, both of which an FFT-based approach cannot cope with. PMID:23531763
A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics
NASA Technical Reports Server (NTRS)
Bard, Christopher; Dorelli, John C.
2013-01-01
We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approx. = 126 for a sq 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speedup is consistent with simple estimates based on the known floating point performance, memory throughput and parallel processing capacity of the GTX 480.
GPU linear and non-linear Poisson–Boltzmann solver module for DelPhi
Colmenares, José; Ortiz, Jesús; Rocchia, Walter
2014-01-01
Summary: In this work, we present a CUDA-based GPU implementation of a Poisson–Boltzmann equation solver, in both the linear and non-linear versions, using double precision. A finite difference scheme is adopted and made suitable for the GPU architecture. The resulting code was interfaced with the electrostatics software for biomolecules DelPhi, which is widely used in the computational biology community. The algorithm has been implemented using CUDA and tested over a few representative cases of biological interest. Details of the implementation and performance test results are illustrated. A speedup of ∼10 times was achieved both in the linear and non-linear cases. Availability and implementation: The module is open-source and available at http://www.electrostaticszone.eu/index.php/downloads. Contact: walter.rocchia@iit.it Supplementary information: Supplementary data are available at Bioinformatics online PMID:24292939
NASA Astrophysics Data System (ADS)
Guan, Xiaowei; Guo, Lixin; Liu, Zhongyu
2015-10-01
A novel ray tracing algorithm for high-speed propagation prediction in multi-room indoor environments is proposed in this paper, whose theoretical foundations are geometrical optics (GO) and the uniform theory of diffraction(UTD). Taking the geometrical and electromagnetic information of the complex indoor scene into account, some acceleration techniques are adopted to raise the efficiency of the ray tracing algorithm. The simulation results indicate that the runtime of the ray tracing algorithm will sharply increase when the number of the objects in multi-room buildings is large enough. Therefore, GPU acceleration technology is used to solve that problem. Finally, a typical multi-room indoor environment with several objects in each room is simulated by using the serial ray tracing algorithm and the parallel one respectively. It can be found easily from the results that compared with the serial algorithm, the GPU-based one can achieve greater efficiency.
APES-based procedure for super-resolution SAR imagery with GPU parallel computing
NASA Astrophysics Data System (ADS)
Jia, Weiwei; Xu, Xiaojian; Xu, Guangyao
2015-10-01
The amplitude and phase estimation (APES) algorithm is widely used in modern spectral analysis. Compared with conventional Fourier transform (FFT), APES results in lower sidelobes and narrower spectral peaks. However, in synthetic aperture radar (SAR) imaging with large scene, without parallel computation, it is difficult to apply APES directly to super-resolution radar image processing due to its great amount of calculation. In this paper, a procedure is proposed to achieve target extraction and parallel computing of APES for super-resolution SAR imaging. Numerical experimental are carried out on Tesla K40C with 745 MHz GPU clock rate and 2880 CUDA cores. Results of SAR image with GPU parallel computing show that the parallel APES is remarkably more efficient than that of CPU-based with the same super-resolution.
GPU-based acceleration of an automatic white matter segmentation algorithm using CUDA.
Labra, Nicole; Figueroa, Miguel; Guevara, Pamela; Duclap, Delphine; Hoeunou, Josselin; Poupon, Cyril; Mangin, Jean-Francois
2013-01-01
This paper presents a parallel implementation of an algorithm for automatic segmentation of white matter fibers from tractography data. We execute the algorithm in parallel using a high-end video card with a Graphics Processing Unit (GPU) as a computation accelerator, using the CUDA language. By exploiting the parallelism and the properties of the memory hierarchy available on the GPU, we obtain a speedup in execution time of 33.6 with respect to an optimized sequential version of the algorithm written in C, and of 240 with respect to the original Python/C++ implementation. The execution time is reduced from more than two hours to only 35 seconds for a subject dataset of 800,000 fibers, thus enabling applications that use interactive segmentation and visualization of small to medium-sized tractography datasets.
A block-wise approximate parallel implementation for ART algorithm on CUDA-enabled GPU.
Fan, Zhongyin; Xie, Yaoqin
2015-01-01
Computed tomography (CT) has been widely used to acquire volumetric anatomical information in the diagnosis and treatment of illnesses in many clinics. However, the ART algorithm for reconstruction from under-sampled and noisy projection is still time-consuming. It is the goal of our work to improve a block-wise approximate parallel implementation for the ART algorithm on CUDA-enabled GPU to make the ART algorithm applicable to the clinical environment. The resulting method has several compelling features: (1) the rays are allotted into blocks, making the rays in the same block parallel; (2) GPU implementation caters to the actual industrial and medical application demand. We test the algorithm on a digital shepp-logan phantom, and the results indicate that our method is more efficient than the existing CPU implementation. The high computation efficiency achieved in our algorithm makes it possible for clinicians to obtain real-time 3D images.
A GPU-accelerated adaptive discontinuous Galerkin method for level set equation
NASA Astrophysics Data System (ADS)
Karakus, A.; Warburton, T.; Aksel, M. H.; Sert, C.
2016-01-01
This paper presents a GPU-accelerated nodal discontinuous Galerkin method for the solution of two- and three-dimensional level set (LS) equation on unstructured adaptive meshes. Using adaptive mesh refinement, computations are localised mostly near the interface location to reduce the computational cost. Small global time step size resulting from the local adaptivity is avoided by local time-stepping based on a multi-rate Adams-Bashforth scheme. Platform independence of the solver is achieved with an extensible multi-threading programming API that allows runtime selection of different computing devices (GPU and CPU) and different threading interfaces (CUDA, OpenCL and OpenMP). Overall, a highly scalable, accurate and mass conservative numerical scheme that preserves the simplicity of LS formulation is obtained. Efficiency, performance and local high-order accuracy of the method are demonstrated through distinct numerical test cases.
How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms
NASA Astrophysics Data System (ADS)
Cruz, Felipe A.; Layton, Simon K.; Barba, L. A.
2011-10-01
Computing on graphics processors is maybe one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. Like then, the opportunity comes with challenges. The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU. The end result has been GPU kernels that run at over 500 Gop/s on one NVIDIATESLA C1060 card, thereby reaching close to practical peak.
SOAP3: ultra-fast GPU-based parallel alignment tool for short reads.
Liu, Chi-Man; Wong, Thomas; Wu, Edward; Luo, Ruibang; Yiu, Siu-Ming; Li, Yingrui; Wang, Bingqiang; Yu, Chang; Chu, Xiaowen; Zhao, Kaiyong; Li, Ruiqiang; Lam, Tak-Wah
2012-03-15
SOAP3 is the first short read alignment tool that leverages the multi-processors in a graphic processing unit (GPU) to achieve a drastic improvement in speed. We adapted the compressed full-text index (BWT) used by SOAP2 in view of the advantages and disadvantages of GPU. When tested with millions of Illumina Hiseq 2000 length-100 bp reads, SOAP3 takes < 30 s to align a million read pairs onto the human reference genome and is at least 7.5 and 20 times faster than BWA and Bowtie, respectively. For aligning reads with up to four mismatches, SOAP3 aligns slightly more reads than BWA and Bowtie; this is because SOAP3, unlike BWA and Bowtie, is not heuristic-based and always reports all answers.
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF
NASA Astrophysics Data System (ADS)
Mielikainen, J.; Huang, B.; Huang, A.
2011-12-01
The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size fall speeds, distributions and densities. Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language offering programming GPU's directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of CPU. CUDA takes a bottom-up point of view of parallelism is which thread is an atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has
NASA Astrophysics Data System (ADS)
Le Grand, Scott; Götz, Andreas W.; Walker, Ross C.
2013-02-01
A new precision model is proposed for the acceleration of all-atom classical molecular dynamics (MD) simulations on graphics processing units (GPUs). This precision model replaces double precision arithmetic with fixed point integer arithmetic for the accumulation of force components as compared to a previously introduced model that uses mixed single/double precision arithmetic. This significantly boosts performance on modern GPU hardware without sacrificing numerical accuracy. We present an implementation for NVIDIA GPUs of both generalized Born implicit solvent simulations as well as explicit solvent simulations using the particle mesh Ewald (PME) algorithm for long-range electrostatics using this precision model. Tests demonstrate both the performance of this implementation as well as its numerical stability for constant energy and constant temperature biomolecular MD as compared to a double precision CPU implementation and double and mixed single/double precision GPU implementations.
GPU acceleration of melody accurate matching in query-by-humming.
Xiao, Limin; Zheng, Yao; Tang, Wenqi; Yao, Guangchao; Ruan, Li
2014-01-01
With the increasing scale of the melody database, the query-by-humming system faces the trade-offs between response speed and retrieval accuracy. Melody accurate matching is the key factor to restrict the response speed. In this paper, we present a GPU acceleration method for melody accurate matching, in order to improve the response speed without reducing retrieval accuracy. The method develops two parallel strategies (intra-task parallelism and inter-task parallelism) to obtain accelerated effects. The efficiency of our method is validated through extensive experiments. Evaluation results show that our single GPU implementation achieves 20x to 40x speedup ratio, when compared to a typical general purpose CPU's execution time.
Convolution of large 3D images on GPU and its decomposition
NASA Astrophysics Data System (ADS)
Karas, Pavel; Svoboda, David
2011-12-01
In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
GPU accelerated Foreign Object Debris Detection on Airfield Pavement with visual saliency algorithm
NASA Astrophysics Data System (ADS)
Qi, Jun; Gong, Guoping; Cao, Xiaoguang
2017-01-01
We present a GPU-based implementation of visual saliency algorithm to detect foreign object debris(FOD) on airfield pavement with effectiveness and efficiency. Visual saliency algorithm is introduced in FOD detection for the first time. We improve the image signature algorithm to target at FOD detection in complex background of pavement. First, we make pooling operations in obtaining saliency map to improve recall rate. Then, connected component analysis is applied to filter candidate regions in saliency map to get the final targets in original image. Besides, we map the algorithm to GPU-based kernels and data structures. The parallel version of the algorithm is able to get the results with 23.5 times speedup. Experimental results elucidate that the proposed method is effective to detect FOD real-time.
NASA Astrophysics Data System (ADS)
Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.
2014-11-01
Existing research of patient-specific computational hemodynamics (PSCH) heavily relies on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases computation burden and introduces inaccuracy during data transformation thus limits the medical applications of PSCH. We use lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments seamlessly feed to the LBM based CFD computation of PSCH thus explicit mesh construction and extra data management are avoided. The LBM is ideally suited for GPU (graphic processing unit)-based parallel computing. The parallel acceleration over GPU achieves excellent performance in PSCH computation. An application study will be presented which segments an aortic artery from a chest CT dataset and models PSCH of the segmented artery.
An analytic linear accelerator source model for GPU-based Monte Carlo dose calculations.
Tian, Zhen; Li, Yongbao; Folkerts, Michael; Shi, Feng; Jiang, Steve B; Jia, Xun
2015-10-21
Recently, there has been a lot of research interest in developing fast Monte Carlo (MC) dose calculation methods on graphics processing unit (GPU) platforms. A good linear accelerator (linac) source model is critical for both accuracy and efficiency considerations. In principle, an analytical source model should be more preferred for GPU-based MC dose engines than a phase-space file-based model, in that data loading and CPU-GPU data transfer can be avoided. In this paper, we presented an analytical field-independent source model specifically developed for GPU-based MC dose calculations, associated with a GPU-friendly sampling scheme. A key concept called phase-space-ring (PSR) was proposed. Each PSR contained a group of particles that were of the same type, close in energy and reside in a narrow ring on the phase-space plane located just above the upper jaws. The model parameterized the probability densities of particle location, direction and energy for each primary photon PSR, scattered photon PSR and electron PSR. Models of one 2D Gaussian distribution or multiple Gaussian components were employed to represent the particle direction distributions of these PSRs. A method was developed to analyze a reference phase-space file and derive corresponding model parameters. To efficiently use our model in MC dose calculations on GPU, we proposed a GPU-friendly sampling strategy, which ensured that the particles sampled and transported simultaneously are of the same type and close in energy to alleviate GPU thread divergences. To test the accuracy of our model, dose distributions of a set of open fields in a water phantom were calculated using our source model and compared to those calculated using the reference phase-space files. For the high dose gradient regions, the average distance-to-agreement (DTA) was within 1 mm and the maximum DTA within 2 mm. For relatively low dose gradient regions, the root-mean-square (RMS) dose difference was within 1.1% and the maximum
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems
Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol
2013-04-09
A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
Real-time, fast radio transient searches with GPU de-dispersion
NASA Astrophysics Data System (ADS)
Magro, A.; Karastergiou, A.; Salvini, S.; Mort, B.; Dulwich, F.; Zarb Adami, K.
2011-11-01
The identification and subsequent discovery of fast radio transients using blind-search surveys require a large amount of processing power, in worst cases scaling as ?. For this reason, survey data are generally processed off-line, using high-performance computing architectures or hardware-based designs. In recent years, graphics processing units (GPUs) have been extensively used for numerical analysis and scientific simulations, especially after the introduction of new high-level application programming interfaces. Here, we show how GPUs can be used for fast transient discovery in real time. We present a solution to the problem of de-dispersion, providing performance comparisons with a typical computing machine and traditional pulsar processing software. We describe the architecture of a real-time, GPU-based transient search machine. In terms of performance, our GPU solution provides a speed-up factor of between 50 and 200, depending on the parameters of the search.
GPU-accelerated elastic 3D image registration for intra-surgical applications.
Ruijters, Daniel; ter Haar Romeny, Bart M; Suetens, Paul
2011-08-01
Local motion within intra-patient biomedical images can be compensated by using elastic image registration. The application of B-spline based elastic registration during interventional treatment is seriously hampered by its considerable computation time. The graphics processing unit (GPU) can be used to accelerate the calculation of such elastic registrations by using its parallel processing power, and by employing the hardwired tri-linear interpolation capabilities in order to efficiently perform the cubic B-spline evaluation. In this article it is shown that the similarity measure and its derivatives also can be calculated on the GPU, using a two pass approach. On average a speedup factor 50 compared to a straight-forward CPU implementation was reached.
GPU Acceleration of Melody Accurate Matching in Query-by-Humming
Xiao, Limin; Zheng, Yao; Tang, Wenqi; Yao, Guangchao; Ruan, Li
2014-01-01
With the increasing scale of the melody database, the query-by-humming system faces the trade-offs between response speed and retrieval accuracy. Melody accurate matching is the key factor to restrict the response speed. In this paper, we present a GPU acceleration method for melody accurate matching, in order to improve the response speed without reducing retrieval accuracy. The method develops two parallel strategies (intra-task parallelism and inter-task parallelism) to obtain accelerated effects. The efficiency of our method is validated through extensive experiments. Evaluation results show that our single GPU implementation achieves 20x to 40x speedup ratio, when compared to a typical general purpose CPU's execution time. PMID:24693239
A study of potential numerical pitfalls in GPU-based Monte Carlo dose calculation
NASA Astrophysics Data System (ADS)
Magnoux, Vincent; Ozell, Benoît; Bonenfant, Éric; Després, Philippe
2015-07-01
The purpose of this study was to evaluate the impact of numerical errors caused by the floating point representation of real numbers in a GPU-based Monte Carlo code used for dose calculation in radiation oncology, and to identify situations where this type of error arises. The program used as a benchmark was bGPUMCD. Three tests were performed on the code, which was divided into three functional components: energy accumulation, particle tracking and physical interactions. First, the impact of single-precision calculations was assessed for each functional component. Second, a GPU-specific compilation option that reduces execution time as well as precision was examined. Third, a specific function used for tracking and potentially more sensitive to precision errors was tested by comparing it to a very high-precision implementation. Numerical errors were found in two components of the program. Because of the energy accumulation process, a few voxels surrounding a radiation source end up with a lower computed dose than they should. The tracking system contained a series of operations that abnormally amplify rounding errors in some situations. This resulted in some rare instances (less than 0.1%) of computed distances that are exceedingly far from what they should have been. Most errors detected had no significant effects on the result of a simulation due to its random nature, either because they cancel each other out or because they only affect a small fraction of particles. The results of this work can be extended to other types of GPU-based programs and be used as guidelines to avoid numerical errors on the GPU computing platform.
GPU acceleration of Runge Kutta-Fehlberg and its comparison with Dormand-Prince method
NASA Astrophysics Data System (ADS)
Seen, Wo Mei; Gobithaasan, R. U.; Miura, Kenjiro T.
2014-07-01
There is a significant reduction of processing time and speedup of performance in computer graphics with the emergence of Graphic Processing Units (GPUs). GPUs have been developed to surpass Central Processing Unit (CPU) in terms of performance and processing speed. This evolution has opened up a new area in computing and researches where highly parallel GPU has been used for non-graphical algorithms. Physical or phenomenal simulations and modelling can be accelerated through General Purpose Graphic Processing Units (GPGPU) and Compute Unified Device Architecture (CUDA) implementations. These phenomena can be represented with mathematical models in the form of Ordinary Differential Equations (ODEs) which encompasses the gist of change rate between independent and dependent variables. ODEs are numerically integrated over time in order to simulate these behaviours. The classical Runge-Kutta (RK) scheme is the common method used to numerically solve ODEs. The Runge Kutta Fehlberg (RKF) scheme has been specially developed to provide an estimate of the principal local truncation error at each step, known as embedding estimate technique. This paper delves into the implementation of RKF scheme for GPU devices and compares its result with Dorman Prince method. A pseudo code is developed to show the implementation in detail. Hence, practitioners will be able to understand the data allocation in GPU, formation of RKF kernels and the flow of data to/from GPU-CPU upon RKF kernel evaluation. The pseudo code is then written in C Language and two ODE models are executed to show the achievable speedup as compared to CPU implementation. The accuracy and efficiency of the proposed implementation method is discussed in the final section of this paper.
A study of potential numerical pitfalls in GPU-based Monte Carlo dose calculation.
Magnoux, Vincent; Ozell, Benoît; Bonenfant, Éric; Després, Philippe
2015-07-07
The purpose of this study was to evaluate the impact of numerical errors caused by the floating point representation of real numbers in a GPU-based Monte Carlo code used for dose calculation in radiation oncology, and to identify situations where this type of error arises. The program used as a benchmark was bGPUMCD. Three tests were performed on the code, which was divided into three functional components: energy accumulation, particle tracking and physical interactions. First, the impact of single-precision calculations was assessed for each functional component. Second, a GPU-specific compilation option that reduces execution time as well as precision was examined. Third, a specific function used for tracking and potentially more sensitive to precision errors was tested by comparing it to a very high-precision implementation. Numerical errors were found in two components of the program. Because of the energy accumulation process, a few voxels surrounding a radiation source end up with a lower computed dose than they should. The tracking system contained a series of operations that abnormally amplify rounding errors in some situations. This resulted in some rare instances (less than 0.1%) of computed distances that are exceedingly far from what they should have been. Most errors detected had no significant effects on the result of a simulation due to its random nature, either because they cancel each other out or because they only affect a small fraction of particles. The results of this work can be extended to other types of GPU-based programs and be used as guidelines to avoid numerical errors on the GPU computing platform.
APEnet+: a 3D Torus network optimized for GPU-based HPC Systems
NASA Astrophysics Data System (ADS)
Ammendola, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.
2012-12-01
In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative — the QUonG project — whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/T F LOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.
A GPU Accelerated Discontinuous Galerkin Conservative Level Set Method for Simulating Atomization
NASA Astrophysics Data System (ADS)
Jibben, Zechariah J.
This dissertation describes a process for interface capturing via an arbitrary-order, nearly quadrature free, discontinuous Galerkin (DG) scheme for the conservative level set method (Olsson et al., 2005, 2008). The DG numerical method is utilized to solve both advection and reinitialization, and executed on a refined level set grid (Herrmann, 2008) for effective use of processing power. Computation is executed in parallel utilizing both CPU and GPU architectures to make the method feasible at high order. Finally, a sparse data structure is implemented to take full advantage of parallelism on the GPU, where performance relies on well-managed memory operations. With solution variables projected into a kth order polynomial basis, a k + 1 order convergence rate is found for both advection and reinitialization tests using the method of manufactured solutions. Other standard test cases, such as Zalesak's disk and deformation of columns and spheres in periodic vortices are also performed, showing several orders of magnitude improvement over traditional WENO level set methods. These tests also show the impact of reinitialization, which often increases shape and volume errors as a result of level set scalar trapping by normal vectors calculated from the local level set field. Accelerating advection via GPU hardware is found to provide a 30x speedup factor comparing a 2.0GHz Intel Xeon E5-2620 CPU in serial vs. a Nvidia Tesla K20 GPU, with speedup factors increasing with polynomial degree until shared memory is filled. A similar algorithm is implemented for reinitialization, which relies on heavier use of shared and global memory and as a result fills them more quickly and produces smaller speedups of 18x.
Real Time Motion Detection Based on the Spatio-Temporal Median Filter using GPU Integral Histograms
2012-12-01
histogram is extensible to higher dimen- sions and different bin structures. The integral histogram at position (x, y) in the image holds the histogram for...syncthreads(); // write the transposed matrix tile to global memory xIndex = blockIdx.z * BLOCK_DIM + threadIdx.x; yIndex = blockIdx.y * BLOCK_DIM...bank conflict shared memory and guaranties that global reads and writes are coalesced. Our GPU integral histogram implementation benefits from
Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V
2010-11-01
Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a C(α)-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v(f) = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v(f). This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10(5) residues using the SOP-GPU package.
TH-E-BRE-08: GPU-Monte Carlo Based Fast IMRT Plan Optimization
Li, Y; Tian, Z; Shi, F; Jiang, S; Jia, X
2014-06-15
Purpose: Intensity-modulated radiation treatment (IMRT) plan optimization needs pre-calculated beamlet dose distribution. Pencil-beam or superposition/convolution type algorithms are typically used because of high computation speed. However, inaccurate beamlet dose distributions, particularly in cases with high levels of inhomogeneity, may mislead optimization, hindering the resulting plan quality. It is desire to use Monte Carlo (MC) methods for beamlet dose calculations. Yet, the long computational time from repeated dose calculations for a number of beamlets prevents this application. It is our objective to integrate a GPU-based MC dose engine in lung IMRT optimization using a novel two-steps workflow. Methods: A GPU-based MC code gDPM is used. Each particle is tagged with an index of a beamlet where the source particle is from. Deposit dose are stored separately for beamlets based on the index. Due to limited GPU memory size, a pyramid space is allocated for each beamlet, and dose outside the space is neglected. A two-steps optimization workflow is proposed for fast MC-based optimization. At first step, rough beamlet dose calculations is conducted with only a small number of particles per beamlet. Plan optimization is followed to get an approximated fluence map. In the second step, more accurate beamlet doses are calculated, where sampled number of particles for a beamlet is proportional to the intensity determined previously. A second-round optimization is conducted, yielding the final Result. Results: For a lung case with 5317 beamlets, 10{sup 5} particles per beamlet in the first round, and 10{sup 8} particles per beam in the second round are enough to get a good plan quality. The total simulation time is 96.4 sec. Conclusion: A fast GPU-based MC dose calculation method along with a novel two-step optimization workflow are developed. The high efficiency allows the use of MC for IMRT optimizations.
BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU
Lee, Lap-Kei; Cheung, Jeanno; Liu, Chi-Man
2014-01-01
This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA’s speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa. PMID:24949238
NASA Astrophysics Data System (ADS)
Portillo, Ricardo; Arunagiri, Sarala; Teller, Patricia J.; Park, Song J.; Nguyen, Lam H.; Deroba, Joseph C.; Shires, Dale
2011-06-01
The continuing miniaturization and parallelization of computer hardware has facilitated the development of mobile and field-deployable systems that can accommodate terascale processing within once prohibitively small size and weight constraints. General-purpose Graphics Processing Units (GPUs) are prominent examples of such terascale devices. Unfortunately, the added computational capability of these devices often comes at the cost of larger demands on power, an already strained resource in these systems. This study explores power versus performance issues for a workload that can take advantage of GPU capability and is targeted to run in field-deployable environments, i.e., Synthetic Aperture Radar (SAR). Specifically, we focus on the Image Formation (IF) computational phase of SAR, often the most compute intensive, and evaluate two different state-of-the-art GPU implementations of this IF method. Using real and simulated data sets, we evaluate performance tradeoffs for single- and double-precision versions of these implementations in terms of time-to-solution, image output quality, and total energy consumption. We employ fine-grain direct-measurement techniques to capture isolated power utilization and energy consumption of the GPU device, and use general and radarspecific metrics to evaluate image output quality. We show that double-precision IF can provide slight image improvement to low-reflective areas of SAR images, but note that the added quality may not be worth the higher power and energy costs associated with higher precision operations.
The GENGA Code: Gravitational Encounters in N-body simulations with GPU Acceleration.
NASA Astrophysics Data System (ADS)
Grimm, Simon; Stadel, Joachim
2013-07-01
We present a GPU (Graphics Processing Unit) implementation of a hybrid symplectic N-body integrator based on the Mercury Code (Chambers 1999), which handles close encounters with a very good energy conservation. It uses a combination of a mixed variable integration (Wisdom & Holman 1991) and a direct N-body Bulirsch-Stoer method. GENGA is written in CUDA C and runs on NVidia GPU's. The GENGA code supports three simulation modes: Integration of up to 2048 massive bodies, integration with up to a million test particles, or parallel integration of a large number of individual planetary systems. To achieve the best performance, GENGA runs completely on the GPU, where it can take advantage of the very fast, but limited, memory that exists there. All operations are performed in parallel, including the close encounter detection and grouping independent close encounter pairs. Compared to Mercury, GENGA runs up to 30 times faster. Two applications of GENGA are presented: First, the dynamics of planetesimals and the late stage of rocky planet formation due to planetesimal collisions. Second, a dynamical stability analysis of an exoplanetary system with an additional hypothetical super earth, which shows that in some multiple planetary systems, additional super earths could exist without perturbing the dynamical stability of the other planets (Elser et al. 2013).
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Lyakh, Dmitry I.
2015-01-05
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Lyakh, Dmitry I.
2015-01-05
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typicallymore » appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).« less
A performance/cost evaluation for a GPU-based drug discovery application on volunteer computing.
Guerrero, Ginés D; Imbernón, Baldomero; Pérez-Sánchez, Horacio; Sanz, Francisco; García, José M; Cecilia, José M
2014-01-01
Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to own large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF that their computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor.
Luo, Ruibang; Wong, Yiu-Lun; Law, Wai-Chun; Lee, Lap-Kei; Cheung, Jeanno; Liu, Chi-Man; Lam, Tak-Wah
2014-01-01
This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.
Yu, Fengchao; Liu, Huafeng; Hu, Zhenghui; Shi, Pengcheng
2012-04-01
As a consequence of the random nature of photon emissions and detections, the data collected by a positron emission tomography (PET) imaging system can be shown to be Poisson distributed. Meanwhile, there have been considerable efforts within the tracer kinetic modeling communities aimed at establishing the relationship between the PET data and physiological parameters that affect the uptake and metabolism of the tracer. Both statistical and physiological models are important to PET reconstruction. The majority of previous efforts are based on simplified, nonphysical mathematical expression, such as Poisson modeling of the measured data, which is, on the whole, completed without consideration of the underlying physiology. In this paper, we proposed a graphics processing unit (GPU)-accelerated reconstruction strategy that can take both statistical model and physiological model into consideration with the aid of state-space evolution equations. The proposed strategy formulates the organ activity distribution through tracer kinetics models and the photon-counting measurements through observation equations, thus making it possible to unify these two constraints into a general framework. In order to accelerate reconstruction, GPU-based parallel computing is introduced. Experiments of Zubal-thorax-phantom data, Monte Carlo simulated phantom data, and real phantom data show the power of the method. Furthermore, thanks to the computing power of the GPU, the reconstruction time is practical for clinical application.
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation.
Mei, Gang; Tian, Hong
2016-01-01
This paper focuses on evaluating the impact of different data layouts on the computational efficiency of GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First we redesign and improve our previous GPU implementation that was performed by exploiting the feature of CUDA dynamic parallelism (CDP). Then we implement three versions of GPU implementations, i.e., the naive version, the tiled version, and the improved CDP version, based upon five data layouts, including the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. We also carry out several groups of experimental tests to evaluate the impact. Experimental results show that: the layouts AoS and AoaS achieve better performance than the layout SoA for both the naive version and tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that: for the two combined data layouts (the SoAoS and the Hybrid), there are no notable performance gains when compared to other three basic layouts. We recommend that: in practical applications, the layout AoaS is the best choice since the tiled version is the fastest one among three versions. The source code of all implementations are publicly available.
GPU-based ray tracing algorithm for high-speed propagation prediction in typical indoor environments
NASA Astrophysics Data System (ADS)
Guo, Lixin; Guan, Xiaowei; Liu, Zhongyu
2015-10-01
A fast 3-D ray tracing propagation prediction model based on virtual source tree is presented in this paper, whose theoretical foundations are geometrical optics(GO) and the uniform theory of diffraction(UTD). In terms of typical single room indoor scene, taking the geometrical and electromagnetic information into account, some acceleration techniques are adopted to raise the efficiency of the ray tracing algorithm. The simulation results indicate that the runtime of the ray tracing algorithm will sharply increase when the number of the objects in the single room is large enough. Therefore, GPU acceleration technology is used to solve that problem. As is known to all, GPU is good at calculation operation rather than logical judgment, so that tens of thousands of threads in CUDA programs are able to calculate at the same time, in order to achieve massively parallel acceleration. Finally, a typical single room with several objects is simulated by using the serial ray tracing algorithm and the parallel one respectively. It can be found easily from the results that compared with the serial algorithm, the GPU-based one can achieve greater efficiency.
Efficient parallel video processing techniques on GPU: from framework to implementation.
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for H.264/AVC encoder based on massively parallel architecture. We implemented the proposed framework by CUDA on NVIDIA's GPU. Not only the compute intensive components of the H.264 encoder are parallelized but also the control intensive components are realized effectively, such as CAVLC and deblocking filter. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and direction-priority deblocking filter. More than 96% of workload of H.264 encoder is offloaded to GPU. Experimental results show that the parallel implementation outperforms the serial program by 20 times of speedup ratio and satisfies the requirement of the real-time HD encoding of 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB, when keeping the same bitrate. Through the analysis to the kernels, we found that speedup ratios of the compute intensive algorithms are proportional with the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is much related to the memory bandwidth, which gives an insight for new architecture design.
Efficient parallel implementation of active appearance model fitting algorithm on GPU.
Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou
2014-01-01
The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
Development of GPU-Optimized EFIT for DIII-D Equilibrium Reconstructions
NASA Astrophysics Data System (ADS)
Huang, Y.; Lao, L. L.; Xiao, B. J.; Luo, Z. P.; Yue, X. N.
2015-11-01
The development of a parallel, Graphical Processing Unit (GPU)-optimized version of EFIT for DIII-D equilibrium reconstructions is presented. This GPU-optimized version (P-EFIT) is built with the CUDA (Compute Unified Device Architecture) platform to take advantage of massively parallel GPU cores to significantly accelerate the computation under the EFIT framework. The parallel processing is implemented with the Single-Instruction Multiple-Thread (SIMT) architecture. New parallel modules to trace plasma surfaces and compute plasma parameters have been constructed. DIII-D magnetic benchmark tests show that P-EFIT could accurately reproduce the EFIT reconstruction algorithms at a fraction of the computational cost. The acceleration factor continues to increase as the (R, Z) spatial grids are increased from 65 × 65 to 257 × 257 , suggesting there may be rooms for further optimization by further reducing the communication cost. Details of the P-EFIT optimization algorithms will be discussed. Work supported by US DOE DE-FC02-04ER54698, and by China MOST under 2014GB103000, China NNSF 11205191, China CAS GJHZ201303.
GPU-Based 3D Cone-Beam CT Image Reconstruction for Large Data Volume
Zhao, Xing; Hu, Jing-jing; Zhang, Peng
2009-01-01
Currently, 3D cone-beam CT image reconstruction speed is still a severe limitation for clinical application. The computational power of modern graphics processing units (GPUs) has been harnessed to provide impressive acceleration of 3D volume image reconstruction. For extra large data volume exceeding the physical graphic memory of GPU, a straightforward compromise is to divide data volume into blocks. Different from the conventional Octree partition method, a new partition scheme is proposed in this paper. This method divides both projection data and reconstructed image volume into subsets according to geometric symmetries in circular cone-beam projection layout, and a fast reconstruction for large data volume can be implemented by packing the subsets of projection data into the RGBA channels of GPU, performing the reconstruction chunk by chunk and combining the individual results in the end. The method is evaluated by reconstructing 3D images from computer-simulation data and real micro-CT data. Our results indicate that the GPU implementation can maintain original precision and speed up the reconstruction process by 110–120 times for circular cone-beam scan, as compared to traditional CPU implementation. PMID:19730744
Fast, Accurate and Shift-Varying Line Projections for Iterative Reconstruction Using the GPU
Pratx, Guillem; Chinn, Garry; Olcott, Peter D.; Levin, Craig S.
2013-01-01
List-mode processing provides an efficient way to deal with sparse projections in iterative image reconstruction for emission tomography. An issue often reported is the tremendous amount of computation required by such algorithm. Each recorded event requires several back- and forward line projections. We investigated the use of the programmable graphics processing unit (GPU) to accelerate the line-projection operations and implement fully-3D list-mode ordered-subsets expectation-maximization for positron emission tomography (PET). We designed a reconstruction approach that incorporates resolution kernels, which model the spatially-varying physical processes associated with photon emission, transport and detection. Our development is particularly suitable for applications where the projection data is sparse, such as high-resolution, dynamic, and time-of-flight PET reconstruction. The GPU approach runs more than 50 times faster than an equivalent CPU implementation while image quality and accuracy are virtually identical. This paper describes in details how the GPU can be used to accelerate the line projection operations, even when the lines-of-response have arbitrary endpoint locations and shift-varying resolution kernels are used. A quantitative evaluation is included to validate the correctness of this new approach. PMID:19244015
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.
da Silva, Joakim; Ansorge, Richard; Jena, Rajesh
2015-06-21
Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
GPU techniques applied to Euler flow simulations and comparison to CPU performance
NASA Astrophysics Data System (ADS)
Koop, Blake
With the decrease in cost of computing, and the increasingly friendly programming environments, the demand for computer generated models of real world problems has surged. Each generation of computer hardware becomes marginally faster than its predecessor, allowing for decreases in required computation time. However, the progression is slowing and will soon reach a barrier as lithography reaches its natural limits. General Purpose Graphics Processing Unit (GPGPU) programming, rather than traditional programming written for Central Processing Unit (CPU) architectures may be a viable way for computational scientists to continue to realize wall clock time reductions at a Moore's Law pace. If a code can be modified to take advantage of the Single-Input-Multiple-Data (SIMD) architecture of Graphics Processing Units (GPUs), it may be possible to gain the functionality of hundreds or thousands of cores available on a GPU card. This paper details the investigation of a specific compressible flow simulation and its functionality in both CPU and GPU programming schemes. The flow is governed by the unsteady Euler flow equations and it is checked for validity against the known solution in all three directions. It is then run over varying grid sizes using both the CPU and GPU programming schemes to evaluate wall clock time reductions.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network
NASA Astrophysics Data System (ADS)
Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.
2014-06-01
APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid clusters CPU-GPU dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages upon peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both trasmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
NASA Astrophysics Data System (ADS)
Neophytou, Neophytos; Xu, Fang; Mueller, Klaus
2007-03-01
Three-dimensional computed tomography (CT) is a compute-intensive process, due to the large amounts of source and destination data, and this limits the speed at which a reconstruction can be obtained. There are two main approaches to cope with this problem: (i) lowering the overall computational complexity via algorithmic means, and/or (ii) running CT on specialized high-performance hardware. Since the latter requires considerable capital investment into rather inflexible hardware, the former option is all one has typically available in a traditional CPU-based computing environment. However, the emergence of programmable commodity graphics hardware (GPUs) has changed this situation in a decisive way. In this paper, we show that GPUs represent a commodity high-performance parallel architecture that resonates very well with the computational structure and operations inherent to CT. Using formal arguments as well as experiments we demonstrate that GPU-based 'brute-force' CT (i.e., CT at regular complexity) can be significantly faster than CPU-based as well as GPU-based CT with optimal complexity, at least for practical data sizes. Therefore, the answer to the title question: "Can GPU-based processing beat complexity optimization for CT?" is "Absolutely!"
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Su, Huayou; Wen, Mei; Wu, Nan; Ren, Ju; Zhang, Chunyuan
2014-01-01
Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for H.264/AVC encoder based on massively parallel architecture. We implemented the proposed framework by CUDA on NVIDIA's GPU. Not only the compute intensive components of the H.264 encoder are parallelized but also the control intensive components are realized effectively, such as CAVLC and deblocking filter. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and direction-priority deblocking filter. More than 96% of workload of H.264 encoder is offloaded to GPU. Experimental results show that the parallel implementation outperforms the serial program by 20 times of speedup ratio and satisfies the requirement of the real-time HD encoding of 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB, when keeping the same bitrate. Through the analysis to the kernels, we found that speedup ratios of the compute intensive algorithms are proportional with the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is much related to the memory bandwidth, which gives an insight for new architecture design. PMID:24757432
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing
Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.
2014-01-01
Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to own large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF that their computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055
GPU-accelerated ray-tracing for real-time treatment planning
NASA Astrophysics Data System (ADS)
Heinrich, H.; Ziegenhein, P.; Kamerling, C. P.; Froening, H.; Oelfke, U.
2014-03-01
Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) which implements a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources
NASA Astrophysics Data System (ADS)
Townson, Reid W.; Jia, Xun; Tian, Zhen; Jiang Graves, Yan; Zavgorodni, Sergei; Jiang, Steve B.
2013-06-01
A novel phase-space source implementation has been designed for graphics processing unit (GPU)-based Monte Carlo dose calculation engines. Short of full simulation of the linac head, using a phase-space source is the most accurate method to model a clinical radiation beam in dose calculations. However, in GPU-based Monte Carlo dose calculations where the computation efficiency is very high, the time required to read and process a large phase-space file becomes comparable to the particle transport time. Moreover, due to the parallelized nature of GPU hardware, it is essential to simultaneously transport particles of the same type and similar energies but separated spatially to yield a high efficiency. We present three methods for phase-space implementation that have been integrated into the most recent version of the GPU-based Monte Carlo radiotherapy dose calculation package gDPM v3.0. The first method is to sequentially read particles from a patient-dependent phase-space and sort them on-the-fly based on particle type and energy. The second method supplements this with a simple secondary collimator model and fluence map implementation so that patient-independent phase-space sources can be used. Finally, as the third method (called the phase-space-let, or PSL, method) we introduce a novel source implementation utilizing pre-processed patient-independent phase-spaces that are sorted by particle type, energy and position. Position bins located outside a rectangular region of interest enclosing the treatment field are ignored, substantially decreasing simulation time with little effect on the final dose distribution. The three methods were validated in absolute dose against BEAMnrc/DOSXYZnrc and compared using gamma-index tests (2%/2 mm above the 10% isodose). It was found that the PSL method has the optimal balance between accuracy and efficiency and thus is used as the default method in gDPM v3.0. Using the PSL method, open fields of 4 × 4, 10 × 10 and 30 × 30 cm
a method of gravity and seismic sequential inversion and its GPU implementation
NASA Astrophysics Data System (ADS)
Liu, G.; Meng, X.
2011-12-01
In this abstract, we introduce a gravity and seismic sequential inversion method to invert for density and velocity together. For the gravity inversion, we use an iterative method based on correlation imaging algorithm; for the seismic inversion, we use the full waveform inversion. The link between the density and velocity is an empirical formula called Gardner equation, for large volumes of data, we use the GPU to accelerate the computation. For the gravity inversion method , we introduce a method based on correlation imaging algorithm,it is also a interative method, first we calculate the correlation imaging of the observed gravity anomaly, it is some value between -1 and +1, then we multiply this value with a little density ,this value become the initial density model. We get a forward reuslt with this initial model and also calculate the correaltion imaging of the misfit of observed data and the forward data, also multiply the correaltion imaging result a little density and add it to the initial model, then do the same procedure above , at last ,we can get a inversion density model. For the seismic inveron method ,we use a mothod base on the linearity of acoustic wave equation written in the frequency domain,with a intial velociy model, we can get a good velocity result. In the sequential inversion of gravity and seismic , we need a link formula to convert between density and velocity ,in our method , we use the Gardner equation. Driven by the insatiable market demand for real time, high-definition 3D images, the programmable NVIDIA Graphic Processing Unit (GPU) as co-processor of CPU has been developed for high performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA designed to overcome the challenge of using traditional general purpose GPU while maintaining a low learn curve for programmers familiar with standard programming languages such as C. In our inversion processing
GPU-based Monte Carlo radiotherapy dose calculation using phase-space sources.
Townson, Reid W; Jia, Xun; Tian, Zhen; Graves, Yan Jiang; Zavgorodni, Sergei; Jiang, Steve B
2013-06-21
A novel phase-space source implementation has been designed for graphics processing unit (GPU)-based Monte Carlo dose calculation engines. Short of full simulation of the linac head, using a phase-space source is the most accurate method to model a clinical radiation beam in dose calculations. However, in GPU-based Monte Carlo dose calculations where the computation efficiency is very high, the time required to read and process a large phase-space file becomes comparable to the particle transport time. Moreover, due to the parallelized nature of GPU hardware, it is essential to simultaneously transport particles of the same type and similar energies but separated spatially to yield a high efficiency. We present three methods for phase-space implementation that have been integrated into the most recent version of the GPU-based Monte Carlo radiotherapy dose calculation package gDPM v3.0. The first method is to sequentially read particles from a patient-dependent phase-space and sort them on-the-fly based on particle type and energy. The second method supplements this with a simple secondary collimator model and fluence map implementation so that patient-independent phase-space sources can be used. Finally, as the third method (called the phase-space-let, or PSL, method) we introduce a novel source implementation utilizing pre-processed patient-independent phase-spaces that are sorted by particle type, energy and position. Position bins located outside a rectangular region of interest enclosing the treatment field are ignored, substantially decreasing simulation time with little effect on the final dose distribution. The three methods were validated in absolute dose against BEAMnrc/DOSXYZnrc and compared using gamma-index tests (2%/2 mm above the 10% isodose). It was found that the PSL method has the optimal balance between accuracy and efficiency and thus is used as the default method in gDPM v3.0. Using the PSL method, open fields of 4 × 4, 10 × 10 and 30 × 30 cm
Development of a GPU Compatible Version of the Fast Radiation Code RRTMG
NASA Astrophysics Data System (ADS)
Iacono, M. J.; Mlawer, E. J.; Berthiaume, D.; Cady-Pereira, K. E.; Suarez, M.; Oreopoulos, L.; Lee, D.
2012-12-01
The absorption of solar radiation and emission/absorption of thermal radiation are crucial components of the physics that drive Earth's climate and weather. Therefore, accurate radiative transfer calculations are necessary for realistic climate and weather simulations. Efficient radiation codes have been developed for this purpose, but their accuracy requirements still necessitate that as much as 30% of the computational time of a GCM is spent computing radiative fluxes and heating rates. The overall computational expense constitutes a limitation on a GCM's predictive ability if it becomes an impediment to adding new physics to or increasing the spatial and/or vertical resolution of the model. The emergence of Graphics Processing Unit (GPU) technology, which will allow the parallel computation of multiple independent radiative calculations in a GCM, will lead to a fundamental change in the competition between accuracy and speed. Processing time previously consumed by radiative transfer will now be available for the modeling of other processes, such as physics parameterizations, without any sacrifice in the accuracy of the radiative transfer. Furthermore, fast radiation calculations can be performed much more frequently and will allow the modeling of radiative effects of rapid changes in the atmosphere. The fast radiation code RRTMG, developed at Atmospheric and Environmental Research (AER), is utilized operationally in many dynamical models throughout the world. We will present the results from the first stage of an effort to create a version of the RRTMG radiation code designed to run efficiently in a GPU environment. This effort will focus on the RRTMG implementation in GEOS-5. RRTMG has an internal pseudo-spectral vector of length of order 100 that, when combined with the much greater length of the global horizontal grid vector from which the radiation code is called in GEOS-5, makes RRTMG/GEOS-5 particularly suited to achieving a significant speed improvement
NASA Astrophysics Data System (ADS)
Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin
2016-06-01
CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
Xiao, Weijun; Huang, He; Sun, Yan; Yang, Qing
2009-01-01
Applying electromyographic (EMG) signal pattern recognition to artificial leg control is challenging because leg EMGs are non-stationary. Time-frequency features are suitable for representing non-stationary signals; however, the computational complexity to extract time-frequency features is too high and current embedded systems used for artificial limb control are inadequate for real-time computing. The aim of this study was to quantify the computational speed of a novel embedded system, the Graphic Processor Unit (GPU), on EMG time-frequency feature extraction. The computational time derived from a GPU was compared to that derived from a general purpose CPU. The results indicated that the GPU significantly increased the computational speed. When the size of EMG analysis window was set to 100 ms, the GPU extracted EMG time-frequency features over 50 times faster than the CPU setting. Therefore, high performance GPU shows a great promise for EMG-controlled artificial legs and other medical applications that need high-speed and real-time computation.
GPU accelerated study of heat transfer and fluid flow by lattice Boltzmann method on CUDA
NASA Astrophysics Data System (ADS)
Ren, Qinlong
Lattice Boltzmann method (LBM) has been developed as a powerful numerical approach to simulate the complex fluid flow and heat transfer phenomena during the past two decades. As a mesoscale method based on the kinetic theory, LBM has several advantages compared with traditional numerical methods such as physical representation of microscopic interactions, dealing with complex geometries and highly parallel nature. Lattice Boltzmann method has been applied to solve various fluid behaviors and heat transfer process like conjugate heat transfer, magnetic and electric field, diffusion and mixing process, chemical reactions, multiphase flow, phase change process, non-isothermal flow in porous medium, microfluidics, fluid-structure interactions in biological system and so on. In addition, as a non-body-conformal grid method, the immersed boundary method (IBM) could be applied to handle the complex or moving geometries in the domain. The immersed boundary method could be coupled with lattice Boltzmann method to study the heat transfer and fluid flow problems. Heat transfer and fluid flow are solved on Euler nodes by LBM while the complex solid geometries are captured by Lagrangian nodes using immersed boundary method. Parallel computing has been a popular topic for many decades to accelerate the computational speed in engineering and scientific fields. Today, almost all the laptop and desktop have central processing units (CPUs) with multiple cores which could be used for parallel computing. However, the cost of CPUs with hundreds of cores is still high which limits its capability of high performance computing on personal computer. Graphic processing units (GPU) is originally used for the computer video cards have been emerged as the most powerful high-performance workstation in recent years. Unlike the CPUs, the cost of GPU with thousands of cores is cheap. For example, the GPU (GeForce GTX TITAN) which is used in the current work has 2688 cores and the price is only 1
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
NASA Astrophysics Data System (ADS)
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.
2012-07-01
Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The result showed that a speedup of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system respectively. The speedup factor was further increased by a factor of {approx}2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
GL4D: a GPU-based architecture for interactive 4D visualization.
Chu, Alan; Fu, Chi-Wing; Hanson, Andrew J; Heng, Pheng-Ann
2009-01-01
This paper describes GL4D, an interactive system for visualizing 2-manifolds and 3-manifolds embedded in four Euclidean dimensions and illuminated by 4D light sources. It is a tetrahedron-based rendering pipeline that projects geometry into volume images, an exact parallel to the conventional triangle-based rendering pipeline for 3D graphics. Novel features include GPU-based algorithms for real-time 4D occlusion handling and transparency compositing; we thus enable a previously impossible level of quality and interactivity for exploring lit 4D objects. The 4D tetrahedrons are stored in GPU memory as vertex buffer objects, and the vertex shader is used to perform per-vertex 4D modelview transformations and 4D-to-3D projection. The geometry shader extension is utilized to slice the projected tetrahedrons and rasterize the slices into individual 2D layers of voxel fragments. Finally, the fragment shader performs per-voxel operations such as lighting and alpha blending with previously computed layers. We account for 4D voxel occlusion along the 4D-to-3D projection ray by supporting a multi-pass back-to-front fragment composition along the projection ray; to accomplish this, we exploit a new adaptation of the dual depth peeling technique to produce correct volume image data and to simultaneously render the resulting volume data using 3D transfer functions into the final 2D image. Previous CPU implementations of the rendering of 4D-embedded 3-manifolds could not perform either the 4D depth-buffered projection or manipulation of the volume-rendered image in real-time; in particular, the dual depth peeling algorithm is a novel GPU-based solution to the real-time 4D depth-buffering problem. GL4D is implemented as an integrated OpenGL-style API library, so that the underlying shader operations are as transparent as possible to the user.
A GPU-Based Visualization Method for Computing Dark Matter Annihilation Signal
NASA Astrophysics Data System (ADS)
Yang, L.; Szalay, A.
2013-10-01
We present a novel GPU-based visualization method for computing the dark matter annihilation signal for cosmological dark matter simulations. The technique increased the speed of rendering by more than 1,000 times. In a previous study, using a code running on regular CPUs, each particle's contribution was explicitly calculated pixel by pixel over a HEALPIX map, then remapped onto a Molleweide projection. Using Via Lactea II simulation (˜ 400M particles), it takes over 7 hours for a single thread CPU (˜3 GHz) to complete an all-sky map with NSIDE=512 resolution. Our novel method is based on a separate stereographic projection for each hemisphere, and a hardware accelerated rendering pipeline on a GPU (OpenGL). We project the particles instead of the celestial sphere to the tangent plane with a skewed flux profile appropriate for the STR projection. OpenGL's Point Sprite feature and shader language allow us to render those eccentric circular flux profiles at the rate of more than 10M particles per second. The new method can process a single snapshot of the Via Lactea II data in less than 1 minute with a single NVIDIA GTX 480 GPU, including I/O, with effective rendering time less than 24 seconds. Using an approximate normalization for the flux, accurate to 2.5% in total flux, the rendering can be done in less than 13 seconds. The stereographic images corresponding to the two hemispheres are then warped to an all-sky image in the Molleweide projection, and are in good agreement with the result from the regular CPU code, at similar resolution.
Using the GPU based Model Tsunami-HySEA for the Italian CTSP
NASA Astrophysics Data System (ADS)
Gonzalez Vida, J. M., Sr.; Castro, M. J.; Macias, J.; de la Asuncion, M.; Molinari, I.; Melini, D.; Romano, F.; Tonini, R.; Lorito, S.; Piatanesi, A.
2015-12-01
The Istituto Nazionale di Geofisica e Vulcanologia of Italy (INGV) in collaboration with the EDANYA Group (University of Málaga) are proposing a FTRT (Faster Than Real Time) tsunami simulation approach that is being implemented in the NEAMTWS Italian CTSP, namely the Centro Allerta Tsunami (CAT), which is in pre-operational stage starting from 1 October 2014, in the 24/7 seismic monitoring room at INGV. We here present the different versions and capabilities of the NTHMP benchmarked HySEA model, developed by EDANYA Group. HySEA implements a multi-GPU version that can compute in several minutes the maximum amplitude and arrival time of the main tsunami wave in about 17,000 selected locations along the Mediterranean. At the same time, HySEA is implemented in nested meshes with different resolution and multi-GPU environment, which allows much faster than real time (below few minutes) inundation simulations. The performances of the code allows as well the preparation of a huge number of different pre-calculated scenarios that are being used for PTHA and for warning applications. Acknowledgements. This research has been partially supported by the Junta de Andalucía research project TESELA (P11-RNM7069), the Spanish Government Research project DAIFLUID (MTM2012-38383-C02-01). Also these results have received funding from the Italian Flagship Project RITMARE, from the INGV-DPC agreement, All.B2, and from EU FP7 project ASTARTE, Assessment, Strategy and Risk Reduction for Tsunamis in Europe grant n° 603839 (Project ASTARTE). The multi-GPU computationswere performed at the Laboratory of Numerical Methods (University of Malaga).
Discrete shearlet transform on GPU with applications in anomaly detection and denoising
NASA Astrophysics Data System (ADS)
Gibert, Xavier; Patel, Vishal M.; Labate, Demetrio; Chellappa, Rama
2014-12-01
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this was exploited in a wide range of applications from image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications
NASA Astrophysics Data System (ADS)
Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris
2015-07-01
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled 125I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms 10-6 simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
Quantum.Ligand.Dock: protein–ligand docking with quantum entanglement refinement on a GPU system
Kantardjiev, Alexander A.
2012-01-01
Quantum.Ligand.Dock (protein–ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein–ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and proton equilibria mutual effects of docking partners. On the other hand, Quantum.Ligand.Dock is the the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for development and proposition of the method to the community hinges upon two arguments—the fundamental importance of quantum entanglement contribution in molecular interaction and the realistic possibility to implement it by the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein–ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand molecule structures are supposed to be submitted as atomic coordinate files in PDB format. A customization section is offered for addition of user-specified charges, extra ionogenic groups with intrinsic pKa values or fixed ions. Final predicted complexes are ranked according to obtained scores and provided in PDB format as well as interactive visualization in a molecular viewer. Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html. PMID:22669908
NASA Astrophysics Data System (ADS)
Hammitzsch, M.; Spazier, J.; Reißland, S.
2014-12-01
Usually, tsunami early warning and mitigation systems (TWS or TEWS) are based on several software components deployed in a client-server based infrastructure. The vast majority of systems importantly include desktop-based clients with a graphical user interface (GUI) for the operators in early warning centers. However, in times of cloud computing and ubiquitous computing the use of concepts and paradigms, introduced by continuously evolving approaches in information and communications technology (ICT), have to be considered even for early warning systems (EWS). Based on the experiences and the knowledge gained in three research projects - 'German Indonesian Tsunami Early Warning System' (GITEWS), 'Distant Early Warning System' (DEWS), and 'Collaborative, Complex, and Critical Decision-Support in Evolving Crises' (TRIDEC) - new technologies are exploited to implement a cloud-based and web-based prototype to open up new prospects for EWS. This prototype, named 'TRIDEC Cloud', merges several complementary external and in-house cloud-based services into one platform for automated background computation with graphics processing units (GPU), for web-mapping of hazard specific geospatial data, and for serving relevant functionality to handle, share, and communicate threat specific information in a collaborative and distributed environment. The prototype in its current version addresses tsunami early warning and mitigation. The integration of GPU accelerated tsunami simulation computations have been an integral part of this prototype to foster early warning with on-demand tsunami predictions based on actual source parameters. However, the platform is meant for researchers around the world to make use of the cloud-based GPU computation to analyze other types of geohazards and natural hazards and react upon the computed situation picture with a web-based GUI in a web browser at remote sites. The current website is an early alpha version for demonstration purposes to give the
GPU-based fast Monte Carlo dose calculation for proton therapy.
Jia, Xun; Schümann, Jan; Paganetti, Harald; Jiang, Steve B
2012-12-07
Accurate radiation dose calculation is essential for successful proton radiotherapy. Monte Carlo (MC) simulation is considered to be the most accurate method. However, the long computation time limits it from routine clinical applications. Recently, graphics processing units (GPUs) have been widely used to accelerate computationally intensive tasks in radiotherapy. We have developed a fast MC dose calculation package, gPMC, for proton dose calculation on a GPU. In gPMC, proton transport is modeled by the class II condensed history simulation scheme with a continuous slowing down approximation. Ionization, elastic and inelastic proton nucleus interactions are considered. Energy straggling and multiple scattering are modeled. Secondary electrons are not transported and their energies are locally deposited. After an inelastic nuclear interaction event, a variety of products are generated using an empirical model. Among them, charged nuclear fragments are terminated with energy locally deposited. Secondary protons are stored in a stack and transported after finishing transport of the primary protons, while secondary neutral particles are neglected. gPMC is implemented on the GPU under the CUDA platform. We have validated gPMC using the TOPAS/Geant4 MC code as the gold standard. For various cases including homogeneous and inhomogeneous phantoms as well as a patient case, good agreements between gPMC and TOPAS/Geant4 are observed. The gamma passing rate for the 2%/2 mm criterion is over 98.7% in the region with dose greater than 10% maximum dose in all cases, excluding low-density air regions. With gPMC it takes only 6-22 s to simulate 10 million source protons to achieve ∼1% relative statistical uncertainty, depending on the phantoms and energy. This is an extremely high efficiency compared to the computational time of tens of CPU hours for TOPAS/Geant4. Our fast GPU-based code can thus facilitate the routine use of MC dose calculation in proton therapy.
Bridging FPGA and GPU technologies for AO real-time control
NASA Astrophysics Data System (ADS)
Perret, Denis; Lainé, Maxime; Bernard, Julien; Gratadour, Damien; Sevin, Arnaud
2016-07-01
Our team has developed a common environment for high performance simulations and real-time control of AO systems based on the use of Graphics Processors Units in the context of the COMPASS project. Such a solution, based on the ability of the real time core in the simulation to provide adequate computing performance, limits the cost of developing AO RTC systems and makes them more scalable. A code developed and validated in the context of the simulation may be injected directly into the system and tested on sky. Furthermore, the use of relatively low cost components also offers significant advantages for the system hardware platform. However, the use of GPUs in an AO loop comes with drawbacks: the traditional way of offloading computation from CPU to GPUs - involving multiple copies and unacceptable overhead in kernel launching - is not well suited in a real time context. This last application requires the implementation of a solution enabling direct memory access (DMA) to the GPU memory from a third party device, bypassing the operating system. This allows this device to communicate directly with the real-time core of the simulation feeding it with the WFS camera pixel stream. We show that DMA between a custom FPGA-based frame-grabber and a computation unit (GPU, FPGA, or Coprocessor such as Xeon-phi) across PCIe allows us to get latencies compatible with what will be needed on ELTs. As a fine-grained synchronization mechanism is not yet made available by GPU vendors, we propose the use of memory polling to avoid interrupts handling and involvement of a CPU. Network and Vision protocols are handled by the FPGA-based Network Interface Card (NIC). We present the results we obtained on a complete AO loop using camera and deformable mirror simulators.
TH-A-19A-09: Towards Sub-Second Proton Dose Calculation On GPU
Silva, J da
2014-06-15
Purpose: To achieve sub-second dose calculation for clinically relevant proton therapy treatment plans. Rapid dose calculation is a key component of adaptive radiotherapy, necessary to take advantage of the better dose conformity offered by hadron therapy. Methods: To speed up proton dose calculation, the pencil beam algorithm (PBA; clinical standard) was parallelised and implemented to run on a graphics processing unit (GPU). The implementation constitutes the first PBA to run all steps on GPU, and each part of the algorithm was carefully adapted for efficiency. Monte Carlo (MC) simulations obtained using Fluka of individual beams of energies representative of the clinical range impinging on simple geometries were used to tune the PBA. For benchmarking, a typical skull base case with a spot scanning plan consisting of a total of 8872 spots divided between two beam directions of 49 energy layers each was provided by CNAO (Pavia, Italy). The calculations were carried out on an Nvidia Geforce GTX680 desktop GPU with 1536 cores running at 1006 MHz. Results: The PBA reproduced within ±3% of maximum dose results obtained from MC simulations for a range of pencil beams impinging on a water tank. Additional analysis of more complex slab geometries is currently under way to fine-tune the algorithm. Full calculation of the clinical test case took 0.9 seconds in total, with the majority of the time spent in the kernel superposition step. Conclusion: The PBA lends itself well to implementation on many-core systems such as GPUs. Using the presented implementation and current hardware, sub-second dose calculation for a clinical proton therapy plan was achieved, opening the door for adaptive treatment. The successful parallelisation of all steps of the calculation indicates that further speedups can be expected with new hardware, brightening the prospects for real-time dose calculation. This work was funded by ENTERVISION, European Commission FP7 grant 264552.
GPU-based fast Monte Carlo dose calculation for proton therapy
NASA Astrophysics Data System (ADS)
Jia, Xun; Schümann, Jan; Paganetti, Harald; Jiang, Steve B.
2012-12-01
Accurate radiation dose calculation is essential for successful proton radiotherapy. Monte Carlo (MC) simulation is considered to be the most accurate method. However, the long computation time limits it from routine clinical applications. Recently, graphics processing units (GPUs) have been widely used to accelerate computationally intensive tasks in radiotherapy. We have developed a fast MC dose calculation package, gPMC, for proton dose calculation on a GPU. In gPMC, proton transport is modeled by the class II condensed history simulation scheme with a continuous slowing down approximation. Ionization, elastic and inelastic proton nucleus interactions are considered. Energy straggling and multiple scattering are modeled. Secondary electrons are not transported and their energies are locally deposited. After an inelastic nuclear interaction event, a variety of products are generated using an empirical model. Among them, charged nuclear fragments are terminated with energy locally deposited. Secondary protons are stored in a stack and transported after finishing transport of the primary protons, while secondary neutral particles are neglected. gPMC is implemented on the GPU under the CUDA platform. We have validated gPMC using the TOPAS/Geant4 MC code as the gold standard. For various cases including homogeneous and inhomogeneous phantoms as well as a patient case, good agreements between gPMC and TOPAS/Geant4 are observed. The gamma passing rate for the 2%/2 mm criterion is over 98.7% in the region with dose greater than 10% maximum dose in all cases, excluding low-density air regions. With gPMC it takes only 6-22 s to simulate 10 million source protons to achieve ˜1% relative statistical uncertainty, depending on the phantoms and energy. This is an extremely high efficiency compared to the computational time of tens of CPU hours for TOPAS/Geant4. Our fast GPU-based code can thus facilitate the routine use of MC dose calculation in proton therapy.
A new morphological anomaly detection algorithm for hyperspectral images and its GPU implementation
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2011-10-01
Anomaly detection is considered a very important task for hyperspectral data exploitation. It is now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we develop a new morphological algorithm for anomaly detection in hyperspectral images along with an efficient GPU implementation of the algorithm. The algorithm is implemented on latest-generation GPU architectures, and evaluated with regards to other anomaly detection algorithms using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex. The proposed GPU implementation achieves real-time performance in the considered case study.
GGEMS-Brachy: GPU GEant4-based Monte Carlo simulation for brachytherapy applications.
Lemaréchal, Yannick; Bert, Julien; Falconnet, Claire; Després, Philippe; Valeri, Antoine; Schick, Ulrike; Pradier, Olivier; Garcia, Marie-Paule; Boussion, Nicolas; Visvikis, Dimitris
2015-07-07
In brachytherapy, plans are routinely calculated using the AAPM TG43 formalism which considers the patient as a simple water object. An accurate modeling of the physical processes considering patient heterogeneity using Monte Carlo simulation (MCS) methods is currently too time-consuming and computationally demanding to be routinely used. In this work we implemented and evaluated an accurate and fast MCS on Graphics Processing Units (GPU) for brachytherapy low dose rate (LDR) applications. A previously proposed Geant4 based MCS framework implemented on GPU (GGEMS) was extended to include a hybrid GPU navigator, allowing navigation within voxelized patient specific images and analytically modeled (125)I seeds used in LDR brachytherapy. In addition, dose scoring based on track length estimator including uncertainty calculations was incorporated. The implemented GGEMS-brachy platform was validated using a comparison with Geant4 simulations and reference datasets. Finally, a comparative dosimetry study based on the current clinical standard (TG43) and the proposed platform was performed on twelve prostate cancer patients undergoing LDR brachytherapy. Considering patient 3D CT volumes of 400 × 250 × 65 voxels and an average of 58 implanted seeds, the mean patient dosimetry study run time for a 2% dose uncertainty was 9.35 s (≈500 ms 10(-6) simulated particles) and 2.5 s when using one and four GPUs, respectively. The performance of the proposed GGEMS-brachy platform allows envisaging the use of Monte Carlo simulation based dosimetry studies in brachytherapy compatible with clinical practice. Although the proposed platform was evaluated for prostate cancer, it is equally applicable to other LDR brachytherapy clinical applications. Future extensions will allow its application in high dose rate brachytherapy applications.
NASA Astrophysics Data System (ADS)
Perkins, S. J.; Marais, P. C.; Zwart, J. T. L.; Natarajan, I.; Tasse, C.; Smirnov, O.
2015-09-01
We present Montblanc, a GPU implementation of the Radio interferometer measurement equation (RIME) in support of the Bayesian inference for radio observations (BIRO) technique. BIRO uses Bayesian inference to select sky models that best match the visibilities observed by a radio interferometer. To accomplish this, BIRO evaluates the RIME multiple times, varying sky model parameters to produce multiple model visibilities. χ2 values computed from the model and observed visibilities are used as likelihood values to drive the Bayesian sampling process and select the best sky model. As most of the elements of the RIME and χ2 calculation are independent of one another, they are highly amenable to parallel computation. Additionally, Montblanc caters for iterative RIME evaluation to produce multiple χ2 values. Modified model parameters are transferred to the GPU between each iteration. We implemented Montblanc as a Python package based upon NVIDIA's CUDA architecture. As such, it is easy to extend and implement different pipelines. At present, Montblanc supports point and Gaussian morphologies, but is designed for easy addition of new source profiles. Montblanc's RIME implementation is performant: On an NVIDIA K40, it is approximately 250 times faster than MEQTREES on a dual hexacore Intel E5-2620v2 CPU. Compared to the OSKAR simulator's GPU-implemented RIME components it is 7.7 and 12 times faster on the same K40 for single and double-precision floating point respectively. However, OSKAR's RIME implementation is more general than Montblanc's BIRO-tailored RIME. Theoretical analysis of Montblanc's dominant CUDA kernel suggests that it is memory bound. In practice, profiling shows that is balanced between compute and memory, as much of the data required by the problem is retained in L1 and L2 caches.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations.
Kutzner, Carsten; Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L; Grubmüller, Helmut
2015-10-05
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
Páll, Szilárd; Fechner, Martin; Esztermann, Ansgar; de Groot, Bert L.; Grubmüller, Helmut
2015-01-01
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well‐exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)‐based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off‐loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer‐class GPUs this improvement equally reflects in the performance‐to‐price ratio. Although memory issues in consumer‐class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost‐efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well‐balanced ratio of CPU and consumer‐class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc. PMID:26238484
Real-time optical flow estimation on a GPU for a skied-steered mobile robot
NASA Astrophysics Data System (ADS)
Kniaz, V. V.
2016-04-01
Accurate egomotion estimation is required for mobile robot navigation. Often the egomotion is estimated using optical flow algorithms. For an accurate estimation of optical flow most of modern algorithms require high memory resources and processor speed. However simple single-board computers that control the motion of the robot usually do not provide such resources. On the other hand, most of modern single-board computers are equipped with an embedded GPU that could be used in parallel with a CPU to improve the performance of the optical flow estimation algorithm. This paper presents a new Z-flow algorithm for efficient computation of an optical flow using an embedded GPU. The algorithm is based on the phase correlation optical flow estimation and provide a real-time performance on a low cost embedded GPU. The layered optical flow model is used. Layer segmentation is performed using graph-cut algorithm with a time derivative based energy function. Such approach makes the algorithm both fast and robust in low light and low texture conditions. The algorithm implementation for a Raspberry Pi Model B computer is discussed. For evaluation of the algorithm the computer was mounted on a Hercules mobile skied-steered robot equipped with a monocular camera. The evaluation was performed using a hardware-in-the-loop simulation and experiments with Hercules mobile robot. Also the algorithm was evaluated using KITTY Optical Flow 2015 dataset. The resulting endpoint error of the optical flow calculated with the developed algorithm was low enough for navigation of the robot along the desired trajectory.
The efficient implementation of correction procedure via reconstruction with GPU computing
NASA Astrophysics Data System (ADS)
Zimmerman, Ben J.
Computational fluid dynamics (CFD) has long been a useful tool to model fluid flow problems across many engineering disciplines, and while problem size, complexity, and difficulty continue to expand, the demands for robustness and accuracy grow. Furthermore, generating high-order accurate solutions has escalated the required computational resources, and as problems continue to increase in complexity, so will computational needs such as memory requirements and calculation time for accurate flow field prediction. To improve upon computational time, vast amounts of computational power and resources are employed, but even over dozens to hundreds of central processing units (CPUs), the required computational time to formulate solutions can be weeks, months, or longer, which is particularly true when generating high-order accurate solutions over large computational domains. One response to lower the computational time for CFD problems is to implement graphical processing units (GPUs) with current CFD solvers. GPUs have illustrated the ability to solve problems orders of magnitude faster than their CPU counterparts with identical accuracy. The goal of the presented work is to combine a CFD solver and GPU computing with the intent to solve complex problems at a high-order of accuracy while lowering the computational time required to generate the solution. The CFD solver should have high-order spacial capabilities to evaluate small fluctuations and fluid structures not generally captured by lower-order methods and be efficient for the GPU architecture. This research combines the high-order Correction Procedure via Reconstruction (CPR) method with compute unified device architecture (CUDA) from NVIDIA to reach these goals. In addition, the study demonstrates accuracy of the developed solver by comparing results with other solvers and exact solutions. Solving CFD problems accurately and quickly are two factors to consider for the next generation of solvers. GPU computing is a
Quantum.Ligand.Dock: protein-ligand docking with quantum entanglement refinement on a GPU system.
Kantardjiev, Alexander A
2012-07-01
Quantum.Ligand.Dock (protein-ligand docking with graphic processing unit (GPU) quantum entanglement refinement on a GPU system) is an original modern method for in silico prediction of protein-ligand interactions via high-performance docking code. The main flavour of our approach is a combination of fast search with a special account for overlooked physical interactions. On the one hand, we take care of self-consistency and proton equilibria mutual effects of docking partners. On the other hand, Quantum.Ligand.Dock is the the only docking server offering such a subtle supplement to protein docking algorithms as quantum entanglement contributions. The motivation for development and proposition of the method to the community hinges upon two arguments-the fundamental importance of quantum entanglement contribution in molecular interaction and the realistic possibility to implement it by the availability of supercomputing power. The implementation of sophisticated quantum methods is made possible by parallelization at several bottlenecks on a GPU supercomputer. The high-performance implementation will be of use for large-scale virtual screening projects, structural bioinformatics, systems biology and fundamental research in understanding protein-ligand recognition. The design of the interface is focused on feasibility and ease of use. Protein and ligand molecule structures are supposed to be submitted as atomic coordinate files in PDB format. A customization section is offered for addition of user-specified charges, extra ionogenic groups with intrinsic pK(a) values or fixed ions. Final predicted complexes are ranked according to obtained scores and provided in PDB format as well as interactive visualization in a molecular viewer. Quantum.Ligand.Dock server can be accessed at http://87.116.85.141/LigandDock.html.
Accelerated Monte Carlo simulation on the chemical stage in water radiolysis using GPU
NASA Astrophysics Data System (ADS)
Tian, Zhen; Jiang, Steve B.; Jia, Xun
2017-04-01
The accurate simulation of water radiolysis is an important step to understand the mechanisms of radiobiology and quantitatively test some hypotheses regarding radiobiological effects. However, the simulation of water radiolysis is highly time consuming, taking hours or even days to be completed by a conventional CPU processor. This time limitation hinders cell-level simulations for a number of research studies. We recently initiated efforts to develop gMicroMC, a GPU-based fast microscopic MC simulation package for water radiolysis. The first step of this project focused on accelerating the simulation of the chemical stage, the most time consuming stage in the entire water radiolysis process. A GPU-friendly parallelization strategy was designed to address the highly correlated many-body simulation problem caused by the mutual competitive chemical reactions between the radiolytic molecules. Two cases were tested, using a 750 keV electron and a 5 MeV proton incident in pure water, respectively. The time-dependent yields of all the radiolytic species during the chemical stage were used to evaluate the accuracy of the simulation. The relative differences between our simulation and the Geant4-DNA simulation were on average 5.3% and 4.4% for the two cases. Our package, executed on an Nvidia Titan black GPU card, successfully completed the chemical stage simulation of the two cases within 599.2 s and 489.0 s. As compared with Geant4-DNA that was executed on an Intel i7-5500U CPU processor and needed 28.6 h and 26.8 h for the two cases using a single CPU core, our package achieved a speed-up factor of 171.1–197.2.
NASA Technical Reports Server (NTRS)
Putnam, William M.
2011-01-01
Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions
Kim, Byungyeon; Park, Byungjun; Lee, Seungrag; Won, Youngjae
2016-01-01
We demonstrated GPU accelerated real-time confocal fluorescence lifetime imaging microscopy (FLIM) based on the analog mean-delay (AMD) method. Our algorithm was verified for various fluorescence lifetimes and photon numbers. The GPU processing time was faster than the physical scanning time for images up to 800 × 800, and more than 149 times faster than a single core CPU. The frame rate of our system was demonstrated to be 13 fps for a 200 × 200 pixel image when observing maize vascular tissue. This system can be utilized for observing dynamic biological reactions, medical diagnosis, and real-time industrial inspection. PMID:28018724
Huang Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung
2011-03-20
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundreds. To take the full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is very suitable for the GPU implementation to take advantage of the hardware's efficiency and parallelism where radiances of many channels can be calculated in parallel in GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellites, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with total 960 compute cores, delivering near 4 TFlops theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x
GPU-based four-dimensional general-relativistic ray tracing
NASA Astrophysics Data System (ADS)
Kuchelmeister, Daniel; Müller, Thomas; Ament, Marco; Wunner, Günter; Weiskopf, Daniel
2012-10-01
This paper presents a new general-relativistic ray tracer that enables image synthesis on an interactive basis by exploiting the performance of graphics processing units (GPUs). The application is capable of visualizing the distortion of the stellar background as well as trajectories of moving astronomical objects orbiting a compact mass. Its source code includes metric definitions for the Schwarzschild and Kerr spacetimes that can be easily extended to other metric definitions, relying on its object-oriented design. The basic functionality features a scene description interface based on the scripting language Lua, real-time image output, and the ability to edit almost every parameter at runtime. The ray tracing code itself is implemented for parallel execution on the GPU using NVidia's Compute Unified Device Architecture (CUDA), which leads to performance improvement of an order of magnitude compared to a single CPU and makes the application competitive with small CPU cluster architectures. Program summary Program title: GpuRay4D Catalog identifier: AEMV_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMV_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 73649 No. of bytes in distributed program, including test data, etc.: 1334251 Distribution format: tar.gz Programming language: C++, CUDA. Computer: Linux platforms with a NVidia CUDA enabled GPU (Compute Capability 1.3 or higher), C++ compiler, NVCC (The CUDA Compiler Driver). Operating system: Linux. RAM: 2 GB Classification: 1.5. External routines: OpenGL Utility Toolkit development files, NVidia CUDA Toolkit 3.2, Lua5.2 Nature of problem: Ray tracing in four-dimensional Lorentzian spacetimes. Solution method: Numerical integration of light rays, GPU-based parallel programming using CUDA, 3D
Software Graphics Processing Unit (sGPU) for Deep Space Applications
NASA Technical Reports Server (NTRS)
McCabe, Mary; Salazar, George; Steele, Glen
2015-01-01
A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggest they cannot operate in the deep space radiation environment. Investigation into an Software Graphics Processing Unit (sGPU)comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.
A fast and memory-sparing probabilistic selection algorithm for the GPU
Monroe, Laura M; Wendelberger, Joanne; Michalak, Sarah
2010-09-29
A fast and memory-sparing probabilistic top-N selection algorithm is implemented on the GPU. This probabilistic algorithm gives a deterministic result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces both the memory requirements and the average time required for the algorithm. This algorithm is well-suited to more general parallel processors with multiple layers of memory hierarchy. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be especially useful for processors having a limited amount of fast memory available.
Multi-GPU three dimensional Stokes solver for simulating glacier flow
NASA Astrophysics Data System (ADS)
Licul, Aleksandar; Herman, Frédéric; Podladchikov, Yuri; Räss, Ludovic; Omlin, Samuel
2016-04-01
Here we present how we have recently developed a three-dimensional Stokes solver on the GPUs and apply it to a glacier flow. We numerically solve the Stokes momentum balance equations together with the incompressibility equation, while also taking into account strong nonlinearities for ice rheology. We have developed a fully three-dimensional numerical MATLAB application based on an iterative finite difference scheme with preconditioning of residuals. Differential equations are discretized on a regular staggered grid. We have ported it to C-CUDA to run it on GPU's in parallel, using MPI. We demonstrate the accuracy and efficiency of our developed model by manufactured analytical solution test for three-dimensional Stokes ice sheet models (Leng et al.,2013) and by comparison with other well-established ice sheet models on diagnostic ISMIP-HOM benchmark experiments (Pattyn et al., 2008). The results show that our developed model is capable to accurately and efficiently solve Stokes system of equations in a variety of different test scenarios, while preserving good parallel efficiency on up to 80 GPU's. For example, in 3D test scenarios with 250000 grid points our solver converges in around 3 minutes for single precision computations and around 10 minutes for double precision computations. We have also optimized the developed code to efficiently run on our newly acquired state-of-the-art GPU cluster octopus. This allows us to solve our problem on more than 20 million grid points, by just increasing the number of GPU used, while keeping the computation time the same. In future work we will apply our solver to real world applications and implement the free surface evolution capabilities. REFERENCES Leng,W.,Ju,L.,Gunzburger,M. & Price,S., 2013. Manufactured solutions and the verification of three-dimensional stokes ice-sheet models. Cryosphere 7,19-29. Pattyn, F., Perichon, L., Aschwanden, A., Breuer, B., de Smedt, B., Gagliardini, O., Gudmundsson,G.H., Hindmarsh, R
GPU-optimized Code for Long-term Simulations of Beam-beam Effects in Colliders
Roblin, Yves; Morozov, Vasiliy; Terzic, Balsa; Aturban, Mohamed A.; Ranjan, D.; Zubair, Mohammed
2013-06-01
We report on the development of the new code for long-term simulation of beam-beam effects in particle colliders. The underlying physical model relies on a matrix-based arbitrary-order symplectic particle tracking for beam transport and the Bassetti-Erskine approximation for beam-beam interaction. The computations are accelerated through a parallel implementation on a hybrid GPU/CPU platform. With the new code, a previously computationally prohibitive long-term simulations become tractable. We use the new code to model the proposed medium-energy electron-ion collider (MEIC) at Jefferson Lab.
Interesting Spatio-Temporal Region Discovery Computations Over Gpu and Mapreduce Platforms
NASA Astrophysics Data System (ADS)
McDermott, M.; Prasad, S. K.; Shekhar, S.; Zhou, X.
2015-07-01
Discovery of interesting paths and regions in spatio-temporal data sets is important to many fields such as the earth and atmospheric sciences, GIS, public safety and public health both as a goal and as a preliminary step in a larger series of computations. This discovery is usually an exhaustive procedure that quickly becomes extremely time consuming to perform using traditional paradigms and hardware and given the rapidly growing sizes of today's data sets is quickly outpacing the speed at which computational capacity is growing. In our previous work (Prasad et al., 2013a) we achieved a 50 times speedup over sequential using a single GPU. We were able to achieve near linear speedup over this result on interesting path discovery by using Apache Hadoop to distribute the workload across multiple GPU nodes. Leveraging the parallel architecture of GPUs we were able to drastically reduce the computation time of a 3-dimensional spatio-temporal interest region search on a single tile of normalized difference vegetative index for Saudi Arabia. We were further able to see an almost linear speedup in compute performance by distributing this workload across several GPUs with a simple MapReduce model. This increases the speed of processing 10 fold over the comparable sequential while simultaneously increasing the amount of data being processed by 384 fold. This allowed us to process the entirety of the selected data set instead of a constrained window.
Feasibility of GPU-assisted iterative image reconstruction for mobile C-arm CT
NASA Astrophysics Data System (ADS)
Pan, Yongsheng; Whitaker, Ross; Cheryauka, Arvi; Ferguson, Dave
2009-02-01
Computed tomography (CT) has been extensively studied and widely used for a variety of medical applications. The reconstruction of 3D images from a projection series is an important aspect of the modality. Reconstruction by filtered backprojection (FBP) is used by most manufacturers because of speed, ease of implementation, and relatively few parameters. Iterative reconstruction methods have a significant potential to provide superior performance with incomplete or noisy data, or with less than ideal geometries, such as cone-beam systems. However, iterative methods have a high computational cost, and regularization is usually required to reduce the effects of noise. The simultaneous algebraic reconstruction technique (SART) is studied in this paper, where the Feldkamp method (FDK) for filtered back projection is used as an initialization for iterative SART. Additionally, graphics hardware is utilized to increase the speed of SART implementation. Nvidia processors and compute unified device architecture (CUDA) form the platform for GPU computation. Total variation (TV) minimization is applied for the regularization of SART results. Preliminary results of SART on 3-D Shepp-Logan phantom using using TV regularization and GPU computation are presented in this paper. Potential improvements of the proposed framework are also discussed.
Multi-core/GPU accelerated multi-resolution simulations of compressible flows
NASA Astrophysics Data System (ADS)
Hejazialhosseini, Babak; Rossinelli, Diego; Koumoutsakos, Petros
2010-11-01
We develop a multi-resolution solver for single and multi-phase compressible flow simulations by coupling average interpolating wavelets and local time stepping schemes with high order finite volume schemes. Wavelets allow for high compression rates and explicit control over the error in adaptive representation of the flow field, but their efficient parallel implementation is hindered by the use of traditional data parallel models. In this work we demonstrate that this methodology can be implemented so that it can benefit from the processing power of emerging hybrid multicore and multi-GPU architectures. This is achieved by exploiting task-based parallelism paradigm and the concept of wavelet blocks combined with OpenCL and Intel Threading Building Blocks. The solver is able to handle high resolution jumps and benefits from adaptive time integration using local time stepping schemes as implemented on heterogeneous multi-core/GPU architectures. We demonstrate the accuracy of our method and the performance of our solver on different architectures for 2D simulations of shock-bubble interaction and Richtmeyer-Meshkov instability.
JiTTree: A Just-in-Time Compiled Sparse GPU Volume Data Structure.
Labschütz, Matthias; Bruckner, Stefan; Gröller, M Eduard; Hadwiger, Markus; Rautek, Peter
2016-01-01
Sparse volume data structures enable the efficient representation of large but sparse volumes in GPU memory for computation and visualization. However, the choice of a specific data structure for a given data set depends on several factors, such as the memory budget, the sparsity of the data, and data access patterns. In general, there is no single optimal sparse data structure, but a set of several candidates with individual strengths and drawbacks. One solution to this problem are hybrid data structures which locally adapt themselves to the sparsity. However, they typically suffer from increased traversal overhead which limits their utility in many applications. This paper presents JiTTree, a novel sparse hybrid volume data structure that uses just-in-time compilation to overcome these problems. By combining multiple sparse data structures and reducing traversal overhead we leverage their individual advantages. We demonstrate that hybrid data structures adapt well to a large range of data sets. They are especially superior to other sparse data structures for data sets that locally vary in sparsity. Possible optimization criteria are memory, performance and a combination thereof. Through just-in-time (JIT) compilation, JiTTree reduces the traversal overhead of the resulting optimal data structure. As a result, our hybrid volume data structure enables efficient computations on the GPU, while being superior in terms of memory usage when compared to non-hybrid data structures.
Edge-preserving image denoising via group coordinate descent on the GPU.
McGaffin, Madison Gray; Fessler, Jeffrey A
2015-04-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel 1D pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and only store the noisy data, denoised image, and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation. Both algorithms use the majorize-minimize framework to solve the 1D pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time.
GRay: A MASSIVELY PARALLEL GPU-BASED CODE FOR RAY TRACING IN RELATIVISTIC SPACETIMES
Chan, Chi-kwan; Psaltis, Dimitrios; Özel, Feryal
2013-11-01
We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons in a curved spacetime. This graphics-processing-unit (GPU)-based integrator employs the stream processing paradigm, is implemented in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single-precision floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 ns per photon per time step). For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster than existing central-processing-unit-based ray-tracing codes. This performance enhancement allows more effective searches of large parameter spaces when comparing theoretical predictions of images, spectra, and light curves from the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky Way, as well as M87, with the Event Horizon Telescope.
GPU accelerated non-rigid registration for the evaluation of cardiac function.
Li, Bo; Young, Alistair A; Cowan, Brett R
2008-01-01
We present a method for the fast and efficient tracking of motion in cardiac magnetic resonance (CMR) cines. A GPU accelerated Levenberg-Marquardt non-linear least squares optimization procedure for finite element non-rigid registration was implemented on an NVIDIA graphics card using the OpenGL environment. Points were tracked from frame to frame using forward and backward incremental registration. The inner (endocardial) and outer (epicardial) boarders of the heart were tracked in six short axis cines with approximately 25 frames through the cardiac cycle in 36 patients with vascular disease. Contours placed by two independent expert observers using a semi-automatic ventricular analysis program (CIM version 4.6) were used as the gold standard. The method took 0.5 seconds per frame, and the maximum Hausdorff errors were less than 2 mm on average which was of the same order as the expert inter-observer error. In conclusion, GPU accelerated Levenberg-Marquardt non-linear optimization enables fast and accurate tracking of cardiac motion in CMR images.
ODYSSEY: A PUBLIC GPU-BASED CODE FOR GENERAL RELATIVISTIC RADIATIVE TRANSFER IN KERR SPACETIME
Pu, Hung-Yi; Younsi, Ziri
2016-04-01
General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge–Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey-Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.
Examination of nanoparticle dispersion using a novel GPU based radial distribution function code
NASA Astrophysics Data System (ADS)
Rosch, Thomas; Wade, Matthew; Phelan, Frederick
We have developed a novel GPU-based code that rapidly calculates radial distribution function (RDF) for an entire system, with no cutoff, ensuring accuracy. Built on top of this code, we have developed tools to calculate the second virial coefficient (B2) and the structure factor from the RDF, two properties that are directly related to the dispersion of nanoparticles in nancomposite systems. We validate the RDF calculations by comparison with previously published results, and also show how our code, which takes into account bonding in polymeric systems, enables more accurate predictions of g(r) than current state of the art GPU-based RDF codes currently available for these systems. In addition, our code reduces the computational time by approximately an order of magnitude compared to CPU-based calculations. We demonstrate the application of our toolset by the examination of a coarse-grained nanocomposite system and show how different surface energies between particle and polymer lead to different dispersion states, and effect properties such as viscosity, yield strength, elasticity, and thermal conductivity.
GPU simulation of nonlinear propagation of dual band ultrasound pulse complexes
Kvam, Johannes Angelsen, Bjørn A. J.; Elster, Anne C.
2015-10-28
In a new method of ultrasound imaging, called SURF imaging, dual band pulse complexes composed of overlapping low frequency (LF) and high frequency (HF) pulses are transmitted, where the frequency ratio LF:HF ∼ 1 : 20, and the relative bandwidth of both pulses are ∼ 50 − 70%. The LF pulse length is hence ∼ 20 times the HF pulse length. The LF pulse is used to nonlinearly manipulate the material elasticity observed by the co-propagating HF pulse. This produces nonlinear interaction effects that give more information on the propagation of the pulse complex. Due to the large difference in frequency and pulse length between the LF and the HF pulses, we have developed a dual level simulation where the LF pulse propagation is first simulated independent of the HF pulse, using a temporal sampling frequency matched to the LF pulse. A separate equation for the HF pulse is developed, where the the presimulated LF pulse modifies the propagation velocity. The equations are adapted to parallel processing in a GPU, where nonlinear simulations of a typical HF beam of 10 MHz down to 40 mm is done in ∼ 2 secs in a standard GPU. This simulation is hence very useful for studying the manipulation effect of the LF pulse on the HF pulse.
GPU-based minimum variance beamformer for synthetic aperture imaging of the eye.
Yiu, Billy Y S; Yu, Alfred C H
2015-03-01
Minimum variance (MV) beamforming has emerged as an adaptive apodization approach to bolster the quality of images generated from synthetic aperture ultrasound imaging methods that are based on unfocused transmission principles. In this article, we describe a new high-speed, pixel-based MV beamforming framework for synthetic aperture imaging to form entire frames of adaptively apodized images at real-time throughputs and document its performance in swine eye imaging case examples. Our framework is based on parallel computing principles, and its real-time operational feasibility was realized on a six-GPU (graphics processing unit) platform with 3,072 computing cores. This framework was used to form images with synthetic aperture imaging data acquired from swine eyes (based on virtual point-source emissions). Results indicate that MV-apodized image formation with video-range processing throughput (>20 fps) can be realized for practical aperture sizes (128 channels) and frames with λ/2 pixel spacing. Also, in a corneal wound detection experiment, MV-apodized images generated using our framework revealed apparent contrast enhancement of the wound site (10.8 dB with respect to synthetic aperture images formed with fixed apodization). These findings indicate that GPU-based MV beamforming can, in real time, potentially enhance image quality when performing synthetic aperture imaging that uses unfocused firings.
The GENGA code: gravitational encounters in N-body simulations with GPU acceleration
Grimm, Simon L.; Stadel, Joachim G.
2014-11-20
We describe an open source GPU implementation of a hybrid symplectic N-body integrator, GENGA (Gravitational ENcounters with Gpu Acceleration), designed to integrate planet and planetesimal dynamics in the late stage of planet formation and stability analyses of planetary systems. GENGA uses a hybrid symplectic integrator to handle close encounters with very good energy conservation, which is essential in long-term planetary system integration. We extended the second-order hybrid integration scheme to higher orders. The GENGA code supports three simulation modes: integration of up to 2048 massive bodies, integration with up to a million test particles, or parallel integration of a large number of individual planetary systems. We compare the results of GENGA to Mercury and pkdgrav2 in terms of energy conservation and performance and find that the energy conservation of GENGA is comparable to Mercury and around two orders of magnitude better than pkdgrav2. GENGA runs up to 30 times faster than Mercury and up to 8 times faster than pkdgrav2. GENGA is written in CUDA C and runs on all NVIDIA GPUs with a computing capability of at least 2.0.
NASA Astrophysics Data System (ADS)
Mena, Andres; Ferrero, Jose M.; Rodriguez Matas, Jose F.
2015-11-01
Solving the electric activity of the heart possess a big challenge, not only because of the structural complexities inherent to the heart tissue, but also because of the complex electric behaviour of the cardiac cells. The multi-scale nature of the electrophysiology problem makes difficult its numerical solution, requiring temporal and spatial resolutions of 0.1 ms and 0.2 mm respectively for accurate simulations, leading to models with millions degrees of freedom that need to be solved for thousand time steps. Solution of this problem requires the use of algorithms with higher level of parallelism in multi-core platforms. In this regard the newer programmable graphic processing units (GPU) has become a valid alternative due to their tremendous computational horsepower. This paper presents results obtained with a novel electrophysiology simulation software entirely developed in Compute Unified Device Architecture (CUDA). The software implements fully explicit and semi-implicit solvers for the monodomain model, using operator splitting. Performance is compared with classical multi-core MPI based solvers operating on dedicated high-performance computer clusters. Results obtained with the GPU based solver show enormous potential for this technology with accelerations over 50 × for three-dimensional problems.
High-throughput protein crystallization on the World Community Grid and the GPU
NASA Astrophysics Data System (ADS)
Kotseruba, Yulia; Cumbaa, Christian A.; Jurisica, Igor
2012-02-01
We have developed CPU and GPU versions of an automated image analysis and classification system for protein crystallization trial images from the Hauptman Woodward Institute's High-Throughput Screening lab. The analysis step computes 12,375 numerical features per image. Using these features, we have trained a classifier that distinguishes 11 different crystallization outcomes, recognizing 80% of all crystals, 94% of clear drops, 94% of precipitates. The computing requirements for this analysis system are large. The complete HWI archive of 120 million images is being processed by the donated CPU cycles on World Community Grid, with a GPU phase launching in early 2012. The main computational burden of the analysis is the measure of textural (GLCM) features within the image at multiple neighbourhoods, distances, and at multiple greyscale intensity resolutions. CPU runtime averages 4,092 seconds (single threaded) on an Intel Xeon, but only 65 seconds on an NVIDIA Tesla C2050. We report on the process of adapting the C++ code to OpenCL, optimized for multiple platforms.
Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems
Dongarra, Jack J.; Tomov, Stanimire
2014-03-24
The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energy efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.
A GPU accelerated Barnes-Hut tree code for FLASH4
NASA Astrophysics Data System (ADS)
Lukat, Gunther; Banerjee, Robi
2016-05-01
We present a GPU accelerated CUDA-C implementation of the Barnes Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. We use a MacLaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores with effective AMR to carry out performance tests also in comparison with previous gravity solvers. Depending on the setup and the GPU/CPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison to the gravity solvers implemented in the FLASH4 code. We find an overall speedup factor for full simulations of at least factor 1.6 up to a factor of 10.
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B
2011-06-03
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
Real-time GPU implementation of transverse oscillation vector velocity flow imaging
NASA Astrophysics Data System (ADS)
Bradway, David Pierson; Pihl, Michael Johannes; Krebs, Andreas; Tomov, Borislav Gueorguiev; Kjær, Carsten Straso; Nikolov, Svetoslav Ivanov; Jensen, Jørgen Arendt
2014-03-01
Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work, Open Computing Language (OpenCL) is used to estimate 2-D vector velocity flow in vivo in the carotid artery. Data are streamed live from a BK Medical 2202 Pro Focus UltraView Scanner to a workstation running a research interface software platform. Processing data from a 50 millisecond frame of a duplex vector flow acquisition takes 2.3 milliseconds seconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner's built-in processing, it enables new opportunities for prototyping novel algorithms, optimizing processing parameters, and accelerating the path from development lab to clinic.
GPU Based Real-time Instrument Tracking with Three Dimensional Ultrasound
Novotny, Paul M.; Stoll, Jeff A.; Vasilyev, Nikolay V.; Del Nido, Pedro J.; Dupont, Pierre E.; Howe, Robert D.
2009-01-01
Real-time three-dimensional ultrasound enables new intra-cardiac surgical procedures, but the distorted appearance of instruments in ultrasound poses a challenge to surgeons. This paper presents a detection technique that identifies the position of the instrument within the ultrasound volume. The algorithm uses a form of the generalized Radon transform to search for long straight objects in the ultrasound image, a feature characteristic of instruments and not found in cardiac tissue. When combined with passive markers placed on the instrument shaft, the full position and orientation of the instrument is found in 3D space. This detection technique is amenable to rapid execution on the current generation of personal computer graphics processor units (GPU). Our GPU implementation detected a surgical instrument in 31 ms, sufficient for real-time tracking at the 25 volumes per second rate of the ultrasound machine. A water tank experiment found instrument orientation errors of 1.1 degrees and tip position errors of less than 1.8 mm. Finally, an in vivo study demonstrated successful instrument tracking inside a beating porcine heart. PMID:17681483
CudaChain: an alternative algorithm for finding 2D convex hulls on the GPU.
Mei, Gang
2016-01-01
This paper presents an alternative GPU-accelerated convex hull algorithm and a novel S orting-based P reprocessing A pproach (SPA) for planar point sets. The proposed convex hull algorithm termed as CudaChain consists of two stages: (1) two rounds of preprocessing performed on the GPU and (2) the finalization of calculating the expected convex hull on the CPU. Those interior points locating inside a quadrilateral formed by four extreme points are first discarded, and then the remaining points are distributed into several (typically four) sub regions. For each subset of points, they are first sorted in parallel; then the second round of discarding is performed using SPA; and finally a simple chain is formed for the current remaining points. A simple polygon can be easily generated by directly connecting all the chains in sub regions. The expected convex hull of the input points can be finally obtained by calculating the convex hull of the simple polygon. The library Thrust is utilized to realize the parallel sorting, reduction, and partitioning for better efficiency and simplicity. Experimental results show that: (1) SPA can very effectively detect and discard the interior points; and (2) CudaChain achieves 5×-6× speedups over the famous Qhull implementation for 20M points.
Edge-preserving image denoising via group coordinate descent on the GPU
McGaffin, Madison G.; Fessler, Jeffrey A.
2015-01-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel one-dimensional pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates inplace and only store the noisy data, denoised image and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation (TV). Both algorithms use the majorize-minimize (MM) framework to solve the one-dimensional pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time. PMID:25675454
A GPU-Based Gibbs Sampler for a Unidimensional IRT Model
Welling, William S.; Zhu, Michelle M.
2014-01-01
Item response theory (IRT) is a popular approach used for addressing large-scale statistical problems in psychometrics as well as in other fields. The fully Bayesian approach for estimating IRT models is usually memory and computationally expensive due to the large number of iterations. This limits the use of the procedure in many applications. In an effort to overcome such restraint, previous studies focused on utilizing the message passing interface (MPI) in a distributed memory-based Linux cluster to achieve certain speedups. However, given the high data dependencies in a single Markov chain for IRT models, the communication overhead rapidly grows as the number of cluster nodes increases. This makes it difficult to further improve the performance under such a parallel framework. This study aims to tackle the problem using massive core-based graphic processing units (GPU), which is practical, cost-effective, and convenient in actual applications. The performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and therefore is preferred. PMID:27355058
GPU-accelerated computing for Lagrangian coherent structures of multi-body gravitational regimes
NASA Astrophysics Data System (ADS)
Lin, Mingpei; Xu, Ming; Fu, Xiaoyu
2017-04-01
Based on a well-established theoretical foundation, Lagrangian Coherent Structures (LCSs) have elicited widespread research on the intrinsic structures of dynamical systems in many fields, including the field of astrodynamics. Although the application of LCSs in dynamical problems seems straightforward theoretically, its associated computational cost is prohibitive. We propose a block decomposition algorithm developed on Compute Unified Device Architecture (CUDA) platform for the computation of the LCSs of multi-body gravitational regimes. In order to take advantage of GPU's outstanding computing properties, such as Shared Memory, Constant Memory, and Zero-Copy, the algorithm utilizes a block decomposition strategy to facilitate computation of finite-time Lyapunov exponent (FTLE) fields of arbitrary size and timespan. Simulation results demonstrate that this GPU-based algorithm can satisfy double-precision accuracy requirements and greatly decrease the time needed to calculate final results, increasing speed by approximately 13 times. Additionally, this algorithm can be generalized to various large-scale computing problems, such as particle filters, constellation design, and Monte-Carlo simulation.
Alternating dual updates algorithm for X-ray CT reconstruction on the GPU
McGaffin, Madison G.; Fessler, Jeffrey A.
2015-01-01
Model-based image reconstruction (MBIR) for X-ray computed tomography (CT) offers improved image quality and potential low-dose operation, but has yet to reach ubiquity in the clinic. MBIR methods form an image by solving a large statistically motivated optimization problem, and the long time it takes to numerically solve this problem has hampered MBIR’s adoption. We present a new optimization algorithm for X-ray CT MBIR based on duality and group coordinate ascent that may converge even with approximate updates and can handle a wide range of regularizers, including total variation (TV). The algorithm iteratively updates groups of dual variables corresponding to terms in the cost function; these updates are highly parallel and map well onto the GPU. Although the algorithm stores a large number of variables, the “working size” for each of the algorithm’s steps is small and can be efficiently streamed to the GPU while other calculations are being performed. The proposed algorithm converges rapidly on both real and simulated data and shows promising parallelization over multiple devices. PMID:26878031
Sound speed estimation using wave-based ultrasound tomography: theory and GPU implementation
NASA Astrophysics Data System (ADS)
Roy, O.; Jovanović, I.; Hormati, A.; Parhizkar, R.; Vetterli, M.
2010-03-01
We present preliminary results obtained using a time domain wave-based reconstruction algorithm for an ultrasound transmission tomography scanner with a circular geometry. While a comprehensive description of this type of algorithm has already been given elsewhere, the focus of this work is on some practical issues arising with this approach. In fact, wave-based reconstruction methods suffer from two major drawbacks which limit their application in a practical setting: convergence is difficult to obtain and the computational cost is prohibitive. We address the first problem by appropriate initialization using a ray-based reconstruction. Then, the complexity of the method is reduced by means of an efficient parallel implementation on graphical processing units (GPU). We provide a mathematical derivation of the wave-based method under consideration, describe some details of our implementation and present simulation results obtained with a numerical phantom designed for a breast cancer detection application. The source code of our GPU implementation is freely available on the web at www.usense.org.
A GPU-Based Gibbs Sampler for a Unidimensional IRT Model.
Sheng, Yanyan; Welling, William S; Zhu, Michelle M
2014-01-01
Item response theory (IRT) is a popular approach used for addressing large-scale statistical problems in psychometrics as well as in other fields. The fully Bayesian approach for estimating IRT models is usually memory and computationally expensive due to the large number of iterations. This limits the use of the procedure in many applications. In an effort to overcome such restraint, previous studies focused on utilizing the message passing interface (MPI) in a distributed memory-based Linux cluster to achieve certain speedups. However, given the high data dependencies in a single Markov chain for IRT models, the communication overhead rapidly grows as the number of cluster nodes increases. This makes it difficult to further improve the performance under such a parallel framework. This study aims to tackle the problem using massive core-based graphic processing units (GPU), which is practical, cost-effective, and convenient in actual applications. The performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and therefore is preferred.
A distributed multi-GPU system for high speed electron microscopic tomographic reconstruction.
Zheng, Shawn Q; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B; Cheng, Yifan; Sedat, John W; Agard, David A
2011-07-01
Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 4096(2) × 512 voxels from an input tilt series containing 122 projection images of 4096(2) pixels (single precision float) takes a total of 1845 s of which 1032 s are for computation with the remainder being the system overhead. The same system takes only 39 s total to reconstruct 1024(2) × 256 voxels from 122 1024(2) pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment.
A GPU-based incompressible Navier-Stokes solver on moving overset grids
NASA Astrophysics Data System (ADS)
Chandar, Dominic D. J.; Sitaraman, Jayanarayanan; Mavriplis, Dimitri J.
2013-07-01
In pursuit of obtaining high fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs) which were originally intended for gaming applications are currently being used to accelerate computational fluid dynamics (CFD) codes. With a high peak throughput of about 1 TFLOPS on a PC, GPUs seem to be favourable for many high-resolution computations. One such computation that involves a lot of number crunching is computing time accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver using several GPU kernels such as gradient, Laplacian and reduction. Some of the simple arithmetic vector calculations are implemented using the CU++: An Object Oriented Framework for Computational Fluid Dynamics Applications using Graphics Processing Units, Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9 approach where GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.
A GPU enhanced approach to identify atomic vacancies in solid materials
NASA Astrophysics Data System (ADS)
Peralta, Joaquín; Loyola, Claudia; Davis, Sergio
2015-08-01
Identification of vacancies in atomic structures plays a crucial role in the characterization of a material, from structural to dynamical properties. In this work we introduce a computationally improved vacancy recognition technique, based in a previous developed search algorithm. The procedure is highly parallel, based in the use of Graphics Processing Unit (GPU), taking advantage of parallel random number generation as well as the use of a large amount of simultaneous threads as available in GPU architecture. This increases the spatial resolution in the sample and the speed during the process of identification of atomic vacancies. The results show an improvement of efficiency up to two orders of magnitude compared to a single CPU. Along with the above a reduction of required parameters with respect to the original algorithm is presented. We show that only the lattice constant and a tunable overlap parameter are enough as input parameters, and that they are also highly related. A study of those parameters is presented, suggesting how the parameter choice must be addressed.
Odyssey: A Public GPU-based Code for General Relativistic Radiative Transfer in Kerr Spacetime
NASA Astrophysics Data System (ADS)
Pu, Hung-Yi; Yun, Kiyun; Younsi, Ziri; Yoon, Suk-Jin
2016-04-01
General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge-Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey_Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.
A Distributed Multi-GPU System for High Speed Electron Microscopic Tomographic Reconstruction
Zheng, Shawn Q.; Branlund, Eric; Kesthelyi, Bettina; Braunfeld, Michael B.; Cheng, Yifan; Sedat, John W.; Agard, David A.
2011-01-01
Full resolution electron microscopic tomographic (EMT) reconstruction of large-scale tilt series requires significant computing power. The desire to perform multiple cycles of iterative reconstruction and realignment dramatically increases the pressing need to improve reconstruction performance. This has motivated us to develop a distributed multi-GPU (graphics processing unit) system to provide the required computing power for rapid constrained, iterative reconstructions of very large three-dimensional (3D) volumes. The participating GPUs reconstruct segments of the volume in parallel, and subsequently, the segments are assembled to form the complete 3D volume. Owing to its power and versatility, the CUDA (NVIDIA, USA) platform was selected for GPU implementation of the EMT reconstruction. For a system containing 10 GPUs provided by 5 GTX295 cards, 10 cycles of SIRT reconstruction for a tomogram of 40962 × 512 voxels from an input tilt series containing 122 projection images of 40962 pixels (single precision float) takes a total of 1845 seconds of which 1032 seconds are for computation with the remainder being the system overhead. The same system takes only 39 seconds total to reconstruct 10242 × 256 voxels from 122 10242 pixel projections. While the system overhead is non-trivial, performance analysis indicates that adding extra GPUs to the system would lead to steadily enhanced overall performance. Therefore, this system can be easily expanded to generate superior computing power for very large tomographic reconstructions and especially to empower iterative cycles of reconstruction and realignment. PMID:21741915
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.
Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart
2011-06-01
The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (with a two-order speedup factor) at a PC-level cost. Specific details of implementing the FDTD method on a GPU architecture have been presented and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity, and 20% reduction in loop coupling compared with conventional array structures of the same size) for small animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied for both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
Peng, Bo; Wang, Yuqi; Hall, Timothy J; Jiang, Jingfeng
2017-01-31
Our primary objective of this work was to extend a previously published 2D coupled sub-sample tracking algorithm for 3D speckle tracking in the framework of ultrasound breast strain elastography. In order to overcome heavy computational cost, we investigated the use of a graphic processing unit (GPU) to accelerate the 3D coupled sub-sample speckle tracking method. The performance of the proposed GPU implementation was tested using a tissue-mimicking (TM) phantom and in vivo breast ultrasound data. The performance of this 3D sub-sample tracking algorithm was compared with the conventional 3D quadratic subsample estimation algorithm. On the basis of these evaluations, we concluded that the GPU implementation of this 3D sub-sample estimation algorithm can provide high-quality strain data (i.e. high correlation between the pre- and the motion-compensated post-deformation RF echo data and high contrast-to-noise ratio strain images), as compared to the conventional 3D quadratic sub-sample algorithm. Using the GPU implementation of the 3D speckle tracking algorithm, volumetric strain data can be achieved relatively fast (approximately 20 seconds per volume [2.5 cm 2.5 cm 2.5 cm]).
Ling, Cheng; Hamada, Tsuyoshi; Gao, Jingyang; Zhao, Guoguang; Sun, Donghong; Shi, Weifeng
2016-01-01
MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost on the likelihood estimation is very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by four Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC(3)++ method (proposed herein) outperforms the tgMC(3) (v1.0), nMC(3) (v2.1.1) and oMC(3) (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC(3)++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.
Araki, Hiromitsu; Takada, Naoki; Niwase, Hiroaki; Ikawa, Shohei; Fujiwara, Masato; Nakayama, Hirotaka; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi
2015-12-01
We propose real-time time-division color electroholography using a single graphics processing unit (GPU) and a simple synchronization system of reference light. To facilitate real-time time-division color electroholography, we developed a light emitting diode (LED) controller with a universal serial bus (USB) module and the drive circuit for reference light. A one-chip RGB LED connected to a personal computer via an LED controller was used as the reference light. A single GPU calculates three computer-generated holograms (CGHs) suitable for red, green, and blue colors in each frame of a three-dimensional (3D) movie. After CGH calculation using a single GPU, the CPU can synchronize the CGH display with the color switching of the one-chip RGB LED via the LED controller. Consequently, we succeeded in real-time time-division color electroholography for a 3D object consisting of around 1000 points per color when an NVIDIA GeForce GTX TITAN was used as the GPU. Furthermore, we implemented the proposed method in various GPUs. The experimental results showed that the proposed method was effective for various GPUs.
Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S
2014-03-11
Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun
2015-10-07
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)
NASA Astrophysics Data System (ADS)
Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun
2015-09-01
Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
NASA Astrophysics Data System (ADS)
Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
2016-02-01
In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system
Ma, Jiasen Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G.
2014-12-15
Purpose: Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. Methods: An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. Results: For relatively large and complex three-field head and neck cases, i.e., >100 000 spots with a target volume of ∼1000 cm{sup 3} and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to a commercial TPS plans based on DVH comparisons. Conclusions: A MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45
Fast GPU-based absolute intensity determination for energy-dispersive X-ray Laue diffraction
NASA Astrophysics Data System (ADS)
Alghabi, F.; Send, S.; Schipper, U.; Abboud, A.; Pietsch, U.; Kolb, A.
2016-01-01
This paper presents a novel method for fast determination of absolute intensities in the sites of Laue spots generated by a tetragonal hen egg-white lysozyme crystal after exposure to white synchrotron radiation during an energy-dispersive X-ray Laue diffraction experiment. The Laue spots are taken by means of an energy-dispersive X-ray 2D pnCCD detector. Current pnCCD detectors have a spatial resolution of 384 × 384 pixels of size 75 × 75 μm2 each and operate at a maximum of 400 Hz. Future devices are going to have higher spatial resolution and frame rates. The proposed method runs on a computer equipped with multiple Graphics Processing Units (GPUs) which provide fast and parallel processing capabilities. Accordingly, our GPU-based algorithm exploits these capabilities to further analyse the Laue spots of the sample. The main contribution of the paper is therefore an alternative algorithm for determining absolute intensities of Laue spots which are themselves computed from a sequence of pnCCD frames. Moreover, a new method for integrating spectral peak intensities and improved background correction, a different way of calculating mean count rate of the background signal and also a new method for n-dimensional Poisson fitting are presented.We present a comparison of the quality of results from the GPU-based algorithm with the quality of results from a prior (base) algorithm running on CPU. This comparison shows that our algorithm is able to produce results with at least the same quality as the base algorithm. Furthermore, the GPU-based algorithm is able to speed up one of the most time-consuming parts of the base algorithm, which is n-dimensional Poisson fitting, by a factor of more than 3. Also, the entire procedure of extracting Laue spots' positions, energies and absolute intensities from a raw dataset of pnCCD frames is accelerated by a factor of more than 3.
A GPU tool for efficient, accurate, and realistic simulation of cone beam CT projections
Jia, Xun; Yan, Hao; Cerviño, Laura; Folkerts, Michael; Jiang, Steve B.
2012-01-01
Purpose: Simulation of x-ray projection images plays an important role in cone beam CT (CBCT) related research projects, such as the design of reconstruction algorithms or scanners. A projection image contains primary signal, scatter signal, and noise. It is computationally demanding to perform accurate and realistic computations for all of these components. In this work, the authors develop a package on graphics processing unit (GPU), called gDRR, for the accurate and efficient computations of x-ray projection images in CBCT under clinically realistic conditions. Methods: The primary signal is computed by a trilinear ray-tracing algorithm. A Monte Carlo (MC) simulation is then performed, yielding the primary signal and the scatter signal, both with noise. A denoising process specifically designed for Poisson noise removal is applied to obtain a smooth scatter signal. The noise component is then obtained by combining the difference between the MC primary and the ray-tracing primary signals, and the difference between the MC simulated scatter and the denoised scatter signals. Finally, a calibration step converts the calculated noise signal into a realistic one by scaling its amplitude according to a specified mAs level. The computations of gDRR include a number of realistic features, e.g., a bowtie filter, a polyenergetic spectrum, and detector response. The implementation is fine-tuned for a GPU platform to yield high computational efficiency. Results: For a typical CBCT projection with a polyenergetic spectrum, the calculation time for the primary signal using the ray-tracing algorithms is 1.2–2.3 s, while the MC simulations take 28.1–95.3 s, depending on the voxel size. Computation time for all other steps is negligible. The ray-tracing primary signal matches well with the primary part of the MC simulation result. The MC simulated scatter signal using gDRR is in agreement with EGSnrc results with a relative difference of 3.8%. A noise calibration process is
NASA Astrophysics Data System (ADS)
Komura, Yukihiro; Okabe, Yutaka
2013-01-01
We present multiple GPU computing with the common unified device architecture (CUDA) for the Swendsen-Wang multi-cluster algorithm of two-dimensional (2D) q-state Potts model. Extending our algorithm for single GPU computing [Y. Komura, Y. Okabe, GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional classical spin systems, Comput. Phys. Comm. 183 (2012) 1155-1161], we realize the GPU computation of the Swendsen-Wang multi-cluster algorithm for multiple GPUs. We implement our code on the large-scale open science supercomputer TSUBAME 2.0, and test the performance and the scalability of the simulation of the 2D Potts model. The performance on Tesla M2050 using 256 GPUs is obtained as 37.3 spin flips per a nano second for the q=2 Potts model (Ising model) at the critical temperature with the linear system size L=65536.
A realization of semi-global matching stereo algorithm on GPU for real-time application
NASA Astrophysics Data System (ADS)
Chen, Bin; Chen, He-ping
2011-11-01
Real-time stereo vision systems have many applications such as automotive and robotics. According to the Middlebury Stereo Database, Semi-Global Matching (SGM) is commonly regarded as the most efficient algorithm among the top-performing stereo algorithms. Recently, most effective real-time implementations of this algorithm are based on reconfigurable hardware (FPGA). However, with the development of General-Purpose computation on Graphics Processing Unit, an effective real-time implementation on general purpose PCs can be expected. In this paper, a real-time SGM realization on Graphics Processing Unit (GPU) is introduced. CUDA, a general purpose parallel computing architecture introduced by NVIDIA in November 2006, has been used to realize the algorithm. Some important optimizations according to CUDA and Fermi (the latest architecture of NVIDA GPUs) are also introduced in this paper.
Innovative approaches to training and qualifying plant personnel at GPU Nuclear
Coe, R.P.
1994-12-31
For the past 10 yr, technical training programs at GPU Nuclear (GPUN) have been highly successful in the training and qualifying of nuclear station personnel at its Oyster Creek and Three Mile Island sites. The programs have received accreditation by the National Academy for Nuclear Training and have successfully reviewed and approved by the US Nuclear Regulatory Commission, American Nuclear Insurers, and various internal oversight groups. Over the past several years, the training and education department at GPUN has attempted to make learning more interesting and meaningful through a series of innovative approaches. Student feedback has been very positive. Plant management has seen an increase in productivity as key learning tasks are reinforced. Failure rates are dwindling, and retention rates appear to be improving. Equally important, instructors are becoming more and more comfortable with these approaches and the increased quality in classroom and laboratory training.
Aerodynamic optimization of supersonic compressor cascade using differential evolution on GPU
NASA Astrophysics Data System (ADS)
Aissa, Mohamed Hasanine; Verstraete, Tom; Vuik, Cornelis
2016-06-01
Differential Evolution (DE) is a powerful stochastic optimization method. Compared to gradient-based algorithms, DE is able to avoid local minima but requires at the same time more function evaluations. In turbomachinery applications, function evaluations are performed with time-consuming CFD simulation, which results in a long, non affordable, design cycle. Modern High Performance Computing systems, especially Graphic Processing Units (GPUs), are able to alleviate this inconvenience by accelerating the design evaluation itself. In this work we present a validated CFD Solver running on GPUs, able to accelerate the design evaluation and thus the entire design process. An achieved speedup of 20x to 30x enabled the DE algorithm to run on a high-end computer instead of a costly large cluster. The GPU-enhanced DE was used to optimize the aerodynamics of a supersonic compressor cascade, achieving an aerodynamic loss minimization of 20%.
gpuSPHASE-A shared memory caching implementation for 2D SPH using CUDA
NASA Astrophysics Data System (ADS)
Winkler, Daniel; Meister, Michael; Rezavand, Massoud; Rauch, Wolfgang
2017-04-01
Smoothed particle hydrodynamics (SPH) is a meshless Lagrangian method that has been successfully applied to computational fluid dynamics (CFD), solid mechanics and many other multi-physics problems. Using the method to solve transport phenomena in process engineering requires the simulation of several days to weeks of physical time. Based on the high computational demand of CFD such simulations in 3D need a computation time of years so that a reduction to a 2D domain is inevitable. In this paper gpuSPHASE, a new open-source 2D SPH solver implementation for graphics devices, is developed. It is optimized for simulations that must be executed with thousands of frames per second to be computed in reasonable time. A novel caching algorithm for Compute Unified Device Architecture (CUDA) shared memory is proposed and implemented. The software is validated and the performance is evaluated for the well established dambreak test case.
Abdellah, Marwan; Eldeib, Ayman; Owis, Mohamed I
2015-01-01
This paper features an advanced implementation of the X-ray rendering algorithm that harnesses the giant computing power of the current commodity graphics processors to accelerate the generation of high resolution digitally reconstructed radiographs (DRRs). The presented pipeline exploits the latest features of NVIDIA Graphics Processing Unit (GPU) architectures, mainly bindless texture objects and dynamic parallelism. The rendering throughput is substantially improved by exploiting the interoperability mechanisms between CUDA and OpenGL. The benchmarks of our optimized rendering pipeline reflect its capability of generating DRRs with resolutions of 2048(2) and 4096(2) at interactive and semi interactive frame-rates using an NVIDIA GeForce 970 GTX device.
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Ronald Babich, Michael Clark, Balint Joo
2010-11-01
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
Atomic detail visualization of photosynthetic membranes with GPU-accelerated ray tracing
Stone, John E.; Sener, Melih; Vandivort, Kirby L.; Barragan, Angela; Singharoy, Abhishek; Teo, Ivan; Ribeiro, Joao V.; Isralewitz, Barry; Liu, Bo; Goh, Boon Chong; Phillips, James C.; MacGregor-Chatwin, Craig; Johnson, Matthew P.; Kourkoutis, Lena F.; Hunter, C. Neil; Schulten, Klaus
2015-12-12
The cellular process responsible for providing energy for most life on Earth, namely, photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. In this paper, we present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. Finally, we describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.
Atomic Detail Visualization of Photosynthetic Membranes with GPU-Accelerated Ray Tracing
Vandivort, Kirby L.; Barragan, Angela; Singharoy, Abhishek; Teo, Ivan; Ribeiro, João V.; Isralewitz, Barry; Liu, Bo; Goh, Boon Chong; Phillips, James C.; MacGregor-Chatwin, Craig; Johnson, Matthew P.; Kourkoutis, Lena F.; Hunter, C. Neil
2016-01-01
The cellular process responsible for providing energy for most life on Earth, namely photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers. PMID:27274603
Atomic detail visualization of photosynthetic membranes with GPU-accelerated ray tracing
Stone, John E.; Sener, Melih; Vandivort, Kirby L.; ...
2015-12-12
The cellular process responsible for providing energy for most life on Earth, namely, photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. In this paper, we present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. Finally, we describemore » the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.« less
GPU-accelerated phase extraction algorithm for interferograms: a real-time application
NASA Astrophysics Data System (ADS)
Zhu, Xiaoqiang; Wu, Yongqian; Liu, Fengwei
2016-11-01
Optical testing, having the merits of non-destruction and high sensitivity, provides a vital guideline for optical manufacturing. But the testing process is often computationally intensive and expensive, usually up to a few seconds, which is sufferable for dynamic testing. In this paper, a GPU-accelerated phase extraction algorithm is proposed, which is based on the advanced iterative algorithm. The accelerated algorithm can extract the right phase-distribution from thirteen 1024x1024 fringe patterns with arbitrary phase shifts in 233 milliseconds on average using NVIDIA Quadro 4000 graphic card, which achieved a 12.7x speedup ratio than the same algorithm executed on CPU and 6.6x speedup ratio than that on Matlab using DWANING W5801 workstation. The performance improvement can fulfill the demand of computational accuracy and real-time application.
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement
Antz, Hartwig; Luszczek, Piotr; Dongarra, Jack; Heuveline, Vinent
2011-12-14
In hardware-aware high performance computing, block- asynchronous iteration and mixed precision iterative refinement are two techniques that are applied to leverage the computing power of SIMD accelerators like GPUs. Although they use a very different approach for this purpose, they share the basic idea of compensating the convergence behaviour of an inferior numerical algorithm by a more efficient usage of the provided computing power. In this paper, we want to analyze the potential of combining both techniques. Therefore, we implement a mixed precision iterative refinement algorithm using a block-asynchronous iteration as an error correction solver, and compare its performance with a pure implementation of a block-asynchronous iteration and an iterative refinement method using double precision for the error correction solver. For matrices from theUniversity of FloridaMatrix collection,we report the convergence behaviour and provide the total solver runtime using different GPU architectures.
NASA Astrophysics Data System (ADS)
Fafchamps, Lionel J.; Neil, Mark A. A.; Juskaitis, Rimas
2013-02-01
Aperture Correlation Microscopy (ACM) is a fluorescence microscopy technique capable of depth resolved imaging and enhanced lateral resolution at real-time acquisition rates. It relies on the subtraction of 2 separate images from different cameras which must be registered to the sub-pixel level. In order to achieve real-time registration and subtraction, the graphics processing unit (GPU) is used to apply a transformation from one frame to the other, resulting in a system capable of processing over 200 frames per second on modest hardware (GeForce 330M). Currently, this rate is limited by camera acquision to 16fps. Additionally, a novel reflection mode correlation microscope is introduced which functions on similar principles as the fluorescent system but can be used to examine reflective samples. Images and z-stacks taken with this system are presented here.
GPU-accelerated Block Matching Algorithm for Deformable Registration of Lung CT Images
Li, Min; Xiang, Zhikang; Xiao, Liang; Castillo, Edward; Castillo, Richard; Guerrero, Thomas
2016-01-01
Deformable registration (DR) is a key technology in the medical field. However, many of the existing DR methods are time-consuming and the registration accuracy needs to be improved, which prevents their clinical applications. In this study, we propose a parallel block matching algorithm for lung CT image registration, in which the sum of squared difference metric is modified as the cost function and the moving least squares approach is used to generate the full displacement field. The algorithm is implemented on Graphic Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Results show that the proposed parallel block matching method achieves a fast runtime while maintaining an average registration error (standard deviation) of 1.08 (0.69) mm. PMID:28042622
NASA Astrophysics Data System (ADS)
Wang, Wei-Hsiang; Fu, Wu-Shung; Tsubokura, Makoto
2016-11-01
Unstable phenomena of low speed compressible natural convection are investigated numerically. Geometry contains parallel square plates or single heated plate with open boundaries is taken into consideration. Numerical methods of the Roe scheme, preconditioning and dual time stepping matching the DP-LUR method are used for low speed compressible flow. The absorbing boundary condition and modified LODI method is adopted to solve open boundary problems. High performance parallel computation is achieved by multi-GPU implementation with CUDA platform. The effects of natural convection by isothermal plates facing upwards in air is then carried out by the methods mentioned above Unstable behaviors appeared upon certain Rayleigh number with characteristic length respect to the width of plates or height between plates.
Performance improvements of differential operators code for MPS method on GPU
NASA Astrophysics Data System (ADS)
Murotani, Kohei; Masaie, Issei; Matsunaga, Takuya; Koshizuka, Seiichi; Shioya, Ryuji; Ogino, Masao; Fujisawa, Toshimitsu
2015-09-01
In the present study, performance improvements of the particle search and particle interaction calculation steps constituting the performance bottleneck in the moving particle simulation method are achieved by developing GPU-compatible algorithms for many core processor architectures. In the improvements of particle search, bucket loops of the cell-linked list are changed to a loop structure having fewer local variables and the linked list and the forward star of particle search algorithms within a bucket are compared. In the particle interaction calculation, the problem of the ratio of particles within the interaction domain to the neighboring particle candidates being quite low is improved. By these improvements, a performance efficiency of 24.7 % can be achieved for the first-order polynomial approximation scheme using NVIDIA Tesla K20, CUDA-6.5, and double-precision floating-point operations.
GPU-accelerated Block Matching Algorithm for Deformable Registration of Lung CT Images.
Li, Min; Xiang, Zhikang; Xiao, Liang; Castillo, Edward; Castillo, Richard; Guerrero, Thomas
2015-12-01
Deformable registration (DR) is a key technology in the medical field. However, many of the existing DR methods are time-consuming and the registration accuracy needs to be improved, which prevents their clinical applications. In this study, we propose a parallel block matching algorithm for lung CT image registration, in which the sum of squared difference metric is modified as the cost function and the moving least squares approach is used to generate the full displacement field. The algorithm is implemented on Graphic Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Results show that the proposed parallel block matching method achieves a fast runtime while maintaining an average registration error (standard deviation) of 1.08 (0.69) mm.
XaNSoNS: GPU-accelerated simulator of diffraction patterns of nanoparticles
NASA Astrophysics Data System (ADS)
Neverov, V. S.
XaNSoNS is an open source software with GPU support, which simulates X-ray and neutron 1D (or 2D) diffraction patterns and pair-distribution functions (PDF) for amorphous or crystalline nanoparticles (up to ∼107 atoms) of heterogeneous structural content. Among the multiple parameters of the structure the user may specify atomic displacements, site occupancies, molecular displacements and molecular rotations. The software uses general equations nonspecific to crystalline structures to calculate the scattering intensity. It supports four major standards of parallel computing: MPI, OpenMP, Nvidia CUDA and OpenCL, enabling it to run on various architectures, from CPU-based HPCs to consumer-level GPUs.
GPU-enabled particle-particle particle-tree scheme for simulating dense stellar cluster system
NASA Astrophysics Data System (ADS)
Iwasawa, Masaki; Portegies Zwart, Simon; Makino, Junichiro
2015-07-01
We describe the implementation and performance of the (Particle-Particle Particle-Tree) scheme for simulating dense stellar systems. In , the force experienced by a particle is split into short-range and long-range contributions. Short-range forces are evaluated by direct summation and integrated with the fourth order Hermite predictor-corrector method with the block timesteps. For long-range forces, we use a combination of the Barnes-Hut tree code and the leapfrog integrator. The tree part of our simulation environment is accelerated using graphical processing units (GPU), whereas the direct summation is carried out on the host CPU. Our code gives excellent performance and accuracy for star cluster simulations with a large number of particles even when the core size of the star cluster is small.
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger
2009-01-01
Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. PMID:19564686
Optimized analysis of isotropic high-nuclearity spin clusters with GPU acceleration
NASA Astrophysics Data System (ADS)
Lamas Daviña, A.; Ramos, E.; Roman, J. E.
2016-12-01
The numerical simulation of molecular clusters formed by a finite number of exchange-coupled paramagnetic centers is very relevant for many applications, modeling systems between molecules and extended solids. In the context of realistic scenarios, many centers need to be considered, and thus the required computational effort grows very fast. In a previous work (Ramos et al., 2010), a set of parallel programs were presented with standard message-passing parallelization (MPI) for both anisotropic and isotropic systems. In this work, we have further developed the code for isotropic models. On one hand, the computational cost has been significantly reduced by avoiding some of the matrix diagonalizations, corresponding to blocks with negligible contribution for the particular configuration. On the other hand, we have extended the parallelization in order to exploit available graphics processing units (GPUs). The new MPI-GPU paradigm reduces the computational time by at least one additional order of magnitude and enables the resolution of larger problems.
Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters
NASA Astrophysics Data System (ADS)
Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.
2014-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.
Free energy simulations with the AMOEBA polarizable force field and metadynamics on GPU platform.
Peng, Xiangda; Zhang, Yuebin; Chu, Huiying; Li, Guohui
2016-03-05
The free energy calculation library PLUMED has been incorporated into the OpenMM simulation toolkit, with the purpose to perform enhanced sampling MD simulations using the AMOEBA polarizable force field on GPU platform. Two examples, (I) the free energy profile of water pair separation (II) alanine dipeptide dihedral angle free energy surface in explicit solvent, are provided here to demonstrate the accuracy and efficiency of our implementation. The converged free energy profiles could be obtained within an affordable MD simulation time when the AMOEBA polarizable force field is employed. Moreover, the free energy surfaces estimated using the AMOEBA polarizable force field are in agreement with those calculated from experimental data and ab initio methods. Hence, the implementation in this work is reliable and would be utilized to study more complicated biological phenomena in both an accurate and efficient way. © 2015 Wiley Periodicals, Inc.
Modelling Nonlinear Dynamic Textures using Hybrid DWT-DCT and Kernel PCA with GPU
NASA Astrophysics Data System (ADS)
Ghadekar, Premanand Pralhad; Chopade, Nilkanth Bhikaji
2016-12-01
Most of the real-world dynamic textures are nonlinear, non-stationary, and irregular. Nonlinear motion also has some repetition of motion, but it exhibits high variation, stochasticity, and randomness. Hybrid DWT-DCT and Kernel Principal Component Analysis (KPCA) with YCbCr/YIQ colour coding using the Dynamic Texture Unit (DTU) approach is proposed to model a nonlinear dynamic texture, which provides better results than state-of-art methods in terms of PSNR, compression ratio, model coefficients, and model size. Dynamic texture is decomposed into DTUs as they help to extract temporal self-similarity. Hybrid DWT-DCT is used to extract spatial redundancy. YCbCr/YIQ colour encoding is performed to capture chromatic correlation. KPCA is applied to capture nonlinear motion. Further, the proposed algorithm is implemented on Graphics Processing Unit (GPU), which comprise of hundreds of small processors to decrease time complexity and to achieve parallelism.
SU-E-T-806: Very Fast GPU-Based IMPT Dose Computation
Sullivan, A; Brand, M
2015-06-15
Purpose: Designing particle therapy treatment plans is a dosimetrist-in-the-loop optimization wherein the conflicting constraints of achieving a desired tumor dose distribution must be balanced against the need to minimize the dose to nearby OARs. IMPT introduces an additional, inner, numerical optimization step in which the dosimetrist’s current set of constraints are used to determine the weighting of beam spots. Very fast dose calculations are needed to enable the dosimetrist to perform many iterations of the outer optimization in a commercially reasonable time. Methods: We have developed a GPU-based convolution-type dose computation algorithm that more accurately handles heterogeneities than earlier algorithms by redistributing energy from dose computed in a water volume. The depth dependence of the beam size is handled by pre-processing Bragg curves using a weighted superposition of Gaussian bases. Additionally, scattering, the orientation of treatment ports, and the non-parallel propagation of beams are handled by large, but sparse, energy-redistribution matrices that implement affine transforms. Results: We tested our algorithm using a brain tumor dataset with 1 mm voxels and a single treatment port from the patient’s anterior through the sinuses. The resulting dose volume is 100 × 100 × 230 mm with 66,200 beam spots on a 3 × 3 × 2 mm grid. The dose computation takes <1 msec on a GeForce GTX Titan GPU with the Gamma passing rate for 2mm/2% criterion of 99.1% compared to dose calculated by an alternative dose algorithm based on pencil beams. We will present comparisons to Monte Carlo dose calculations. Conclusion: Our high-speed dose computation method enables the IMPT spot weights to be optimized in <1 second, resulting in a nearly instantaneous response to user changes to dose constraints. This permits the creation of higher quality plans by allowing the dosimetrist to evaluate more alternatives in a short period of time.
GPU-based interactive cut-surface extraction from high-order finite element fields.
Nelson, Blake; Haimes, Robert; Kirby, Robert M
2011-12-01
We present a GPU-based ray-tracing system for the accurate and interactive visualization of cut-surfaces through 3D simulations of physical processes created from spectral/hp high-order finite element methods. When used by the numerical analyst to debug the solver, the ability for the imagery to precisely reflect the data is critical. In practice, the investigator interactively selects from a palette of visualization tools to construct a scene that can answer a query of the data. This is effective as long as the implicit contract of image quality between the individual and the visualization system is upheld. OpenGL rendering of scientific visualizations has worked remarkably well for exploratory visualization for most solver results. This is due to the consistency between the use of first-order representations in the simulation and the linear assumptions inherent in OpenGL (planar fragments and color-space interpolation). Unfortunately, the contract is broken when the solver discretization is of higher-order. There have been attempts to mitigate this through the use of spatial adaptation and/or texture mapping. These methods do a better job of approximating what the imagery should be but are not exact and tend to be view-dependent. This paper introduces new rendering mechanisms that specifically deal with the kinds of native data generated by high-order finite element solvers. The exploratory visualization tools are reassessed and cast in this system with the focus on image accuracy. This is accomplished in a GPU setting to ensure interactivity.
Real-time unmanned aircraft systems surveillance video mosaicking using GPU
NASA Astrophysics Data System (ADS)
Camargo, Aldo; Anderson, Kyle; Wang, Yi; Schultz, Richard R.; Fevig, Ronald A.
2010-04-01
Digital video mosaicking from Unmanned Aircraft Systems (UAS) is being used for many military and civilian applications, including surveillance, target recognition, border protection, forest fire monitoring, traffic control on highways, monitoring of transmission lines, among others. Additionally, NASA is using digital video mosaicking to explore the moon and planets such as Mars. In order to compute a "good" mosaic from video captured by a UAS, the algorithm must deal with motion blur, frame-to-frame jitter associated with an imperfectly stabilized platform, perspective changes as the camera tilts in flight, as well as a number of other factors. The most suitable algorithms use SIFT (Scale-Invariant Feature Transform) to detect the features consistent between video frames. Utilizing these features, the next step is to estimate the homography between two consecutives video frames, perform warping to properly register the image data, and finally blend the video frames resulting in a seamless video mosaick. All this processing takes a great deal of resources of resources from the CPU, so it is almost impossible to compute a real time video mosaic on a single processor. Modern graphics processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing for real-time operation. This paper presents the development of a GPU-accelerated digital video mosaicking implementation and compares it with CPU performance. Our tests are based on two sets of real video captured by a small UAS aircraft; one video comes from Infrared (IR) and Electro-Optical (EO) cameras. Our results show that we can obtain a speed-up of more than 50 times using GPU technology, so real-time operation at a video capture of 30 frames per second is feasible.
Advanced noise reduction in placental ultrasound imaging using CPU and GPU: a comparative study
NASA Astrophysics Data System (ADS)
Zombori, G.; Ryan, J.; McAuliffe, F.; Rainford, L.; Moran, M.; Brennan, P.
2010-03-01
This paper presents a comparison of different implementations of 3D anisotropic diffusion speckle noise reduction technique on ultrasound images. In this project we are developing a novel volumetric calcification assessment metric for the placenta, and providing a software tool for this purpose. The tool can also automatically segment and visualize (in 3D) ultrasound data. One of the first steps when developing such a tool is to find a fast and efficient way to eliminate speckle noise. Previous works on this topic by Duan, Q. [1] and Sun, Q. [2] have proven that the 3D noise reducing anisotropic diffusion (3D SRAD) method shows exceptional performance in enhancing ultrasound images for object segmentation. Therefore we have implemented this method in our software application and performed a comparative study on the different variants in terms of performance and computation time. To increase processing speed it was necessary to utilize the full potential of current state of the art Graphics Processing Units (GPUs). Our 3D datasets are represented in a spherical volume format. With the aim of 2D slice visualization and segmentation, a "scan conversion" or "slice-reconstruction" step is needed, which includes coordinate transformation from spherical to Cartesian, re-sampling of the volume and interpolation. Combining the noise filtering and slice reconstruction in one process on the GPU, we can achieve close to real-time operation on high quality data sets without the need for down-sampling or reducing image quality. For the GPU programming OpenCL language was used. Therefore the presented solution is fully portable.
GPU-based iterative cone-beam CT reconstruction using tight frame regularization
NASA Astrophysics Data System (ADS)
Jia, Xun; Dong, Bin; Lou, Yifei; Jiang, Steve B.
2011-07-01
The x-ray imaging dose from serial cone-beam computed tomography (CBCT) scans raises a clinical concern in most image-guided radiation therapy procedures. It is the goal of this paper to develop a fast graphic processing unit (GPU)-based algorithm to reconstruct high-quality CBCT images from undersampled and noisy projection data so as to lower the imaging dose. For this purpose, we have developed an iterative tight-frame (TF)-based CBCT reconstruction algorithm. A condition that a real CBCT image has a sparse representation under a TF basis is imposed in the iteration process as regularization to the solution. To speed up the computation, a multi-grid method is employed. Our GPU implementation has achieved high computational efficiency and a CBCT image of resolution 512 × 512 × 70 can be reconstructed in ~5 min. We have tested our algorithm on a digital NCAT phantom and a physical Catphan phantom. It is found that our TF-based algorithm is able to reconstruct CBCT in the context of undersampling and low mAs levels. We have also quantitatively analyzed the reconstructed CBCT image quality in terms of the modulation-transfer function and contrast-to-noise ratio under various scanning conditions. The results confirm the high CBCT image quality obtained from our TF algorithm. Moreover, our algorithm has also been validated in a real clinical context using a head-and-neck patient case. Comparisons of the developed TF algorithm and the current state-of-the-art TV algorithm have also been made in various cases studied in terms of reconstructed image quality and computation efficiency.
NASA Astrophysics Data System (ADS)
Paz, Abel; Plaza, Antonio
2010-08-01
Automatic target and anomaly detection are considered very important tasks for hyperspectral data exploitation. These techniques are now routinely applied in many application domains, including defence and intelligence, public safety, precision agriculture, geology, or forestry. Many of these applications require timely responses for swift decisions which depend upon high computing performance of algorithm analysis. However, with the recent explosion in the amount and dimensionality of hyperspectral imagery, this problem calls for the incorporation of parallel computing techniques. In the past, clusters of computers have offered an attractive solution for fast anomaly and target detection in hyperspectral data sets already transmitted to Earth. However, these systems are expensive and difficult to adapt to on-board data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in (near) real-time, i.e., at the same time as the data is collected by the sensor. An exciting new development in the field of commodity computing is the emergence of commodity graphics processing units (GPUs), which can now bridge the gap towards on-board processing of remotely sensed hyperspectral data. In this paper, we describe several new GPU-based implementations of target and anomaly detection algorithms for hyperspectral data exploitation. The parallel algorithms are implemented on latest-generation Tesla C1060 GPU architectures, and quantitatively evaluated using hyperspectral data collected by NASA's AVIRIS system over the World Trade Center (WTC) in New York, five days after the terrorist attacks that collapsed the two main towers in the WTC complex.
Nguyen, Trung Dac; Carrillo, Jan-Michael Y; Dobrynin, Andrey V; Brown, W Michael
2013-01-08
Numerous issues have disrupted the trend for increasing computational performance with faster CPU clock frequencies. In order to exploit the potential performance of new computers, it is becoming increasingly desirable to re-evaluate computational physics methods and models with an eye toward approaches that allow for increased concurrency and data locality. The evaluation of long-range Coulombic interactions is a common bottleneck for molecular dynamics simulations. Enhanced truncation approaches have been proposed as an alternative method and are particularly well-suited for many-core architectures and GPUs due to the inherent fine-grain parallelism that can be exploited. In this paper, we compare efficient truncation-based approximations to evaluation of electrostatic forces with the more traditional particle-particle particle-mesh (P(3)M) method for the molecular dynamics simulation of polyelectrolyte brush layers. We show that with the use of GPU accelerators, large parallel simulations using P(3)M can be greater than 3 times faster due to a reduction in the mesh-size required. Alternatively, using a truncation-based scheme can improve performance even further. This approach can be up to 3.9 times faster than GPU-accelerated P(3)M for many polymer systems and results in accurate calculation of shear velocities and disjoining pressures for brush layers. For configurations with highly nonuniform charge distributions, however, we find that it is more efficient to use P(3)M; for these systems, computationally efficient parametrizations of the truncation-based approach do not produce accurate counterion density profiles or brush morphologies.
Phillips, Carolyn L.; Anderson, Joshua A.; Glotzer, Sharon C.
2011-08-10
Highlights: {yields} Molecular Dynamics codes implemented on GPUs have achieved two-order of magnitude computational accelerations. {yields} Brownian Dynamics and Dissipative Particle Dynamics simulations require a large number of random numbers per time step. {yields} We introduce a method for generating small batches of pseudorandom numbers distributed over many threads of calculations. {yields} With this method, Dissipative Particle Dynamics is implemented on a GPU device without requiring thread-to-thread communication. - Abstract: Brownian Dynamics (BD), also known as Langevin Dynamics, and Dissipative Particle Dynamics (DPD) are implicit solvent methods commonly used in models of soft matter and biomolecular systems. The interaction of the numerous solvent particles with larger particles is coarse-grained as a Langevin thermostat is applied to individual particles or to particle pairs. The Langevin thermostat requires a pseudo-random number generator (PRNG) to generate the stochastic force applied to each particle or pair of neighboring particles during each time step in the integration of Newton's equations of motion. In a Single-Instruction-Multiple-Thread (SIMT) GPU parallel computing environment, small batches of random numbers must be generated over thousands of threads and millions of kernel calls. In this communication we introduce a one-PRNG-per-kernel-call-per-thread scheme, in which a micro-stream of pseudorandom numbers is generated in each thread and kernel call. These high quality, statistically robust micro-streams require no global memory for state storage, are more computationally efficient than other PRNG schemes in memory-bound kernels, and uniquely enable the DPD simulation method without requiring communication between threads.
High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms
Teodoro, George; Pan, Tony; Kurc, Tahsin M.; Kong, Jun; Cooper, Lee A. D.; Podhorszki, Norbert; Klasky, Scott; Saltz, Joel H.
2014-01-01
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance aware scheduling techniques along with several optimizations, including architecture aware process placement, data locality conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system. PMID:25419546
cudaMap: a GPU accelerated program for gene expression connectivity mapping
2013-01-01
Background Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. Results cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Conclusion Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http
NASA Astrophysics Data System (ADS)
Topping, T. Russell; French, James; Hancock, Monte F., Jr.
2010-04-01
Working with the Naval Research Laboratory, Celestech has implemented advanced non-linear hyperspectral image (HSI) processing algorithms optimized for Graphics Processing Units (GPU). These algorithms have demonstrated performance improvements of nearly 2 orders of magnitude over optimal CPU-based implementations. The paper briefly covers the architecture of the NIVIDIA GPU to provide a basis for discussing GPU optimization challenges and strategies. The paper then covers optimization approaches employed to extract performance from the GPU implementation of Dr. Bachmann's algorithms including memory utilization and process thread optimization considerations. The paper goes on to discuss strategies for deploying GPU-enabled servers into enterprise service oriented architectures. Also discussed are Celestech's on-going work in the area of middleware frameworks to provide an optimized multi-GPU utilization and scheduling approach that supports both multiple GPUs in a single computer as well as across multiple computers. This paper is a complementary work to the paper submitted by Dr. Charles Bachmann entitled "A Scalable Approach to Modeling Nonlinear Structure in Hyperspectral Imagery and Other High-Dimensional Data Using Manifold Coordinate Representations". Dr. Bachmann's paper covers the algorithmic and theoretical basis for the HSI processing approach.
Busca de estruturas em grandes escalas em altos redshifts
NASA Astrophysics Data System (ADS)
Boris, N. V.; Sodrã©, L., Jr.; Cypriano, E.
2003-08-01
A busca por estruturas em grandes escalas (aglomerados de galáxias, por exemplo) é um ativo tópico de pesquisas hoje em dia, pois a detecção de um único aglomerado em altos redshifts pode por vínculos fortes sobre os modelos cosmológicos. Neste projeto estamos fazendo uma busca de estruturas distantes em campos contendo pares de quasares próximos entre si em zÂ Â3Â 0.9. Os pares de quasares foram extraídos do catálogo de Véron-Cetty & Véron (2001) e estão sendo observados com os telescópios: 2,2m da University of Hawaii (UH), 2,5m do Observatório de Las Campanas e com o GEMINI. Apresentamos aqui a análise preliminar de um par de quasares observado nos filtros i'(7800 Å) e z'(9500 Å) com o GEMINI. A cor (i'-z') mostrou-se útil para detectar objetos "early-type" em redshifts menores que 1.1. No estudo do par 131046+0006/J131055+0008, com redshift ~ 0.9, o uso deste método possibilitou a detecção de sete objetos candidatos a galáxias "early-type". Num mapa da distribuição projetada dos objetos para 22 < i' < 25 observou-se que estas galáxias estão localizadas próximas a um dos quasares e há indícios de que estejam aglomeradas dentro de um área de ~ 6 arcmin2. Se esse for o caso, estes objetos seriam membros de uma estrutura em grande escala. Um outro argumento em favor dessa hipótese é que eles obedecem uma relação do tipo Kormendy (raio equivalente X brilho superficial dentro desse raio), como a apresentada pelas galáxias elípticas em z = 0.
A GPU-based large-scale Monte Carlo simulation method for systems with long-range interactions
NASA Astrophysics Data System (ADS)
Liang, Yihao; Xing, Xiangjun; Li, Yaohang
2017-06-01
In this work we present an efficient implementation of Canonical Monte Carlo simulation for Coulomb many body systems on graphics processing units (GPU). Our method takes advantage of the GPU Single Instruction, Multiple Data (SIMD) architectures, and adopts the sequential updating scheme of Metropolis algorithm. It makes no approximation in the computation of energy, and reaches a remarkable 440-fold speedup, compared with the serial implementation on CPU. We further use this method to simulate primitive model electrolytes, and measure very precisely all ion-ion pair correlation functions at high concentrations. From these data, we extract the renormalized Debye length, renormalized valences of constituent ions, and renormalized dielectric constants. These results demonstrate unequivocally physics beyond the classical Poisson-Boltzmann theory.
NASA Astrophysics Data System (ADS)
Yi, Xi; Chen, Weiting; Wu, Linhui; Zhang, Wei; Li, Jiao; Wang, Xin; Zhang, Limin; Zhao, Huijuan; Gao, Feng
2013-03-01
At present, the most widely accepted forward model in diffuse optical tomography (DOT) is the diffusion equation, which is derived from the radiative transfer equation by employing the P1 approximation. However, due to its validity restricted to highly scattering regions, this model has several limitations for the whole-body imaging of small-animals, where some cavity and low scattering areas exist. To overcome the difficulty, we presented a Graphic-Processing- Unit(GPU) implementation of Monte-Carlo (MC) modeling for photon migration in arbitrarily heterogeneous turbid medium, and, based on this GPU-accelerated MC forward calculation, developed a fast, universal DOT image reconstruction algorithm. We experimentally validated the proposed method using a continuous-wave DOT system in the photon-counting mode and a cylindrical phantom with a cavity inclusion.
NASA Astrophysics Data System (ADS)
Dosopoulos, Stylianos; Gardiner, Judith D.; Lee, Jin-Fa
2011-06-01
In this paper we discuss our approach to the MPI/GPU implementation of an Interior Penalty Discontinuous Galerkin Time domain (IPDGTD) method to solve the time dependent Maxwell's equations. In our approach, we exploit the inherent DGTD parallelism and describe a combined MPI/GPU and local time stepping implementation. This combination is aimed at increasing efficiency and reducing computational time, especially for multiscale applications. The CUDA programming model was used, together with non-blocking MPI calls to overlap communications across the network. A 10× speedup compared to CPU clusters is observed for double precision arithmetic. Finally, for p = 1 basis functions, a good scalability with parallelization efficiency of 85% for up to 40 GPUs and 80% for up to 160 CPU cores was achieved on the Ohio Supercomputer Center's Glenn cluster.
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
NASA Astrophysics Data System (ADS)
Ammendola, Roberto; Biagioni, Andrea; Fantechi, Riccardo; Frezza, Ottorino; Lamanna, Gianluca; Lo Cicero, Francesca; Lonardo, Alessandro; Stanislao Paolucci, Pier; Pantaleo, Felice; Piandani, Roberto; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Tosoratto, Laura; Vicini, Piero
2014-06-01
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is able to receive a UDP input data stream from its GbE interface and redirect it, without any intermediate buffering or CPU intervention, to the memory of a Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices share the same upstream root complex. Synthetic benchmarks for latency and bandwidth are presented. We describe how NaNet can be employed in the prototype of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment, to implement the data link between the TEL62 readout boards and the low level trigger processor. Results for the throughput and latency of the integrated system are presented and discussed.
SU-E-T-37: A GPU-Based Pencil Beam Algorithm for Dose Calculations in Proton Radiation Therapy
Kalantzis, G; Leventouri, T; Tachibana, H; Shang, C
2015-06-15
Purpose: Recent developments in radiation therapy have been focused on applications of charged particles, especially protons. Over the years several dose calculation methods have been proposed in proton therapy. A common characteristic of all these methods is their extensive computational burden. In the current study we present for the first time, to our best knowledge, a GPU-based PBA for proton dose calculations in Matlab. Methods: In the current study we employed an analytical expression for the protons depth dose distribution. The central-axis term is taken from the broad-beam central-axis depth dose in water modified by an inverse square correction while the distribution of the off-axis term was considered Gaussian. The serial code was implemented in MATLAB and was launched on a desktop with a quad core Intel Xeon X5550 at 2.67GHz with 8 GB of RAM. For the parallelization on the GPU, the parallel computing toolbox was employed and the code was launched on a GTX 770 with Kepler architecture. The performance comparison was established on the speedup factors. Results: The performance of the GPU code was evaluated for three different energies: low (50 MeV), medium (100 MeV) and high (150 MeV). Four square fields were selected for each energy, and the dose calculations were performed with both the serial and parallel codes for a homogeneous water phantom with size 300×300×300 mm3. The resolution of the PBs was set to 1.0 mm. The maximum speedup of ∼127 was achieved for the highest energy and the largest field size. Conclusion: A GPU-based PB algorithm for proton dose calculations in Matlab was presented. A maximum speedup of ∼127 was achieved. Future directions of the current work include extension of our method for dose calculation in heterogeneous phantoms.
Li, Yongbao; Tian, Zhen; Song, Ting; Wu, Zhaoxia; Liu, Yaqiang; Jiang, Steve; Jia, Xun
2017-01-07
Monte Carlo (MC)-based spot dose calculation is highly desired for inverse treatment planning in proton therapy because of its accuracy. Recent studies on biological optimization have also indicated the use of MC methods to compute relevant quantities of interest, e.g. linear energy transfer. Although GPU-based MC engines have been developed to address inverse optimization problems, their efficiency still needs to be improved. Also, the use of a large number of GPUs in MC calculation is not favorable for clinical applications. The previously proposed adaptive particle sampling (APS) method can improve the efficiency of MC-based inverse optimization by using the computationally expensive MC simulation more effectively. This method is more efficient than the conventional approach that performs spot dose calculation and optimization in two sequential steps. In this paper, we propose a computational library to perform MC-based spot dose calculation on GPU with the APS scheme. The implemented APS method performs a non-uniform sampling of the particles from pencil beam spots during the optimization process, favoring those from the high intensity spots. The library also conducts two computationally intensive matrix-vector operations frequently used when solving an optimization problem. This library design allows a streamlined integration of the MC-based spot dose calculation into an existing proton therapy inverse planning process. We tested the developed library in a typical inverse optimization system with four patient cases. The library achieved the targeted functions by supporting inverse planning in various proton therapy schemes, e.g. single field uniform dose, 3D intensity modulated proton therapy, and distal edge tracking. The efficiency was 41.6 ± 15.3% higher than the use of a GPU-based MC package in a conventional calculation scheme. The total computation time ranged between 2 and 50 min on a single GPU card depending on the problem size.
NASA Astrophysics Data System (ADS)
Kauffmann, Claude; Tang, An; Therasse, Eric; Soulez, Gilles
2010-03-01
We developed a hybrid CPU-GPU framework enabling semi-automated segmentation of abdominal aortic aneurysm (AAA) on Computed Tomography Angiography (CTA) examinations. AAA maximal diameter (D-max) and volume measurements and their progression between 2 examinations can be generated by this software improving patient followup. In order to improve the workflow efficiency some segmentation tasks were implemented and executed on the graphics processing unit (GPU). A GPU based algorithm is used to automatically segment the lumen of the aneurysm within short computing time. In a second step, the user interacted with the software to validate the boundaries of the intra-luminal thrombus (ILT) on GPU-based curved image reformation. Automatic computation of D-max and volume were performed on the 3D AAA model. Clinical validation was conducted on 34 patients having 2 consecutive MDCT examinations within a minimum interval of 6 months. The AAA segmentation was performed twice by a experienced radiologist (reference standard) and once by 3 unsupervised technologists on all 68 MDCT. The ICC for intra-observer reproducibility was 0.992 (>=0.987) for D-max and 0.998 (>=0.994) for volume measurement. The ICC for inter-observer reproducibility was 0.985 (0.977-0.90) for D-max and 0.998 (0.996- 0.999) for volume measurement. Semi-automated AAA segmentation for volume follow-up was more than twice as sensitive than D-max follow-up, while providing an equivalent reproducibility.
Neylon, J; Qi, S; Sheng, K; Kupelian, P; Santhanam, A
2014-06-15
Purpose: To develop a GPU-based framework that can generate highresolution and patient-specific biomechanical models from a given simulation CT and contoured structures, optimized to run at interactive speeds, for addressing adaptive radiotherapy objectives. Method: A Massspring-damping (MSD) model was generated from a given simulation CT. The model's mass elements were generated for every voxel of anatomy, and positioned in a deformation space in the GPU memory. MSD connections were established between neighboring mass elements in a dense distribution. Contoured internal structures allowed control over elastic material properties of different tissues. Once the model was initialized in GPU memory, skeletal anatomy was actuated using rigid-body transformations, while soft tissues were governed by elastic corrective forces and constraints, which included tensile forces, shear forces, and spring damping forces. The model was validated by applying a known load to a soft tissue block and comparing the observed deformation to ground truth calculations from established elastic mechanics. Results: Our analyses showed that both local and global load experiments yielded results with a correlation coefficient R{sup 2} > 0.98 compared to ground truth. Models were generated for several anatomical regions. Head and neck models accurately simulated posture changes by rotating the skeletal anatomy in three dimensions. Pelvic models were developed for realistic deformations for changes in bladder volume. Thoracic models demonstrated breast deformation due to gravity when changing treatment position from supine to prone. The GPU framework performed at greater than 30 iterations per second for over 1 million mass elements with up to 26 MSD connections each. Conclusions: Realistic simulations of site-specific, complex posture and physiological changes were simulated at interactive speeds using patient data. Incorporating such a model with live patient tracking would facilitate real
NASA Astrophysics Data System (ADS)
Li, Yongbao; Tian, Zhen; Song, Ting; Wu, Zhaoxia; Liu, Yaqiang; Jiang, Steve; Jia, Xun
2017-01-01
Monte Carlo (MC)-based spot dose calculation is highly desired for inverse treatment planning in proton therapy because of its accuracy. Recent studies on biological optimization have also indicated the use of MC methods to compute relevant quantities of interest, e.g. linear energy transfer. Although GPU-based MC engines have been developed to address inverse optimization problems, their efficiency still needs to be improved. Also, the use of a large number of GPUs in MC calculation is not favorable for clinical applications. The previously proposed adaptive particle sampling (APS) method can improve the efficiency of MC-based inverse optimization by using the computationally expensive MC simulation more effectively. This method is more efficient than the conventional approach that performs spot dose calculation and optimization in two sequential steps. In this paper, we propose a computational library to perform MC-based spot dose calculation on GPU with the APS scheme. The implemented APS method performs a non-uniform sampling of the particles from pencil beam spots during the optimization process, favoring those from the high intensity spots. The library also conducts two computationally intensive matrix-vector operations frequently used when solving an optimization problem. This library design allows a streamlined integration of the MC-based spot dose calculation into an existing proton therapy inverse planning process. We tested the developed library in a typical inverse optimization system with four patient cases. The library achieved the targeted functions by supporting inverse planning in various proton therapy schemes, e.g. single field uniform dose, 3D intensity modulated proton therapy, and distal edge tracking. The efficiency was 41.6 ± 15.3% higher than the use of a GPU-based MC package in a conventional calculation scheme. The total computation time ranged between 2 and 50 min on a single GPU card depending on the problem size.
NASA Astrophysics Data System (ADS)
Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.
2010-03-01
Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized algorithm of imaging processing that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to rapidly integrate in an electronic patient record or any disease-centric health care system.
A real-time autostereoscopic display method based on partial sub-pixel by general GPU processing
NASA Astrophysics Data System (ADS)
Chen, Duo; Sang, Xinzhu; Cai, Yuanfa
2013-08-01
With the progress of 3D technology, the huge computing capacity for the real-time autostereoscopic display is required. Because of complicated sub-pixel allocating, masks providing arranged sub-pixels are fabricated to reduce real-time computation. However, the binary mask has inherent drawbacks. In order to solve these problems, weighted masks are used in displaying based on partial sub-pixel. Nevertheless, the corresponding computations will be tremendously growing and unbearable for CPU. To improve calculating speed, Graphics Processing Unit (GPU) processing with parallel computing ability is adopted. Here the principle of partial sub-pixel is presented, and the texture array of Direct3D 10 is used to increase the number of computable textures. When dealing with a HD display and multi-viewpoints, a low level GPU is still able to permit a fluent real time displaying, while the performance of high level CPU is really not acceptable. Meanwhile, after using texture array, the performance of D3D10 could be double, and sometimes be triple faster than D3D9. There are several distinguishing features for the proposed method, such as the good portability, less overhead and good stability. The GPU display system could also be used for the future Ultra HD autostereoscopic display.
NASA Astrophysics Data System (ADS)
Rossi, Francesco; Londrillo, Pasquale; Sgattoni, Andrea; Sinigardi, Stefano; Turchetti, Giorgio
2012-12-01
We present `jasmine', an implementation of a fully relativistic, 3D, electromagnetic Particle-In-Cell (PIC) code, capable of running simulations in various laser plasma acceleration regimes on Graphics-Processing-Units (GPUs) HPC clusters. Standard energy/charge preserving FDTD-based algorithms have been implemented using double precision and quadratic (or arbitrary sized) shape functions for the particle weighting. When porting a PIC scheme to the GPU architecture (or, in general, a shared memory environment), the particle-to-grid operations (e.g. the evaluation of the current density) require special care to avoid memory inconsistencies and conflicts. Here we present a robust implementation of this operation that is efficient for any number of particles per cell and particle shape function order. Our algorithm exploits the exposed GPU memory hierarchy and avoids the use of atomic operations, which can hurt performance especially when many particles lay on the same cell. We show the code multi-GPU scalability results and present a dynamic load-balancing algorithm. The code is written using a python-based C++ meta-programming technique which translates in a high level of modularity and allows for easy performance tuning and simple extension of the core algorithms to various simulation schemes.
A Multi-CPU/GPU implementation of RBF-generated finite differences for PDEs on a Sphere
NASA Astrophysics Data System (ADS)
Bollig, E. F.; Flyer, N.; Erlebacher, G.
2011-12-01
Numerical methods leveraging Radial Basis Functions (RBFs) are on the rise in computational science. With natural extensions into higher dimensions, functionality in the face of unstructured grids, stability for large time-steps, competitive accuracy and convergence when compared to other state-of-the-art methods, it is hard to ignore these simple-to-code alternatives. RBF-generated finite differences (RBF-FD) hold a promising future in that they have the advantages of global RBFs but have the ability to be highly parallelizable on multi-core machines. They differ from classical finite differences in that the test functions used to calculate the differentiation weights are n-dimensional RBFs rather than one-dimensional polynomials. This allows for generalization to n-dimensional space on completely scattered node layouts. We present an ongoing effort to develop fast and efficient implementations of RBF-FD for the geosciences. Specifically, we introduce a multi-CPU/GPU implementation for the solution of parabolic and hyperbolic PDEs. This work targets the NSF funded Keeneland GPU cluster, which---like many of the latest HPC systems around the world---offers significantly more GPU accelerators than CPU counterparts. We will discuss parallelization strategies, algorithms and data-structures used to span computation across the heterogeneous architecture.
NASA Astrophysics Data System (ADS)
Fang, Ye; Feng, Sheng; Tam, Ka-Ming; Yun, Zhifeng; Moreno, Juana; Ramanujam, J.; Jarrell, Mark
2014-10-01
Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 ps per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.
NASA Astrophysics Data System (ADS)
Ha, Sanghyun; You, Donghyun
2015-11-01
Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of both incompressible and compressible Navier-Stokes equations. A semi-implicit ADI finite-volume method for integration of the incompressible and compressible Navier-Stokes equations, which are discretized on a structured arbitrary grid, is parallelized for GPU computations using CUDA (Compute Unified Device Architecture). In the semi-implicit ADI finite-volume method, the nonlinear convection terms and the linear diffusion terms are integrated in time using a combination of an explicit scheme and an ADI scheme. Inversion of multiple tri-diagonal matrices is found to be the major challenge in GPU computations of the present method. Some of the algorithms for solving tri-diagonal matrices on GPUs are evaluated and optimized for GPU-acceleration of the present semi-implicit ADI computations of incompressible and compressible Navier-Stokes equations. Supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning Grant NRF-2014R1A2A1A11049599.
NASA Astrophysics Data System (ADS)
Dardikman, Gili; Shaked, Natan T.
2016-03-01
We present highly parallel and efficient algorithms for real-time reconstruction of the quantitative three-dimensional (3-D) refractive-index maps of biological cells without labeling, as obtained from the interferometric projections acquired by tomographic phase microscopy (TPM). The new algorithms are implemented on the graphic processing unit (GPU) of the computer using CUDA programming environment. The reconstruction process includes two main parts. First, we used parallel complex wave-front reconstruction of the TPM-based interferometric projections acquired at various angles. The complex wave front reconstructions are done on the GPU in parallel, while minimizing the calculation time of the Fourier transforms and phase unwrapping needed. Next, we implemented on the GPU in parallel the 3-D refractive index map retrieval using the TPM filtered-back projection algorithm. The incorporation of algorithms that are inherently parallel with a programming environment such as Nvidia's CUDA makes it possible to obtain real-time processing rate, and enables high-throughput platform for label-free, 3-D cell visualization and diagnosis.
Bustamam, Alhadi; Burrage, Kevin; Hamilton, Nicholas A
2012-01-01
Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However,with increasing vast amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses CUDA tool for implementing a massively parallel computing environment in the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU is efficiently lowering the latency time, thus, circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized ELLPACK-R sparse format to allow the effective and fine-grain massively parallel processing to cope with the sparse nature of interaction networks data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on CPU. Thus, large-scale parallel computation on off-the-shelf desktop-machines, that were previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
NASA Astrophysics Data System (ADS)
Sharp, G. C.; Kandasamy, N.; Singh, H.; Folkert, M.
2007-09-01
This paper shows how to significantly accelerate cone-beam CT reconstruction and 3D deformable image registration using the stream-processing model. We describe data-parallel designs for the Feldkamp, Davis and Kress (FDK) reconstruction algorithm, and the demons deformable registration algorithm, suitable for use on a commodity graphics processing unit. The streaming versions of these algorithms are implemented using the Brook programming environment and executed on an NVidia 8800 GPU. Performance results using CT data of a preserved swine lung indicate that the GPU-based implementations of the FDK and demons algorithms achieve a substantial speedup—up to 80 times for FDK and 70 times for demons when compared to an optimized reference implementation on a 2.8 GHz Intel processor. In addition, the accuracy of the GPU-based implementations was found to be excellent. Compared with CPU-based implementations, the RMS differences were less than 0.1 Hounsfield unit for reconstruction and less than 0.1 mm for deformable registration.
NASA Astrophysics Data System (ADS)
Zhou, Han; Zhou, Lu-chun
2013-08-01
In this paper, a real-time on-line performance evaluation processor based on graphic processing unit (GPU) for adaptive optics (AO) system is presented, aiming to monitor the 127-element AO system during its close-loop work by quantifying its correction results, which can provide reference to improve the performance of the system. Since there is a contradiction between the heavy computation burden and the real-time processing requirement, we modified operations and algorithms to fit the CPU-GPU heterogeneous environment, in which GPU is used to handle the complex computation but simple logicality, and CPU is assigned to undertake data transportation between internal storage and video memory，as well as some small-scale computations. In the real-time processor, performance parameters to be computed include peak-valley (PV) and root-mean-square (RMS) of near-field wavefront phase, point spread function (PSF), full width half maximum (FWHM) of far-field image，modulation transfer function (MTF) and Strehl ratio (SR). And the inputs are residual slopes obtained from Hartmann wavefront sensor of 127-element AO system. By computation 4096 frames of parameters, the average rate by single precision is 4.11ms/frame.
A GPU-based High-order Multi-resolution Framework for Compressible Flows at All Mach Numbers
NASA Astrophysics Data System (ADS)
Forster, Christopher J.; Smith, Marc K.
2016-11-01
The Wavelet Adaptive Multiresolution Representation (WAMR) method is a general and robust technique for providing grid adaptivity around the evolution of features in the solutions of partial differential equations and is capable of resolving length scales spanning 6 orders of magnitude. A new flow solver based on the WAMR method and specifically parallelized for the GPU computing architecture has been developed. The compressible formulation of the Navier-Stokes equations is solved using a preconditioned dual-time stepping method that provides accurate solutions for flows at all Mach numbers. The dual-time stepping method allows for control over the residuals of the governing equations and is used to complement the spatial error control provided by the WAMR method. An analytical inverse preconditioning matrix has been derived for an arbitrary number of species that allows preconditioning to be efficiently implemented on the GPU architecture. Additional modifications required for the combination of wavelet-adaptive grids and preconditioned dual-time stepping on the GPU architecture will be discussed. Verification using the Taylor-Green vortex to demonstrate the accuracy of the method will be presented.
Comparison of a 3-D GPU-Assisted Maxwell Code and Ray Tracing for Reflectometry on ITER
NASA Astrophysics Data System (ADS)
Gady, Sarah; Kubota, Shigeyuki; Johnson, Irena
2015-11-01
Electromagnetic wave propagation and scattering in magnetized plasmas are important diagnostics for high temperature plasmas. 1-D and 2-D full-wave codes are standard tools for measurements of the electron density profile and fluctuations; however, ray tracing results have shown that beam propagation in tokamak plasmas is inherently a 3-D problem. The GPU-Assisted Maxwell Code utilizes the FDTD (Finite-Difference Time-Domain) method for solving the Maxwell equations with the cold plasma approximation in a 3-D geometry. Parallel processing with GPGPU (General-Purpose computing on Graphics Processing Units) is used to accelerate the computation. Previously, we reported on initial comparisons of the code results to 1-D numerical and analytical solutions, where the size of the computational grid was limited by the on-board memory of the GPU. In the current study, this limitation is overcome by using domain decomposition and an additional GPU. As a practical application, this code is used to study the current design of the ITER Low Field Side Reflectometer (LSFR) for the Equatorial Port Plug 11 (EPP11). A detailed examination of Gaussian beam propagation in the ITER edge plasma will be presented, as well as comparisons with ray tracing. This work was made possible by funding from the Department of Energy for the Summer Undergraduate Laboratory Internship (SULI) program. This work is supported by the US DOE Contract No.DE-AC02-09CH11466 and DE-FG02-99-ER54527.
Qin, N; Tian, Z; Pompos, A; Jiang, S; Jia, X; Giantsoudi, D; Schuemann, J; Paganetti, H
2015-06-15
Purpose: A GPU-based Monte Carlo (MC) simulation package gPMC has been previously developed and high computational efficiency was achieved. This abstract reports our recent improvements on this package in terms of accuracy, functionality, and code portability. Methods: In the latest version of gPMC, nuclear interaction cross section database was updated to include data from TOPAS/Geant4. Inelastic interaction model, particularly the proton scattering angle distribution, was updated to improve overall simulation accuracy. Calculation of dose averaged LET (LETd) was implemented. gPMC was ported onto an OpenCL environment to enable portability across different computing devices (GPUs from different vendors and CPUs). We also performed comprehensive tests of the code accuracy. Dose from electro-magnetic (EM) interaction channel, primary and secondary proton doses and fluences were scored and compared with those computed by TOPAS. Results: In a homogeneous water phantom with 100 and 200 MeV beams, mean dose differences in EM channel computed by gPMC and by TOPAS were 0.28% and 0.65% of the corresponding maximum dose, respectively. With the Geant4 nuclear interaction cross section data, mean difference of primary proton dose was 0.84% for the 200 MeV case and 0.78% for the 100 MeV case. After updating inelastic interaction model, maximum difference of secondary proton fluence and dose were 0.08% and 0.5% for the 200 MeV beam, and 0.04% and 0.2% for the 100 MeV beam. In a test case with a 150MeV proton beam, the mean difference between LETd computed by gPMC and TOPAS was 0.96% within the proton range. With the OpenCL implementation, gPMC is executable on AMD and Nvidia GPUs, as well as on Intel CPU in single or multiple threads. Results on different platforms agreed within statistical uncertainty. Conclusion: Several improvements have been implemented in the latest version of gPMC, which enhanced its accuracy, functionality, and code portability.
Hsu, Yu-Han H; Ferl, Gregory Z; Ng, Chee M
2013-05-01
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is often used to examine vascular function in malignant tumors and noninvasively monitor drug efficacy of antivascular therapies in clinical studies. However, complex numerical methods used to derive tumor physiological properties from DCE-MRI images can be time-consuming and computationally challenging. Recent advancement of computing technology in graphics processing unit (GPU) makes it possible to build an energy-efficient and high-power parallel computing platform for solving complex numerical problems. This study develops the first reported fast GPU-based method for nonparametric kinetic analysis of DCE-MRI data using clinical scans of glioblastoma patients treated with bevacizumab (Avastin®). In the method, contrast agent concentration-time profiles in arterial blood and tumor tissue are smoothed using a robust kernel-based regression algorithm in order to remove artifacts due to patient motion and then deconvolved to produce the impulse response function (IRF). The area under the curve (AUC) and mean residence time (MRT) of the IRF are calculated using statistical moment analysis, and two tumor physiological properties that relate to vascular permeability, volume transfer constant between blood plasma and extravascular extracellular space (K(trans)) and fractional interstitial volume (ve) are estimated using the approximations AUC/MRT and AUC. The most significant feature in this method is the use of GPU-computing to analyze data from more than 60,000 voxels in each DCE-MRI image in parallel fashion. All analysis steps have been automated in a single program script that requires only blood and tumor data as the sole input. The GPU-accelerated method produces K(trans) and ve estimates that are comparable to results from previous studies but reduces computational time by more than 80-fold compared to a previously reported central processing unit-based nonparametric method. Furthermore, it is at
Real-time dose computation: GPU-accelerated source modeling and superposition/convolution
Jacques, Robert; Wong, John; Taylor, Russell; McNutt, Todd
2011-01-15
Purpose: To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). Methods: The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Results: Source model performance was <9 ms (data dependent) for a high resolution (400{sup 2}) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by {approx}24% to 0.058 and 0.94 s for 64{sup 3} and 128{sup 3} water phantoms; a speed-up of 101-144x over the highly optimized Pinnacle{sup 3} (Philips, Madison, WI) implementation. Pinnacle{sup 3
NASA Astrophysics Data System (ADS)
Qin, Nan; Botas, Pablo; Giantsoudi, Drosoula; Schuemann, Jan; Tian, Zhen; Jiang, Steve B.; Paganetti, Harald; Jia, Xun
2016-10-01
Monte Carlo (MC) simulation is commonly considered as the most accurate dose calculation method for proton therapy. Aiming at achieving fast MC dose calculations for clinical applications, we have previously developed a graphics-processing unit (GPU)-based MC tool, gPMC. In this paper, we report our recent updates on gPMC in terms of its accuracy, portability, and functionality, as well as comprehensive tests on this tool. The new version, gPMC v2.0, was developed under the OpenCL environment to enable portability across different computational platforms. Physics models of nuclear interactions were refined to improve calculation accuracy. Scoring functions of gPMC were expanded to enable tallying particle fluence, dose deposited by different particle types, and dose-averaged linear energy transfer (LETd). A multiple counter approach was employed to improve efficiency by reducing the frequency of memory writing conflict at scoring. For dose calculation, accuracy improvements over gPMC v1.0 were observed in both water phantom cases and a patient case. For a prostate cancer case planned using high-energy proton beams, dose discrepancies in beam entrance and target region seen in gPMC v1.0 with respect to the gold standard tool for proton Monte Carlo simulations (TOPAS) results were substantially reduced and gamma test passing rate (1%/1 mm) was improved from 82.7%-93.1%. The average relative difference in LETd between gPMC and TOPAS was 1.7%. The average relative differences in the dose deposited by primary, secondary, and other heavier particles were within 2.3%, 0.4%, and 0.2%. Depending on source proton energy and phantom complexity, it took 8-17 s on an AMD Radeon R9 290x GPU to simulate {{10}7} source protons, achieving less than 1% average statistical uncertainty. As the beam size was reduced from 10 × 10 cm2 to 1 × 1 cm2, the time on scoring was only increased by 4.8% with eight counters, in contrast to a 40% increase using only
NASA Astrophysics Data System (ADS)
Gonzalez, C. M.; Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.
2010-12-01
As computational modeling became prolific throughout the physical sciences community, newer and more efficient ways of processing large amounts of data needed to be devised. One particular method for processing such large amounts of data arose in the form of using a graphics processing unit (GPU) for calculations. Computational scientists were attracted to the GPU as a computational tool as the performance, growth, and availability of GPUs over the past decade increased. Scientists began to utilize the GPU as the sole workhorse for their brute force calculations and modeling. The GPUs, however, were not originally designed for this style of use. As a result, difficulty arose when trying to find a use for the GPU from a scientific standpoint. A lack of parallel programming routines was the main culprit behind the difficulty in programming with a GPU, but with time and a rise in popularity, NVIDIA released a proprietary architecture named Fermi. The Fermi architecture, when used in conjunction with development tools such as CUDA, allowed the programmer easier access to routines that made parallel programming with the NVIDIA GPUs an ease. This new architecture enabled the programmer full access to faster memory, double-precision support, and large amounts of global memory at their fingertips. Our model was based on using a second-order, spatially correct finite difference method and a third order Runge-Kutta time-stepping scheme for studying the 2D Rayleigh-Benard code. The code extensively used the CUBLAS routines to do the heavy linear algebra calculations. The calculations themselves were completed using a single GPU, the NVDIA C2070 Fermi, which boasts 6 GB of global memory. The overall scientific goal of our work was to apply the Tesla C2070's computing potential to achieve an onset of flow reversals as a function of increasing large Rayleigh numbers. Previous investigations were successful using a smaller grid size of 1000x1999 and a Rayleigh number of 10^9. The
Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua
2014-12-01
Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approa